“Text to speech systems (TTS) were first developed to aid the visually impaired by offering a computer-generated spoken voice that would “read” text to the user,” according to Webopedia. With today’s rapidly growing technology, TTS is quickly becoming a lower-cost alternative to voiceovers. While voice actors are often subject to availability and price, TTS robots are always available, with options and pricing that adapt by the minute. TTS is always improving, adding voices, genders, and features. This constant evolution is also changing the landscape of major players in Cloud TTS. In this blog series on TTS, we outline the selection criteria for TTS and take a deep dive into the services offered by three of the major Cloud TTS vendors. In this blog, we’ll focus specifically on the Text To Speech options offered by Amazon Polly. In upcoming blogs, we’ll compare these features to those offered by Microsoft and Google. Let’s explore some basic criteria for evaluating the Cloud TTS options offered by Amazon Polly.
Selection Criteria for TTS
Be sure to see our comparisons of the other two major TTS providers, Google and Microsoft Azure, in other blogs. To provide common criteria for reviewing the major Cloud TTS vendors, we will use the following guidelines:
- Ease of Use
  - How easily can you get into the application and get it to work?
  - What technical skills do you need to do this?
- Coverage
  - What voices and voice options are available?
  - What variants and genders are available?
- Special Add-ons and Technology Available
  - Neural TTS [high quality]?
  - Special markup tags or other features?
- Overall Quality
  - The quality of TTS is very subjective, so it is important to have a clearly defined goal against which you can rate quality when you are comparing options.
  - For example, if you need Finnish, you might only have one choice: a female voice from Microsoft. But in some cases, such as French, you may have up to 10 options across all three providers, with multiple genders and accents. Your quality rating should map directly to the original goals you outlined.
In our opinion, the TTS options offered by Amazon Polly are the easiest to use. Polly provides a TTS engine with a web-based console as part of the standard Amazon Web Services (AWS) offering. As with Google and Microsoft, Amazon’s Text to Speech is related to its voice-enabled assistant, in this case Alexa. Most, but not all, of Alexa’s speech capabilities are available in Polly. It’s also important to remember that Alexa has an entire programmatic side that has no relation to TTS, so while some of Alexa’s improvements show up in Polly, the two are not the same. As with most US-based companies, English gets the most features.
1. Ease of use: This really is the easiest TTS engine to use of the three. You log in with an AWS account, select the Polly application, and begin. It is also the only vendor that has an integrated Text-to-Speech interface [figure 1]. The interface lets you pick a language and a type of voice, then download the audio. If you don’t need to modify anything, this is perfect. Unfortunately, not all languages or TTS tasks are that straightforward.
The SSML (Speech Synthesis Markup Language) tab allows special markup to be added. However, once you are adding that type of markup, you have to do more than just copy and paste. The interface is really the main differentiator between the two approaches we see in TTS: the consumer approach, where you can just paste your text and go, and the more labor-intensive method, which requires you to check that the output is actually being produced properly.
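Outside the console, the same two approaches map onto the two input types Polly accepts. The sketch below is only an illustration, not part of the console workflow described above: `build_request` and `synthesize_to_file` are hypothetical helper names, the voice `Joanna` and the MP3 output are arbitrary choices, and the second function assumes the third-party `boto3` package and configured AWS credentials.

```python
def build_request(text, voice_id="Joanna", ssml=False):
    """Build keyword arguments for Polly's synthesize_speech call.

    With ssml=False the text is sent as-is (the "paste and go" approach).
    With ssml=True the caller supplies markup, which is wrapped in the
    <speak> root element that SSML input requires.
    """
    return {
        "Text": f"<speak>{text}</speak>" if ssml else text,
        "TextType": "ssml" if ssml else "text",
        "VoiceId": voice_id,       # assumption: any voice available to your account
        "OutputFormat": "mp3",
    }


def synthesize_to_file(request, path):
    """Send the request to Polly and save the audio (needs boto3 and AWS credentials)."""
    import boto3  # imported here so build_request works without the AWS SDK

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**request)
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


plain = build_request("Welcome to the course.")
marked_up = build_request('Welcome <break time="300ms"/> to the course.', ssml=True)
```

The plain request needs no review beyond the text itself, while the marked-up one is where listening to the result and adjusting the tags becomes part of the job.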
2. Coverage: Amazon Polly provides a variety of voices in multiple languages for synthesizing speech from text. At the time of this writing, Polly has 93 voices in 29 languages. Here is a list of voices in Amazon Polly, with a link to the technical documentation on their supported vocabularies.
3. Amazon’s Special Add-ons: Amazon Polly provides a relatively recent set of additions:
- Neural “news reader” style for English and US Spanish that gives the voice a more natural-sounding cadence/rhythm. A new “conversational” format was also recently added just for English.
- Compression – here the voice can be programmed to fit within a certain duration. Tell it to say the text in 10.5 seconds and the speech will compress to fit, without sounding like a chipmunk but maybe a bit like an auctioneer.
- Bilingual voices – for Hindi, if you have a text that switches between English and Hindi, you don’t have to worry that it will try to pronounce one as the other. This is useful because you would otherwise have to mark up each part of the text with its own language.
- Other add-ons appear frequently throughout the year. These changes are typically driven by innovations and users on the Alexa side of things, but they sometimes cross over.
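In SSML, the news reader style and the duration limit from the list above are expressed as tags around the text. The helpers below are a minimal sketch with hypothetical function names; the tags themselves (`amazon:domain` and the `amazon:max-duration` attribute on `prosody`) come from Polly's SSML reference, and which voices and engines accept each one should be checked there before relying on them.

```python
from xml.sax.saxutils import escape


def news_style(text):
    """Wrap text in Polly's news reader speaking style."""
    return f'<amazon:domain name="news">{escape(text)}</amazon:domain>'


def fit_duration(text, seconds):
    """Ask Polly to speed the text up, if needed, to fit within the given duration."""
    milliseconds = int(round(seconds * 1000))
    return f'<prosody amazon:max-duration="{milliseconds}ms">{escape(text)}</prosody>'


def speak(*parts):
    """Join SSML fragments under the required <speak> root element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(fit_duration("This offer ends at midnight tonight.", 10.5))
```

Plain text is escaped before it is wrapped, so characters such as `&` do not break the markup.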
4. Quality – Here we don’t have a clear-cut winner. The audio sounds good, but sometimes words are mistaken for their homonyms: tricky words that are spelled the same but mean different things. This is a common problem in many languages, since the context doesn’t always make it clear whether a word means a musical instrument or a fish [Bass vs Bass]. In English, Polly can be told which sense of the word to use, and in any language you can mark up the pronunciation in SSML. Another area where we have seen feedback is Japanese, in particular the pauses between words. How you say things depends on spacing the words properly with small breaks, and this again can be fixed in SSML.
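Both fixes mentioned above can be sketched as small SSML helpers. The function names are hypothetical and the 200 ms default pause is an arbitrary starting point to tune by ear; the `w` tag with the `amazon:SENSE_1` role (which selects a word's non-default sense) and the `break` tag are taken from Polly's SSML reference.

```python
from xml.sax.saxutils import escape


def alternate_sense(word):
    """Mark a word so Polly uses its non-default sense and pronunciation."""
    return f'<w role="amazon:SENSE_1">{escape(word)}</w>'


def with_breaks(phrases, pause_ms=200):
    """Join phrases with a short explicit pause between each pair."""
    pause = f'<break time="{pause_ms}ms"/>'
    return pause.join(escape(phrase) for phrase in phrases)


fish = "<speak>He caught a " + alternate_sense("bass") + " in the lake.</speak>"
paced = "<speak>" + with_breaks(["First phrase", "second phrase"], pause_ms=300) + "</speak>"
```

Whether the marked sense or the pause length is right still has to be confirmed by listening, ideally with a native speaker for languages such as Japanese.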
A general note on quality and TTS: Some languages are well received, while others are met with heavy criticism, sometimes for the same voice selection on different occasions. In some languages TTS is just not accepted as readily, and in others the voice quality is considered poor. So it’s always a good idea to know your audience and determine whether they find it acceptable.
If you want to learn more about TTS, voice talent, localization and translation services, Global eLearning is widely considered to be a leader in TTS, especially for the learning and development industry. Contact us today to get started!
Gilbert Segura is the CTO at Global eLearning. Learn more about Gilbert!