In this three-part blog series on Text to Speech (TTS), we explore three major players in cloud TTS: Google, Microsoft, and Amazon. We take a deep dive into the services offered by each of these platforms and outline the selection criteria we use to compare them. In this post, we’ll focus specifically on the TTS options offered by Google Text to Speech.
Selection Criteria for TTS
To give our review of the three major cloud providers a common basis, we will use the following criteria:
- Ease of Use
  - How easily can you get into the application and get it to work?
  - What technical skills do you need to do this?
- Coverage
  - What voices and voice options are available?
  - What variants and genders are available?
- Special Add-ons and Technology Available
  - Neural TTS (high quality)?
  - Special markup tags or other features?
- Overall Quality
  - The quality of TTS is very subjective, so it is important to have a clearly defined goal against which you can rate quality when comparing options.
  - For example, if you need Finnish, you might have only one choice: a female voice from Microsoft. But in some cases, such as French, you may have up to 10 options across all three providers, with multiple genders and accents. Your quality rating will depend directly on the goals you originally outlined.
TTS Options offered by Google Text to Speech
Google has undoubtedly invested a significant amount of talent in neural text to speech (NTTS) and has published the most cutting-edge research on making it sound natural. They are a little slower to make their products available for commercial use, but they do offer a selection of WaveNet voices built on that research. Note, however, that the research came out in 2016. As for the back end of the TTS options offered by Google Text to Speech, it’s all “back end.” Users need an API key, an authenticated service user, some tokens, permissions, and a bit of code to get started. It is offered as a developer’s tool rather than a streamlined user-facing platform. Unless you know how to code or have someone who does, it’s off-limits to the casual user.
Google’s demo should not be used for production purposes (that is against the Terms of Service), but it does show you the basics of how the audio sounds and which voices are available.
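To give a sense of what “a bit of code” means in practice, here is a minimal Python sketch that builds a request for the Cloud Text-to-Speech REST endpoint using only the standard library. It assumes an API key alone is enough to authenticate in your project. The function names, the voice name `en-US-Wavenet-D`, and the output file name are illustrative choices, not anything prescribed by Google. Treat it as a starting point rather than a production client.

```python
import base64
import json
import urllib.request

# REST endpoint for Cloud Text-to-Speech synthesis (v1).
SYNTHESIZE_URL = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(text, api_key, language_code="en-US",
                             voice_name="en-US-Wavenet-D"):
    """Build (but do not send) a POST request asking for MP3 audio.

    The default language and voice are illustrative assumptions.
    """
    payload = {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }
    return urllib.request.Request(
        SYNTHESIZE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "X-Goog-Api-Key": api_key,  # assumes API-key authentication
        },
        method="POST",
    )


def decode_audio(response_body):
    """Extract raw audio bytes from the JSON response body.

    The service returns the audio as base64 text under "audioContent".
    """
    return base64.b64decode(json.loads(response_body)["audioContent"])


def synthesize_to_file(text, api_key, out_path="sample.mp3"):
    """Send the request and write the returned audio to out_path."""
    request = build_synthesize_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        audio = decode_audio(response.read())
    with open(out_path, "wb") as handle:
        handle.write(audio)
    return out_path
```

Calling `synthesize_to_file("Hello from the cloud.", "YOUR_API_KEY")` should write an MP3 you can play back, provided the key is valid and the API is enabled for the project.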
- Ease of Use: This is where our harshest criticism lies for both Google and Microsoft when compared with Amazon’s user-friendly interface: a user interface just doesn’t exist. The demo space is the closest thing to an interface for sampling the voices, and it is what I recommend for that purpose. Ideally, I want to copy and paste sample text to see whether the service is worth the investment in terms of coverage and quality.
- Coverage: Google has 32 languages and 187 total voices, which are listed on their reference page. There is a mix of WaveNet (neural) and regular voices, and the quality difference between the two is very apparent. It’s also worth noting that WaveNet costs about four times as much as the regular voices. Again, WaveNet voices are not available for every language and, as of this writing, the languages that have them all include one female voice, but coverage varies.
- Google’s Special Add-ons: The additional features all relate to adding media to the playback, which can be interesting when you have background audio or video elements and want to mix tracks. An audio editing tool is probably easier for most users, but this type of code-based mixing can be useful for those looking to scale prompts or other automated messages.
- Quality: Google’s WaveNet voices are very well received in multiple languages. There are 95 WaveNet voices, covering all languages except European Spanish, which seems odd as Spanish is the second most spoken language in the world today.
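Because the language and voice counts above change over time, it can help to tally coverage yourself. The Python sketch below counts voices per language, family, and gender from the kind of list the API’s voice-listing endpoint returns under its `voices` key. The sample entries are made up for illustration and are not real output. Detecting WaveNet voices by looking for "Wavenet" in the voice name is an assumption about the naming pattern.

```python
from collections import Counter

# Illustrative entries only, shaped like items in a voices.list response.
SAMPLE_VOICES = [
    {"name": "en-US-Wavenet-D", "languageCodes": ["en-US"], "ssmlGender": "MALE"},
    {"name": "en-US-Standard-C", "languageCodes": ["en-US"], "ssmlGender": "FEMALE"},
    {"name": "es-ES-Standard-A", "languageCodes": ["es-ES"], "ssmlGender": "FEMALE"},
]


def tally_voices(voices):
    """Count voices per (language code, voice family, gender).

    The family is guessed from the voice name, which is an assumption
    about the naming pattern (for example "en-US-Wavenet-D").
    """
    counts = Counter()
    for voice in voices:
        family = "WaveNet" if "wavenet" in voice["name"].lower() else "Standard"
        for code in voice["languageCodes"]:
            counts[(code, family, voice["ssmlGender"])] += 1
    return counts


def languages_missing_wavenet(voices):
    """Return the sorted language codes that have no WaveNet voice."""
    counts = tally_voices(voices)
    all_codes = {code for code, _, _ in counts}
    wavenet_codes = {code for code, family, _ in counts if family == "WaveNet"}
    return sorted(all_codes - wavenet_codes)


if __name__ == "__main__":
    for key, total in sorted(tally_voices(SAMPLE_VOICES).items()):
        print(key, total)
    print("No WaveNet voice:", languages_missing_wavenet(SAMPLE_VOICES))
```

Running the same two functions over the real voice list gives a quick view of which languages and genders you would be choosing from before you invest time in listening tests.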
If you want to learn more about TTS, voice talent, localization and translation services, Global eLearning is widely considered to be a leader in TTS, especially for the learning and development industry. Contact us today to get started!
Gilbert Segura is the CTO at Global eLearning.