In this three-part blog series on Text to Speech (TTS), we explore three major players in Cloud TTS: Google, Microsoft, and Amazon. We take a deep dive into the services offered by each of these platforms and outline the specific selection criteria used to compare them. Today, we’ll focus on the TTS options offered by Microsoft Azure Text to Speech.

Selection Criteria for TTS

To provide a common basis for reviewing the three major Cloud TTS providers, we will use the following criteria. Be sure to see our other blogs for reviews of the other two major cloud providers: Amazon Polly and Google Text to Speech.

Ease of Use
1. How easily can you get into the application and get it to work?
2. What technical skills do you need to do this?
Coverage
1. What voices and voice options are available?
2. What variants and genders are available?
Special Add-ons and Technology Available
1. Neural TTS [high quality]?
2. Special markup tags or other features?
Overall Quality
1. The quality of TTS is very subjective so it is important to have a clearly defined goal so that you can rate the quality when you are comparing options.
For example, if you need Finnish, you might only have one choice: a female voice from Microsoft. But in some cases, such as French, you may have up to 10 options across all three providers, with multiple genders and accents. Your quality rating will depend directly on the goals you originally outlined.

Microsoft Azure Text to Speech

Microsoft Azure Text to Speech provides a reliable, if somewhat less exciting, collection of voices with a very interesting mix of back-end abilities. On this platform you can customize and augment the speech service, which is packaged closely with Speech to Text, Text to Speech, and Speech to Speech. The first two cover transcription and speech generation, respectively. Speech to Speech skips the written step entirely: you talk in one language and it talks back in another. So, while it looks similar to Google’s offering, there are some potential golden nuggets hidden in the architecture.

Ease of Use – Microsoft Azure Text to Speech takes a similar approach to Google when it comes to getting access to the product. Namely, you’ll need to get an API key and be a developer or have access to one. Microsoft is a large, developer-driven organization, so it makes sense that the TTS options on its cloud platform share the same terminology and usage as its sophisticated tools, which are built for developers rather than end users. To get it to work, Microsoft does provide some sample code, but it assumes you know something about C# or Python. So, for the basic user, this is probably not the best option.
Coverage – Microsoft is probably the oldest entry in TTS, and it even has a Windows-based SAPI [Speech API] for screen readers and other software in Windows. The service lists support for 49 languages and 81 voices. It is worth noting that Microsoft offers some languages that are not available on AWS or Google, so in terms of language coverage this is the broadest list. With only 81 voices, though, it is somewhat shallower than the others, meaning that sometimes you have only a single voice to choose from.
Microsoft’s Special Add-ons – The speech service can be trained with additional data for 9 major languages. This opens the door to customization and allows you to take a regular TTS voice and truly model one for yourself. This, however, is “gated technology,” meaning that its potential for abuse and other ethical concerns require you to invest heavily [$100k+ USD] and get approval from Microsoft for your use.
Quality – The 5 Neural TTS voices are a start, but they don’t have the same breadth as the other vendors’ offerings. There remains a lot of variability, but the non-neural voices [most of them] are not as high quality as others since they rely on parametric models, and they are mostly on par with Google’s non-WaveNet offerings. Overall, they are acceptable for cases where no alternative exists.
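
To give a sense of the developer skills mentioned under Ease of Use, here is a minimal Python sketch of a call to the Speech service's REST endpoint. It is not Microsoft's own sample code. The region ("eastus"), the voice name ("en-US-JennyNeural"), the output format, and the function names are all illustrative assumptions, and you would need to supply your own subscription key. The text is wrapped in SSML, the markup language these services use for special tags.

```python
import urllib.request
from xml.sax.saxutils import escape


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US"):
    """Wrap plain text in a minimal SSML document (voice name is an assumption)."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{escape(text)}</voice>'
        "</speak>"
    )


def build_request(text, key, region="eastus"):
    """Build (but do not send) the HTTP request; region is an assumption."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
        "User-Agent": "tts-blog-sketch",  # hypothetical client name
    }
    body = build_ssml(text).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def synthesize_to_file(text, key, path="out.wav", region="eastus"):
    """Send the request and save the returned audio; needs a valid key."""
    with urllib.request.urlopen(build_request(text, key, region)) as resp:
        audio = resp.read()
    with open(path, "wb") as f:
        f.write(audio)
    return path
```

With a real key, calling synthesize_to_file("Hello from Azure", "YOUR_KEY") should write a WAV file to disk. Even a sketch this small involves keys, regions, HTTP headers, and markup, which is why this route suits developers better than basic users.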

We believe your stakeholders and end customers should help guide what’s appropriate and best for your specific application. Sometimes having a voice, even an imperfect one, is better than having none; at other times, it’s worth the additional cost and effort for a studio recording.

If you want to learn more about TTS, voice talent, localization, and translation services, Global eLearning is widely considered to be a leader in TTS, especially for the learning and development industry. Contact us today to get started!

Gilbert Segura is the CTO at Global eLearning. Learn more about Gilbert!