At the upcoming 41st Annual LocWorld conference, I am excited to be presenting a session on text-to-speech (TTS) entitled “Robot Voices in Multimedia: Adventures in Text-to-Speech.”
The discussion will cover the current TTS landscape (Google, Amazon Polly, Microsoft Azure), its capabilities and limitations, and where TTS is useful and where it is not recommended. It will be part of the technical portion of the conference, which covers new and emerging localization/translation technologies such as TTS. I've covered why to use TTS in an earlier article on eLearning Industry.
The world of TTS is an emerging landscape, with innovations arriving every few months. Interestingly, although TTS is a fairly familiar tool, it gets little attention for its ability, in some situations, to completely replace a voice artist. New technologies add a dimension of “humanity” to TTS voices, making them sound remarkably like people in most cases. They have removed many of the rough edges and can produce convincingly realistic audio wherever you need it, with the ease of copy-pasting a phrase into an application. For example, earlier this year Amazon released Neural TTS for Polly. Its headline feature is a “newscaster style,” which captures more of the natural flow of broadcast news. It is currently available only for English variants (UK, US, AU, IN), but if history is any guide, announcements for other languages will follow as the technology matures.
Besides the history and theory of text-to-speech, our discussion at the conference will cover how to implement these new technologies. Techniques such as SSML (Speech Synthesis Markup Language) are not commonly used in eLearning. They are specific not only to the language but also to the speech engine: Amazon, Google, and Microsoft each have slightly different views on how things are done and which options are available.
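As a taste of what SSML looks like, here is a minimal, hedged sketch using only core SSML tags that the major engines (Polly, Google, Azure) all document; the exact set of supported attributes and values varies by engine and voice, so treat this as illustrative rather than definitive:

```xml
<speak>
  Welcome to the course.
  <!-- Insert a half-second pause before the next sentence -->
  <break time="500ms"/>
  <!-- Slow down delivery for an important instruction -->
  <prosody rate="slow">Please save your work before continuing.</prosody>
</speak>
```

The same markup pasted into different engines can yield noticeably different pacing and emphasis, which is exactly the engine-specific variation described above.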
A simple example of when such a tool helps is disambiguating words that are spelled the same but pronounced differently, such as “bass” the fish and “bass” the musical instrument. Or consider the word “read,” as in “I read the book last week” versus “Did you read that book?” How does the engine tell those apart? SSML provides a vocabulary for controlling pronunciation, and Amazon even offers “part of speech” hints for some English words that prove tricky.
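A rough sketch of how the “bass”/“read” ambiguity can be resolved in Amazon Polly, using its engine-specific `<w role="...">` extension (Google and Azure handle this differently, for example via `<phoneme>` tags, so this fragment is Polly-specific):

```xml
<speak>
  <!-- amazon:SENSE_1 selects the non-default sense: bass the fish -->
  He caught a large <w role="amazon:SENSE_1">bass</w> at the lake.
  <!-- amazon:VBD forces the past-tense pronunciation of "read" -->
  I <w role="amazon:VBD">read</w> the book last week.
</speak>
```

Without the hints, the engine guesses from context, and it does not always guess the way the author intended.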
Other examples: how do you spell out a phone number so it is read as a string of digits, or say “1st” correctly when reading an ordered list? It should come as no surprise that language, as always, is not a simple problem for a machine to solve, and that is very evident in the text-to-speech world.
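Both cases are typically handled with the standard SSML `<say-as>` element; the `interpret-as` values below appear in the Polly, Google, and Azure documentation, though support can vary by voice, so verify against your target engine:

```xml
<speak>
  <!-- Read as individual digits: "five five five, zero one two three" -->
  Call us at <say-as interpret-as="telephone">555-0123</say-as>.
  <!-- Read as an ordinal: "first" -->
  Step <say-as interpret-as="ordinal">1</say-as>: open the course player.
  <!-- Read digit by digit rather than as "two thousand nineteen" -->
  Your access code is <say-as interpret-as="digits">2019</say-as>.
</speak>
```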
Many of these things are taken for granted in speech, and this is where the opportunity lies to scale with TTS. Currently, authors are limited in how precise they can be, and that is where experienced TTS practitioners come into play. Under the right conditions, an engineer can craft the correct sounds quickly, with less time and effort than it would take to book a human voice actor. If the source material changes, or if there is suddenly 10x the work, you can bring in more engineers anywhere in the world to access the engine. If you had to rely on a single voice actor, you would face a scheduling and logistics nightmare on top of the costs, their availability, and their capacity to do all that work; compromises would have to be made.
At Global eLearning, we have been pushing the limits of TTS and surmounting its challenges over the past few years, across hundreds of projects including book summaries, eLearning courses, and videos. We look forward to sharing our expertise with you at the upcoming 41st Annual LocWorld Conference. I hope you will take the time to join us!
If you’d like to have a one-on-one conversation, contact Global eLearning today and let’s schedule a consultation. Or, attend the presentation at the 41st Annual LocWorld conference – I’ll see you there!