Understanding Automated Speech Recognition


Automated Speech Recognition

Imagine you had gone to the trouble of setting up a training program for how you sell things. You gathered the screenshots and the product images, and you took the time to walk through the what and the why; it took ages. On top of that, you recorded a video, and you got a product manager to actually take the time to walk through the product. You now have hours of recorded video for in-depth sales training.

All of this is perfectly paced and delivered in English, but you want to share it with Germany, France, and China; you are going global with this material. One little detail: the spontaneity and live interaction with the passionate product manager were not scripted. You did it on the fly; it was natural, and a few takes later, you were done. How do you get that done in another language?

This is Transcription 

Traditionally, you get a person to write it all down, put it on a page, and time-code it for you. This is transcription. It is how you capture what was said so that you can translate it and have it ready for the German voice actor to speak. Transcription takes some time and some money, too: you basically have to listen to the entire audio two or more times to make sure you are hearing everything correctly. In our example the audio ran 8 hours; at a safe multiplier [2.5x], that is 20 hours of work.

Transcription into English is actually not that hard or expensive, but it is slow and a little painful. In some cases we skip it and go straight to German or French, but when we want to keep the source text around [what about Italian, or Arabic next year?] we make an English copy first. Those 20 hours also mean roughly 2.5 days of delay in the schedule, so the project slows down while everyone waits for the script.
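The arithmetic above can be sketched in a few lines. This is just an illustration of the estimate in the text; the 2.5x review multiplier and the 8-hour working day are the post's working assumptions, not fixed industry rules.

```python
# Rough manual-transcription effort estimate, using the 2.5x review
# multiplier from the text (an estimate, not a fixed rule).

def transcription_effort(audio_hours, review_multiplier=2.5):
    """Return (work_hours, schedule_delay_days) for manual transcription."""
    work_hours = audio_hours * review_multiplier
    delay_days = work_hours / 8  # assuming an 8-hour working day
    return work_hours, delay_days

hours, days = transcription_effort(8)
print(hours, days)  # 20.0 hours of work, 2.5 days of schedule delay
```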


But wait, there is another solution: Global eLearning uses Automatic Speech Recognition [ASR]!

Speech to Text

Having a computer recognize sounds and turn them into words has come a long way, especially in the last three years. Think of speech to text: Google Assistant, Alexa, or Siri can listen to you talk and piece it together. If you have used such services, you know they are “approximate” at times, not a perfect model. Our ASR solution uses a machine learning model and is likewise mostly accurate; it is not 100%, but it helps immensely. In our case that is OK. We still want to format the lines, make sure they break where it is natural [scene editing], and check that there is not too much being said in a short time, because even with a perfect transcript we still have to make the words fit a line of dialogue properly.

Segmentation

We call this segmentation. It is about making sure that while the English runs at 120 words a minute, the French does not have to attempt 150. You might not know it, but some languages take more letters and sounds to say the same thing. In software that just means broken menus; in a recorded script, it can mean 10 extra minutes of talking with no video to match. The result?
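A pacing check like the one described can be sketched as a small function. The 150 words-per-minute ceiling here is an assumption for illustration, echoing the numbers in the text, not a studio standard.

```python
# Illustrative segmentation check: can a translated line be spoken
# within its time slot at a comfortable rate? max_wpm=150 is an
# assumed ceiling for this sketch.

def fits_segment(text, seconds, max_wpm=150):
    """True if `text` fits in `seconds` at or below max_wpm."""
    words = len(text.split())
    required_wpm = words / (seconds / 60)
    return required_wpm <= max_wpm

# A 10-second segment holding 20 words needs 120 wpm: comfortable.
print(fits_segment("word " * 20, 10))  # True
# The same slot holding 30 words needs 180 wpm: too fast.
print(fits_segment("word " * 30, 10))  # False
```

In practice a segmenter would also account for pauses and scene breaks, but the same length-versus-time comparison is the core of the check.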

The funniest examples are badly dubbed 70s kung fu movies, where you could obviously tell there was far more being said than “I see” or some other simple phrase. Thankfully, those days are long gone; video editing expectations [and practice] have come a long way, and we now work hard to produce scripts that fit the audio. We get those scripts by segmenting them to allow for changes and editing. Adding segmentation to the transcription exercise takes more effort and time, but with the ASR method we combine the two.

Back to our challenge with 8 hours and 20 hours to transcribe…

Time Savings

We use ASR to make a rough pass at 80-90% accuracy; it catches things like “uhmm” and small miscues. No matter what, we still need to listen to the audio and create a good script. The time savings come from combining the steps: the computer model misses some words, so you still listen to the full 8 hours, but while you listen you are also fitting the segmentation, the script, the timing, and the form.

This results in about 10 hours of work: 50% of the time the transcript alone would have taken. In fact, it is a better result with better accuracy from one person, and the output is a ready-to-go script, not just a transcript.
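The comparison works out as follows; this simply restates the post's own figures (8 hours of audio, a 2.5x manual multiplier, and a 10-hour ASR-assisted pass) as arithmetic.

```python
# Comparing the two workflows from the text: manual transcription at
# 2.5x audio length versus a single ASR-assisted editing pass.

audio_hours = 8
manual_hours = audio_hours * 2.5   # traditional transcription: 20 hours
asr_assisted_hours = 10            # ASR draft + one combined edit pass
savings = 1 - asr_assisted_hours / manual_hours
print(f"{savings:.0%} less time")  # 50% less time
```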

At the end of the project we ran through about 800 minutes of transcription and edited it into scripts spending less time than if we had gone through the traditional approach. This is just one of the many small innovations we find daily at Global eLearning.

Get Started Today