In the complicated process of creating audio scripts for learning and development, the special requirements for multilingual versions are often overlooked or put on the back burner. Synchronizing content means getting the voice actor’s audio more or less in time with the action on the screen. For an animated eLearning module, it’s not very hard to have a scene and text play out with good timing. In most eLearning with an audio timeline, the hardest part is shifting all the other timings to make things work. If a bit of audio takes 1.5 minutes instead of 1.2, there’s a slider for that. But when you have live-action video with someone talking, it gets more complicated. In this piece, I will examine the differences between lip syncing and phrase syncing.

First, let’s look at the situations where synchronizing content becomes important. Let’s say that as a learning and development professional, you’re tasked with creating content for training. After developing the content in English, meeting with stakeholders and subject matter experts, gathering assets like b-roll video and audio tracks, and setting up timing and sequence, you’re finally ready to distribute the content. But now you need to convert that content for employees in your French or German locations. How will the messaging, content, and flow need to change, and will you have the time and resources to commit to this often complex additional work?

One area that often gets overlooked is just how the voiceover material will be conveyed and comprehended in other languages. In a previous blog, we discussed the topic of language expansion. Language expansion happens when you translate from English to another language. This process typically ‘expands’ the content, making it 20 to 30% longer. When it comes to written text, you can adjust font sizes, text boxes and images, among other things, to make up for the expansion. With voiceover, however, the challenge is harder, and the tighter the sync, the more difficult it is to make it work. Let’s explore lip syncing and phrase syncing.
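To put rough numbers on language expansion for voiceover, here is a minimal Python sketch. It assumes spoken duration grows in proportion to the text, which is a simplification: real recordings depend on the language pair, the speaker's pace and the script itself. The function name and the 60-second example are hypothetical, and only the 20 to 30% range comes from the text above.

```python
def expanded_duration_range(source_seconds, low=0.20, high=0.30):
    """Estimate how long a translated voiceover might run.

    Assumption: audio length scales with text length, so a 20-30%
    text expansion means roughly 20-30% more recording time.
    """
    if source_seconds < 0:
        raise ValueError("source_seconds must be non-negative")
    return source_seconds * (1 + low), source_seconds * (1 + high)


# Hypothetical example: a 60-second English narration.
shortest, longest = expanded_duration_range(60.0)
print(f"Plan for roughly {shortest:.0f} to {longest:.0f} seconds of translated audio")
```

An estimate like this is only a planning aid for spotting which scenes are likely to overrun before the script goes to translation.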

Phrase Syncing

This is the predominant way of making audio match a given text and time. Phrase syncing is relatively inexpensive since you only have to get “most” of the screen time correct–viewers expect that the person on screen may not actually be speaking the language they hear. Documentaries where someone famous is talking usually start out with the original audio, but within the first few seconds you start hearing the voice actor. This is sometimes called “BBC Style” narration, where you keep some of the original at the start and gradually overlay the version the local audience understands.

The synchronization is not accurate to the millisecond, but it’s close enough for the subject matter and is much more forgiving when you have the inevitable text expansion. One phrase may take 3 words in English but 7 or more in Spanish, which means the translated audio runs too long for its slot. Some multimedia engineers can play with the audio to shorten it a little, or stretch the video–but these tricks only work in small amounts. We usually have to write the script with fewer words–and balance that with still saying the same things.
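The decision described above (leave the audio alone, speed it up a little, or go back and trim the script) can be sketched as a simple check. This is an illustration only: the 10% speed-up limit is an assumed threshold, not an industry rule, and the function name and labels are hypothetical.

```python
def fit_translated_audio(slot_seconds, audio_seconds, max_speedup=1.10):
    """Suggest how to fit a translated recording into its on-screen slot.

    max_speedup is an assumed tolerance: 1.10 means the audio may be
    played at most 10% faster before it starts to sound unnatural.
    Returns (action, speedup_needed).
    """
    if slot_seconds <= 0:
        raise ValueError("slot_seconds must be positive")
    speedup_needed = audio_seconds / slot_seconds
    if speedup_needed <= 1.0:
        return "fits", speedup_needed
    if speedup_needed <= max_speedup:
        return "time_compress", speedup_needed
    return "shorten_script", speedup_needed


# The example from the opening: 1.5 minutes of audio for a 1.2-minute slot.
action, factor = fit_translated_audio(slot_seconds=72, audio_seconds=90)
print(action, round(factor, 2))
```

With those numbers the audio would need to play 25% faster, well past the assumed tolerance, so the sketch points back to the script rather than to the audio engineer.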

If you have ever read movie subtitles and noticed they are not 100% accurate, that is because of trade-offs: sometimes the full text simply cannot be read in the time it is on screen. There is some creative liberty in the movie business, and there are many rounds of consultation for a theatrical release. For an eLearning video, we have rounds of review before the recording to make sure that any edits or changes are approved. If the script is altered, the change has to be approved beforehand.

Lip Syncing

This method of audio recording is much more complicated than just getting the timing right–you have to get the mouth movements right. Each spoken word shapes the speaker’s mouth in a particular way; an “O” is obviously not an “M,” and this adds complexity to the process. Typically, the cost is much higher on the script creation side–and it’s really no longer a translation. It’s more creative, since you have to substitute the right words to make the sounds match. There is some software that can adjust the face on screen, but this process is also very time-consuming.

A specialized script needs additional reviews and input, so producing this content takes longer; it’s more art than science. The voice actor, ironically enough, doesn’t have to work much harder on this type of material. All the hard work is done either at the script creation stage or in post-production, where each mouth movement is timed to within a few frames so that it doesn’t look off.

If lip sync sounds highly specialized, slow and expensive, it’s because it is. By comparison, a tight phrase sync will satisfy the majority of requirements at around half the time and effort.

In summary, the two types of voice recordings for video are governed by the script and the source material. Phrase sync is more flexible and generally acceptable. Lip sync is for higher production budgets and longer timelines, and, as you can see, it can basically be a production in itself.

If you need help with localizing and synchronizing content, contact the L&D experts on localization at Global eLearning.com today or download our Free Guide to Localization.

Matt Patterson is the VP of Global Sales. Learn more about Matt!