Tips on audio localization: synthetic vs. real voices

By Ben Warren May 23, 2012

The proliferation of games and elearning over the past few years has had a dramatic effect on localization companies. Localization companies, including our own, traditionally translated training information in documentation form, usually in enormous quantities. The rise of multimedia training has meant that they have had to adapt to a new way of thinking and of working. Nowhere has this been seen more than in the rise of audio-related projects.

Over the past five years alone, the number of audio projects included in translation project requests to localization companies has increased year on year. Now multimedia projects composed of audio, video or any other motion-capture formats need a dedicated full-time team to ensure that clients receive the expertise and professionalism they would expect from any straightforward translation project. If the golden tenets of traditional translation have always been to stick to the source material, keep the translation memory (TM) clean and preserve the voice of the author, when it comes to audio projects these rules are there to be bent. And none more so than in the first choice any client has to make: human vs. machine voices. So let’s see what that choice entails.

The humans are dead

The first thing to bear in mind in any elearning audio project, be it with synthetic or real voices, is how the majority of elearning projects are created. Most learning content management systems (LCMS) are designed the same way. There are visual files — maybe Flash-based, Captivate-based or simple videos — and then the audio files. The LCMS collates all these individual files and pulls them together to be displayed or heard based on a specific naming system.

This means a localization company working on audio projects needs to decide how to cope with hundreds of individual audio files usually only a few seconds long. Straightaway you can see the potential costs and engineering time needed for audio tasks like these, and to some extent, that is where synthetic voices can be useful.

Synthetic voices have been around for years. Indeed, the first machine to create a “real” synthesized voice was built in the 1950s. By 1961, scientists had managed to program a machine that could “sing” the song “Daisy Bell,” a feat that just happened to have been seen by the author Arthur C. Clarke, which in turn inspired him to use it in his novel 2001: A Space Odyssey to represent the moment when the all-powerful computer Hal slowly reverts back to its own creation. The scene in Kubrick’s film version where Hal sings “Daisy, Daisy, give me your answer do,” was the most famous representation of synthetic voices for many years. However, today, your phone can tell you in a soothing voice if you need to buy milk. How technology moves on.

The basic science behind the tools is known as Text-To-Speech or TTS for short. This is the starting point for all synthesized voices. In simple terms, the machine developed in the 1950s, like the software tools that have succeeded it and that run on an ordinary PC or over the internet, takes the written word and turns it into an audio representation. Today the major players in TTS can offer both male and female voices in all major languages and, sometimes, dialects. This instantly gives a company access to a wide range of markets without having to source voices in different countries.

But while the basic idea might be easy to follow, the technology behind it and the rules governing it are understandably more complex. Especially when you have to start thinking about phonetics. For those involved at the cutting edge of TTS, years of study comprising both phonetic language skills and intelligent programming are needed. There is slightly more to it than simply typing in some words and pressing record. So let’s see how a localization company needs to think about TTS in practical terms.

As previously mentioned, the first major benefit of TTS is its suitability to handle large numbers of files in batches. If your elearning module has 250 separate audio files, each of them just a few words long, then TTS can be an interesting starting point. Most TTS tools allow the user to create multiple input files as basic text files. This means that users can edit and double check all of their text before they create the audio files themselves. Once they are happy with the text, they can simply batch process the input files, which will create audio files of the same name, using the text in each file. For those projects where time, budget and engineering abilities are a constraint, this way of working could be an interesting alternative to using real people. The end user generates hundreds of files, already named, with the needed text. No third party is required, the process is quick and easy, and many files can be produced and engineered by almost anyone with a PC and a synthetic voice.
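The batch workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `synthesize` callable stands in for whatever TTS tool is actually used (no particular product or API is implied), and the `.txt` input and `.wav` output extensions are hypothetical choices.

```python
from pathlib import Path
from typing import Callable


def batch_synthesize(
    input_dir: Path,
    output_dir: Path,
    synthesize: Callable[[str], bytes],  # hypothetical stand-in for a real TTS call
    audio_ext: str = ".wav",             # assumed output format
) -> list[Path]:
    """Create one audio file per text file, keeping the same base name."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for text_file in sorted(input_dir.glob("*.txt")):
        text = text_file.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty scripts rather than produce silent files
        # intro_01.txt becomes intro_01.wav, so the LCMS naming system is preserved
        audio_path = output_dir / (text_file.stem + audio_ext)
        audio_path.write_bytes(synthesize(text))
        written.append(audio_path)
    return written
```

Because the text lives in plain files, it can be edited and double-checked before any audio is generated, and the whole batch can be rerun after a correction.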

Say what you see

As mentioned before, however, experts in TTS have years of training, a huge wealth of linguistic and phonetic experience, and can “see” language in another way. So the main problem is, have you got someone in your organization who thinks the same way? If you want “perfect” speech, or a voice that pronounces words in the way you think they should be pronounced, then you will need a linguistic expert to work through the input text to tweak and shape it. In an ideal world there would be someone who is fully conversant with phonetics, and can type using phonetic characters to ensure all words spoken sound correct. Pronunciation is probably the biggest engineering issue with synthetic voices. For example, in a typical elearning module, you might expect to hear the word “module” more than once, perhaps in the sentence: “In this module, we will learn…” In one project I can think of, switching between two synthetic voices, one male, one female, meant we could hear that the voices pronounced the word differently. Ideally, at this point your phonetic expert would step in and rewrite the code to make sure the sounds are changed. But what do you do if you don’t have a phonetic expert in the language you happen to be working in? Paying an outsider or freelancer would be expensive and negate any cost savings made by using TTS in the first place.

One of the most common solutions therefore is to “cheat” by creating new words. In this case, changing the way “module” was written (to mod-yule in simple typing terms) would make it sound correct with one voice, while with the other voice a simple case of playing with the intonation using punctuation might do the trick. Even so, it is this sort of fine tweaking that really can only be done by a native speaker of the language, and this is probably the biggest drawback to synthetic voices. The technology is undoubtedly superb, and getting better all the time. If speed of production, quantity and (sometimes) price are more important than overall quality of sound, then TTS is a perfectly adequate solution for a company searching for multiple audio files. Just bear in mind it’s not always as simple as typing, listening and signing it off. And what about those instances where you have time restrictions on the audio files? Where you need to sync to an image or cue point? While TTS does have the ability to alter the speed of the voice talking, the level of programming involved, and therefore the time needed to produce files and check them, doubles at a stroke. So beware! This leads us nicely to what is probably the main drawback of TTS: it might be smarter, it might be faster and it might even sometimes be cheaper, but it still isn’t a real person!
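The respelling “cheat” can be kept out of the master script by applying it per voice just before the text goes to the TTS tool. The following is a minimal sketch, not a description of any particular product: the voice names `voice_a` and `voice_b` and the table layout are hypothetical, and the only respelling taken from the example above is mod-yule.

```python
import re

# Hypothetical per-voice respelling table; only "mod-yule" comes from the example above.
RESPELLINGS = {
    "voice_a": {"module": "mod-yule"},  # this voice mispronounces the plain spelling
    "voice_b": {},                      # this voice needs no respelling
}


def apply_respellings(text: str, voice: str, table=RESPELLINGS) -> str:
    """Replace whole words with their respelled form for the given voice."""
    for word, respelled in table.get(voice, {}).items():
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)

        def swap(match, respelled=respelled):
            # Keep a leading capital so sentence-initial words still look right
            if match.group(0)[0].isupper():
                return respelled.capitalize()
            return respelled

        text = pattern.sub(swap, text)
    return text


print(apply_respellings("In this module, we will learn", "voice_a"))
```

Keeping the respellings in a table means the translated script itself stays clean, and a native speaker only has to review the table rather than every input file.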

Mankind fights back

So if synthetic voices are not the choice for you, then what are your options? Well, if synthetic speech started in the 1950s, real human voices can at least claim to have been around for a little bit longer, so that’s one point for them. It also means there are no limitations. If you want a male or female voice in any language, with any accent and in any dialect, it is possible to make it happen. If you want passion, comedy, anger or any emotion at all, you can make it happen. And even if you want it to sound like someone famous is narrating your elearning module, it can happen. So long as you pay for it, and pay well.

But, of course, first you have to find the voice talent. Today anyone with a microphone can claim to be a voiceover artist or recording voice talent. It doesn’t necessarily mean you would want to listen to them. When it comes to real voices, perhaps the most important aspect to consider is the role of the vendor manager and quality assurance (QA) team from the very beginning. From my experience, it can take years of work and negotiation to find talented, readily available voice artists to add to a professional roster.

Obviously the QA process of voice talents is vital. Any potential clients wanting to create an elearning module in a new language need to have a selection of voices from which to choose. They want to feel that they have the final say over the tone and style of voice being used, rather than just have a voice thrust upon them that might not always suit the end product. And that is why it is vital that you should offer them a selection of top quality voices. My recommended approach has always been to try to have the voices record a sample piece of text of my choosing, rather than use the voice talent’s own show reels. This allows potential clients to listen from a level playing field of samples. Of course, it also means that it is necessary to have a strict vetting process to ensure that only the best quality voices are in the database. Voices selected must be of high quality in richness of sound and energy, and also from an engineering point of view. When you combine the selection of voices, male and female, across all major and most minor languages, then the scale and importance of the voice database becomes obvious. I have found a balance between using standalone voice talents in their own studios, and partnering with recording studios around the world, which has allowed us access to professional voices who might never have considered working on small or medium-sized elearning projects. Instead of a dry presentation style, we have been able to offer voices with the experience of working on major films or advertising campaigns. Although, of course, if dry presentation style is what is needed, then you should be able to offer that too.

However, even when the voice has been chosen, there is still the matter of the large number of files. If the benefit of synthetic voices was the ability to produce multiple small files easily, surely the cost and time of doing it for real would be prohibitive. It can work something like this: for every minute spent in the studio, by the time an engineer has opened a new file, processed it, post-edited it, named it and saved it, the voice talent has recorded a grand total of 16 words. Extrapolate that across hundreds or even thousands of files or modules, and the scale of the task, and its potential cost, becomes clear. Our response to this issue was to develop an internal process and editing tool called Echo in order to help us record and post-process the hundreds of audio files needed for each project quickly and reliably (Figure 1). As all voices in the database have been tested on the process, a realistic (and affordable) pricing structure, as well as potential deadlines, can be proposed to clients, which in turn means a simple and effective working practice for the voice talents themselves. The man hours needed to process, QA, and in many cases, upload and check, hundreds of small audio files were always going to be prohibitive. Finding the balance between using technology and a process that cuts down unneeded time-consuming tasks has really been the key to the growth in this sector.
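To see how the 16-words-per-minute figure scales, here is a rough back-of-the-envelope calculation. The 250 files echo the earlier elearning example; the eight words per file is purely an assumption for illustration, as is the function name.

```python
import math


def studio_minutes(total_words: int, words_per_minute: float = 16.0) -> int:
    """Estimate studio time, rounded up to whole minutes, at the unassisted rate."""
    return math.ceil(total_words / words_per_minute)


files = 250          # module size used earlier in the article
words_per_file = 8   # assumption: "just a few words" per file
total_words = files * words_per_file

print(studio_minutes(total_words))  # 2000 words / 16 words per minute -> 125
```

On those assumptions a single small module already takes more than two hours of studio time per language before any pickups, which is why streamlining the per-file handling matters so much.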

Another major benefit of using real voices is their flexibility. For localization companies working on elearning projects or videos, it is rare indeed that it is necessary to actually create any product from scratch. Usually the visuals and concept already exist and just the audio or language has to change. This has its own problems in that in some cases it is not possible to alter the length or speed of the visuals. If the video to be worked on is 30 minutes long, with visuals triggered by the text, then it is not simply a case of recording new audio and dropping it in. Now you are in the world of time restrictions and sync points. So if you cannot change the time, speed or length of the visuals, what can you do? Put simply, you can instruct your real voices to speak to a certain length, or in extreme cases even translate and rework the text to fit. While this is possible with some TTS tools (you can change voice speed to fit the restrictions), you can become distracted with an overly mechanical voice, altered again for speed. Again, think of Hal in 2001, or try insulting your phone until it shouts abuse at you to see what I mean.

In reality, this means a lot of preparation and work up front before recording even starts to make sure everything works out. In most projects with restrictions, these things are discussed, worked out and solved before translation even begins. If a script is prepared properly from day one, then the amount of time needed on pickups is drastically reduced, also helping to save time and keep costs down. Real voice talents have the flexibility to keep to the time restrictions, rework text if needed, and have — at least if they have been vetted properly — the ability to think on their feet and make any changes needed to make things work.
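One way to do part of this preparation is to check a translated script against its time slots before anyone enters the studio. The sketch below is an assumption-laden illustration: the `Segment` structure, the function names and the 160-words-per-minute comfort threshold are all hypothetical, and counting words by splitting on spaces is only a rough guide (it does not work for languages written without spaces).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str            # audio file or cue name
    text: str            # translated script for this segment
    slot_seconds: float  # time available before the next sync point


def required_wpm(segment: Segment) -> float:
    """Speaking rate, in words per minute, needed to fit the text into its slot."""
    if segment.slot_seconds <= 0:
        raise ValueError(f"{segment.name}: slot must be longer than zero seconds")
    words = len(segment.text.split())
    return words * 60.0 / segment.slot_seconds


def flag_tight_segments(segments, max_wpm: float = 160.0):
    """Return the names of segments that would need a faster rate than max_wpm."""
    return [s.name for s in segments if required_wpm(s) > max_wpm]
```

Segments flagged this way are candidates for reworking the translation up front, rather than discovering the problem in the studio and paying for pickups.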

This flexibility is also useful when thinking about retakes or last-minute changes that the client wants. For all audio projects, I would recommend the creation of a pronunciation guide. This is a validated set of terms or phrases with clear instructions from the client on how to say a particular word, name or acronym. Having this in place before recording starts is also a good way to avoid unnecessary pickups. However, for cases where there simply isn’t time to produce a validated pronunciation guide, then one thing that many of the voice talents I work with can offer is alternative pronunciations for certain terms. This simply means that the voice talents produce two or three takes for a single file, with all variations of how a certain name or term could be pronounced. The client can then simply choose the preferred pronunciation, and it isn’t necessary to go back into the studio for the sake of a few word pickups.

The key to success in any audio project is like any other project: planning ahead. The more information you can provide to voice talents before they start, and the more preprocessing you can do to an audio script to make it watertight for timings, the fewer problems you will have once your audio files return. The cost of pickups is high because studio time doesn’t come cheap, so it is worth the extra effort at the start of a project to avoid them.

Sounds like a good choice

So after all that, the question still remains: what method should you use for an audio project? To make a decision, think about who the end audience is. If this is an elearning module for training a handful of staff about how to use the advanced features of the new accountancy software developed by research and development, then you probably don’t need to have it narrated by Martin Sheen in the style of Apocalypse Now. Synthetic voices are probably the fastest, cheapest route for this kind of project. If the overall quality isn’t perfect, it’s not a major issue, as the end product will only be seen in-house, and the end user is probably spending too much time noting how to deal with a tax drawdown in three different markets across four accounting time zones to think that the voice sounds funny. However, if your product is for sale on the open market and if you want to add value, then I would recommend going for real voices. The energy and emotion that they contribute can make the difference between a successful and interesting product, and something that slips from the mind the moment it finishes.

Ask yourself after making this decision if you know how to contact, manage, prepare and direct these voice talents, and then process, edit and deal with the audio files once they have finished. If the answer to all these questions is yes, then well done and good luck! But if after all that you are still not sure, then the next best thing to do is to speak to a professional who can help make those decisions with you.