Focus

Speech: An old frontier and a new approach

Gilbert Segura has over 19 years of localization experience on both the client and vendor sides, working with Dell, Lionbridge, Siemens, Amazon, Jonckers, Welocalize and others. Currently he is CTO at Compass Languages.

Speech is one of the fundamental modes of linguistic communication. It’s so nearly unconscious that mainstream society only notices it when it’s missing.

We call our industry a “language industry,” yet the majority of our tools and methods are text-based. There are tools and standards for segmentation, terminology, spell-checking, grammar, orthographic conventions and many more, all for text. Even the most famous language tool in all of history, the Rosetta Stone, is a scribe’s aide: a rock with textual information in three written languages.

We also have to admit that spoken words weren’t recorded for most of our history; they don’t survive very long after they are spoken. April 9, 1860, was the first time words were actually recorded, when Frenchman Édouard-Léon Scott de Martinville captured them using his invention, the phonautograph. Soon after, in 1886, a consensus emerged on documenting the sounds people make. That was the birthdate of the International Phonetic Alphabet (IPA), and linguists could finally agree on a method to reliably put speech to text. The IPA encodes all the sounds of human language and distinguishes between sounds written identically in text, such as the th in “this” (expressed as ð in the IPA) and the th sound in “thistle” (θ in the IPA). But these forms went into dictionaries and academic studies; they didn’t have much currency in the world of language translation.

Today, text to speech has undergone a transformation from quaint parlor trick to world-class technology service. It’s available in Alexa, Siri, Cortana and countless other voice-enabled assistants. It’s able to work with automated speech recognition (ASR) to actively transcribe conversations back into text, process requests and then tell you the results. All of these vendors except Apple provide the underlying technologies as services for developers.

We now have robots talking using text, and this technology has been revolutionized in the last four years. The pattern is similar to machine translation. There was the Cold War Captain America stasis where the technology didn’t advance for almost 50 years, and then suddenly it’s here and everywhere.

Text to speech was held back until machine learning and AI came around and broke it free with massive amounts of data; better, more accessible tools; and indisputable real-world results. Text to speech is on the same track as a second wave of natural language processing technologies becomes available. It is now commercially viable, with Google, Amazon and Microsoft advancing the technology as part of their respective cloud offerings in the new world of natural language processing using AI.

Figure 1: Bell Labs’ diagram of the Voder.

Now contrast that change with every tool in the current arsenal of computer-assisted translation software. There are no phonetic tools for processing sound in a multilingual way. What we have is an AI juggernaut versus a bilingual spell-checker on steroids. Four years is a long time in AI advancement, but barely enough time to get a localization standard adopted, let alone implemented in a toolset.

When it comes to multimedia, most localization companies today still have people reading a script into a microphone. That has been our text to speech, just like in the days of radio.

Scientists have been working on text to speech since the 1930s, starting with Bell Telephone Laboratories’ Voder, a funky pipe organ contraption (Figure 1).

Fifty-seven years later, Chrysler managed to get it into a talking car, the 1987 Chrysler New Yorker. One YouTuber described the tech as “nagging,” saying, “When it’s in gear it tells you if you don’t have your seat belt on or if a door is open. It also tells you when the washer fluid, fuel, coolant, and battery is low.”

Past attempts such as this lacked the fundamental phonetic qualities of “good speech,” and they were comically easy to dismiss. Automated voice was hard to do, it was only good for short utterances, it was stilted, and it was not worth the trouble. The same criticisms were made of machine translation at the time. People take language seriously, and these tools were missing the mark.

Machine translation is difficult enough, but speech has to address a whole different family of concerns called suprasegmentals, or prosody. These are the characteristics of speech that make it uniquely difficult, since speech is not just sounds. Prosody covers things like stress, pitch, melody, intonation, the pauses between sounds and dialectal variations of the same word. It is how we get seven different meanings from the seven words of the simple spoken English phrase “I never said she stole my money.” Try it; take any one of those words and stress it. Depending on spoken stress, the meaning could be “someone else stole my money,” or “actually, she was wrongly accused,” or “as it turns out, she stole something else.”

In a textual context you will never see these variables at play, but in speech you suddenly have an array of possibilities. For machines, this has been difficult to measure and replicate. Add dialect, add accents, add a few dozen variables and suddenly sounds are much more complicated than just words in text.

Fortunately, the aforementioned cloud vendors all support something called Speech Synthesis Markup Language (SSML). SSML was adopted in 2010 by the World Wide Web Consortium as part of its standards for the web, and it is very useful for labeling words as sounds, much as HTML is for text. Now we can mark up a phrase, add stress or a particular pronunciation to a part of it, and then create a more realistic sound.
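To make that concrete, here is a minimal sketch, in Python, of what such markup might look like. It builds SSML around the “I never said she stole my money” sentence discussed above, moving an <emphasis> tag from word to word. <speak>, <emphasis> and <break> are standard SSML elements, though how strongly a given cloud voice renders them varies by vendor.

    # A sketch only: generate SSML that stresses one word at a time in the
    # sentence above. Shifting the <emphasis> tag shifts the spoken meaning.
    words = "I never said she stole my money".split()

    def ssml_with_stress(stress_index: int) -> str:
        marked = [
            f'<emphasis level="strong">{word}</emphasis>' if i == stress_index else word
            for i, word in enumerate(words)
        ]
        # A short trailing pause, just to show the <break> element as well.
        return "<speak>" + " ".join(marked) + '<break time="300ms"/></speak>'

    # Seven words, seven stress placements, seven different spoken meanings.
    for i in range(len(words)):
        print(ssml_with_stress(i))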

Using SSML and the IPA also allows us to handle the peculiar case of a word that is not actually in the language being spoken. For example, a brand name like GAP might be called something different in English than in Spanish: in Spanish, the “g” does not typically follow the rules of English pronunciation. For those familiar with terminology, GAP would be a “term” or a named entity. And like a term, the same linguistic considerations for text apply, but now also within the spoken context. This extra dimension, sketched in the example after the list, means that text to speech practitioners need to overlap three domains:

1. Technical expertise for markup and tools
2. Linguistic ability for the written word
3. Linguistic ability for the spoken word
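As a rough illustration of how these three domains overlap, the sketch below (Python again, with a hypothetical mini term base) uses the SSML <phoneme> element and an IPA transcription to pin a brand name like GAP to a specific pronunciation inside a Spanish sentence, overriding whatever the voice would do by default. The /ɡæp/ transcription is the usual English pronunciation; whether a particular engine and voice honor IPA phonemes is an assumption to verify per vendor.

    # Sketch: force the pronunciation of known terms with SSML <phoneme> tags.
    # The term base below is hypothetical; /ɡæp/ is the English IPA for "GAP".
    TERM_PRONUNCIATIONS = {
        "GAP": "ɡæp",  # keep the English brand pronunciation, even in a Spanish voice
    }

    def tag_terms(text: str) -> str:
        tagged = []
        for word in text.split():
            ipa = TERM_PRONUNCIATIONS.get(word)
            if ipa:
                tagged.append(f'<phoneme alphabet="ipa" ph="{ipa}">{word}</phoneme>')
            else:
                tagged.append(word)
        return "<speak>" + " ".join(tagged) + "</speak>"

    # "Visit the new GAP store in the shopping center"
    print(tag_terms("Visita la nueva tienda GAP en el centro comercial"))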

Text to speech practitioners need to adopt an efficient means to navigate all three demands and ask for help from post-editors. This really is like machine translation, as it’s unlikely that raw output would always be good unless it’s under controlled conditions or done by someone familiar with the idiosyncrasies of a given text to speech engine and language.

With the history and basic obstacles covered, let’s examine the three main players operating today in the cloud (Figure 2).

Figure 2: The top three automated voices on the market, as of November 2019.

Each of these three vendors also has some shortcuts to make things simpler. Amazon’s Polly, for example, has a way to support Pinyin transcription for Chinese. Microsoft’s tools can sometimes interpret numbers in Arabic/Hindi script interchangeably. By and large, Google has the most AI research. It can shorten the length of the audio or adjust the pitch or rate of speech, and many custom options are available.
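For a sense of how little glue code sits between a localized script and finished audio, here is a hedged sketch of sending SSML with a prosody adjustment to one of these services. It assumes Amazon Polly through the boto3 SDK, the standard “Joanna” voice and configured AWS credentials; Google and Microsoft expose very similar calls in their own SDKs, and the exact rate and pitch values a given voice accepts should be checked against that vendor’s documentation.

    # Sketch: synthesize SSML with a slower rate and lower pitch via Amazon Polly.
    import boto3

    polly = boto3.client("polly", region_name="us-east-1")

    ssml = (
        "<speak>"
        '<prosody rate="90%" pitch="-10%">'
        "Your order has shipped and should arrive on Tuesday."
        "</prosody>"
        "</speak>"
    )

    response = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",    # the input is markup, not plain text
        OutputFormat="mp3",
        VoiceId="Joanna",
    )

    # The audio comes back as a stream; write it to disk for review.
    with open("order_update.mp3", "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())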

This AI arms race is constantly introducing new technologies, with each vendor scrambling to add more languages, voices or tools. Cloud providers are far ahead of the language industry, and changes are coming quickly: step changes every quarter and major changes once or twice a year. The rapid pace means that the technical debt for language providers will persist as long as these technologies keep evolving, which is likely to be years.

Despite their advances, most tools will stumble on simple things like typographic errors, odd characters or sometimes just bad data when saying a word. A skilled operator is still required. Additionally, the voiceover artist is here to stay. Someone has to do the high-quality work. Just like machine translation, the low-hanging fruit is being taken over by machines and the skill of the individual is no longer wasted saying “next” or “click here.” Post-editors are still needed, and engineers can create the scripts and automate, but someone still has to listen and validate.

The future

On the horizon is a further evolution away from text altogether. Speech to speech, a one-shot transformation of one spoken language into another, is a likely reality in the 2020s. With those changes, we must start asking how the AI landscape will evolve to include the human in the loop, who has always been the source of truth.

Another prospect is voice banking, or cloning, where a person’s voice can be captured from a short bit of sound and then modeled in an engine. This raises an ethical dilemma: the authenticity of a person’s spoken word can now be called into question, and a cloned voice can even be used in cases of fraud. Conversely, it might provide a new monetization model for voice artists, suddenly able to work on multiple projects at once. We could be back to the royalties game, but with customized individual licensing.

In short, this space is evolving rapidly, and with a good-enough replacement now viable, more multimedia projects and eLearning courses will use text to speech. In some cases it might actually be better to hear a diagnosis from a dispassionate robot, like maybe on the latest GDPR training.

So although speech is the oldest frontier in human language, there is still much to learn and innovate.