The automated interpreter

By Hassan Sawaf & Jonathan Litchman December 20, 2011

Earlier this year on Jeopardy, IBM’s Watson won $77,147 and showcased the latest advances in speech recognition and language technology when it answered the Final Jeopardy clue, “William Wilkinson’s ‘An account of the principalities of Wallachia and Moldavia’ inspired this author’s most famous novel,” with “Who is Bram Stoker?”

iPhone’s much-discussed Siri is not only the virtual assistant to millions, but she is fending off marriage proposals with a simplicity and grace that belie the technology’s sophistication: “My End User Licensing Agreement does not cover marriage. My apologies.” In addition to processing long voice commands and synthesizing a response, Siri is interpreting the meaning of language.

Automatic speech recognition (ASR) has been advancing rapidly, hitting the consumer market in full force and making a case for being the hottest technology of 2011 and possibly 2012. Today, the most sophisticated consumer and enterprise machine translation (MT) engines have fully integrated ASR technology for speech-to-speech and speech-to-text translation capabilities. Simply put, ASR is a game changer with far-reaching implications for businesses, consumers, translators and the language service industry. Imagine traveling abroad and holding conversations without having to learn the language. Machines are breaking the language barrier and enabling cross-lingual conversations. The technology is here, after more than 40 years of development and research.

Many people are surprised to hear that at its core, ASR technology is actually fairly old, even ancient in technology years. In fact, ASR was first applied in the 1970s by Faceplate manufacturers and FedEx, in parallel with the Defense Advanced Research Projects Agency’s SUR project, which ultimately resulted in a system that could recognize about 1,000 words. The earliest versions of ASR were able to understand digits and simple words and were precursors to the automated systems employed by airlines and banks today, with which many of us have become all too familiar: “If you would like to speak to a customer representative, say representative.”

ASR took another big step during the IT revolution of the 1990s. In 1991, Philips launched the VOICE System 4000, which was capable of longer, more complex dictation and quickly became standard in hospitals. Dictation systems were also popular in law offices and other industries where paperwork and forms dominated. The government was also developing and utilizing ASR systems for security and monitoring applications.

In combination with the vast increase in computational power, the internet played a large part in fueling the latest breakthrough in today’s ASR technology. ASR technology is now able to gather more context for the input it receives and is able to use statistical probability to determine the most likely output. In other words, the computer is now thinking and interpreting the user’s meaning. However, challenges remain.
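The idea of using statistical probability to pick the most likely output can be illustrated with a minimal Python sketch. It assumes a recognizer has already produced two similar-sounding candidate transcripts, each with an acoustic score, and re-ranks them with a tiny bigram language model. The probabilities, the names BIGRAM_LOGPROB, language_score and pick_transcript, and the example phrases are all invented for illustration; they do not describe any particular product.

```python
import math

# Hypothetical bigram log-probabilities standing in for a language model
# trained on large amounts of text. All numbers are invented.
BIGRAM_LOGPROB = {
    ("<s>", "recognize"): math.log(0.02),
    ("recognize", "speech"): math.log(0.30),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.20),
    ("a", "nice"): math.log(0.05),
    ("nice", "beach"): math.log(0.01),
}
UNSEEN_LOGPROB = math.log(1e-6)  # crude floor for word pairs never observed


def language_score(words):
    """Sum the bigram log-probabilities of a word sequence."""
    total = 0.0
    previous = "<s>"  # sentence-start marker
    for word in words:
        total += BIGRAM_LOGPROB.get((previous, word), UNSEEN_LOGPROB)
        previous = word
    return total


def pick_transcript(candidates):
    """Return the text whose acoustic + language log-score is highest.

    candidates: list of (text, acoustic_log_probability) pairs.
    """
    best = max(candidates, key=lambda c: c[1] + language_score(c[0].split()))
    return best[0]


# The acoustics alone slightly favor the wrong transcript here.
candidates = [
    ("wreck a nice beach", math.log(0.55)),
    ("recognize speech", math.log(0.45)),
]
print(pick_transcript(candidates))  # prints: recognize speech
```

In this toy case the sound alone favors the wrong phrase, and the statistics about which words tend to follow one another tip the decision toward the sensible one.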

Context challenges

While ASR and MT both use computers to interpret meaning, ASR is much more sensitive to incorrect use. The main reason is that many additional variables influence the quality of ASR, including the positioning of the microphone, background noise, accents and speech patterns.

When translation enters the picture, the obstacles facing ASR multiply exponentially. The largest challenge by far is a lack of context. For example, if the user would like to translate an English sentence into Arabic, the automated interpreter does not know whether it is addressing a single male or a group of females. This knowledge is critical for the machine to translate the sentence correctly, as some languages use gender and number to generate the correct form of a term. An unwitting mistake can offend the addressee, as the linguistic differentiators have significant cultural meaning.

A lack of context is also to blame for the classic MT mistake when a word has multiple meanings and the machine does not know which to select. For example, in translating “How do you get to the bank?” the user could be referring to the financial institution or the river, depending on the location and whether he or she has a checkbook or a fishing pole in hand.

This is why, even with the multiple approaches and variables that impact ASR and translation quality, the accuracy of a system correlates most strongly with the amount of effort and time spent tailoring the machine for a specific context. In the advertising world, for example, copy refers to the text in an advertisement, whereas in most industries the intended meaning is to duplicate. Untailored systems miss such distinctions, which leads to misconceptions regarding the quality of MT in the translation community, as many are unfamiliar with tailored, more accurate systems.
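The “bank” and “copy” ambiguities can be made concrete with a small Python sketch that picks a word sense by counting context cues. This is a deliberately simplified, hand-built stand-in for tailoring a system to a context: the cue lists, the SENSE_CUES table and the pick_sense function are hypothetical, and a real system would learn such associations from domain data rather than list them by hand.

```python
# Hypothetical cue words for each sense of an ambiguous word.
SENSE_CUES = {
    "bank": {
        "financial institution": {"checkbook", "deposit", "loan", "money"},
        "river bank": {"fishing", "pole", "river", "boat"},
    },
    "copy": {
        "advertising text": {"ad", "campaign", "headline", "slogan"},
        "duplicate": {"file", "printer", "paste", "scan"},
    },
}


def pick_sense(word, context_words):
    """Return the sense whose cue words overlap most with the context.

    Returns None when the context offers no cue at all, which is exactly
    the situation in which an untailored system has to guess.
    """
    context = {w.lower() for w in context_words}
    scores = {
        sense: len(cues & context)
        for sense, cues in SENSE_CUES[word].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Calling pick_sense("bank", ["fishing", "pole"]) selects the river sense, while pick_sense("bank", []) returns None because nothing in the context points either way.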

Approaches to MT

ASR and MT were two very divergent fields until the 1990s, when there was a move toward integrating the two technologies. ASR was housed in the engineering and signal processing disciplines, while MT was a focus of the linguistics, literature and art fields. With speech being central to language and human communication, this integration was only a matter of time, but the internet, with its large amount of video and audio data, drove the integration of ASR and MT.

Integrating ASR and MT into a single platform is important to ensure there are no errors in translation that occur when one engine’s corpus has a word the other does not. This is cutting-edge technology, and there are few fully integrated products available.

The type or approach of the MT engine is a significant factor in the successful integration of ASR. For those unfamiliar with MT technology, there are three main approaches: rule-based (RBMT), statistical (SMT), and hybrid (combining rule-based and statistical).

RBMT applies hand-crafted rules to analyze text and translate it from one language to another. Because languages are filled with irregularities, words with multiple meanings and phrases with meanings beyond their literal translation, RBMT on its own might miss some semantic meaning and can be less accurate, or simply more difficult to read, than other approaches. Siri is an example of a rule-based language-comprehension system behind statistical ASR.

SMT uses previously translated content to determine which words and phrases have the highest probability of conveying the correct meaning. However, its complex algorithms require a significant amount of translated data in electronic form to be effective. Watson is an example of a statistics-based language-comprehension system behind statistical ASR.
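As a rough illustration of the statistical idea, the Python sketch below estimates translation probabilities by counting how often each target phrase was paired with a source phrase in previously translated content, then picks the most frequent one. The aligned pairs are a made-up toy sample, and build_phrase_table and most_likely are hypothetical helper names; production SMT systems combine many more models than this single relative-frequency estimate.

```python
from collections import Counter, defaultdict


def build_phrase_table(aligned_pairs):
    """Estimate P(target | source) by relative frequency.

    aligned_pairs: iterable of (source_phrase, target_phrase) tuples taken
    from previously translated content.
    """
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1

    table = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        table[source] = {t: n / total for t, n in targets.items()}
    return table


def most_likely(table, source):
    """Return the target phrase with the highest estimated probability."""
    options = table[source]
    return max(options, key=options.get)


# Toy English-to-Spanish sample: "bank" was translated as "banco" eight
# times and as "orilla" twice in this invented corpus.
pairs = [("bank", "banco")] * 8 + [("bank", "orilla")] * 2
table = build_phrase_table(pairs)
print(most_likely(table, "bank"))
```

The sketch also shows why data volume matters: with only a handful of translated examples, the estimated probabilities are too coarse to be trusted.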

A true hybrid approach integrates rule-based and statistical methods into a single engine. This has several advantages for MT, ASR and their integration. In 2006, the National Institute of Standards and Technology’s Open Machine Translation Evaluation showed that hybrid approaches had the highest accuracy, especially on noisy data (speech and inaccurate text input). Also, combining the two methods allows the system to do more with less. Hybrid systems can translate phrases or sentence fragments, and they can be developed for new languages faster and with less training data. In other words, hybrid systems use the best of both worlds to overcome the context challenges human language technology faces.
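One simple way to picture a hybrid engine is a rule that filters statistically scored candidates. In the Python sketch below, a hand-written agreement rule keeps only the translation candidates whose gender and number match what is known about the addressee, and the statistical score then chooses among the survivors. The candidate texts are placeholder labels, the probabilities are invented, and hybrid_pick is a hypothetical function; real hybrid engines integrate the two methods far more tightly.

```python
def hybrid_pick(candidates, addressee=None):
    """Choose a translation using a rule first, then statistics.

    candidates: list of dicts with 'text', 'prob', 'gender', 'number'.
    addressee:  optional dict with 'gender' and 'number', when known.
    """
    allowed = candidates
    if addressee is not None:
        # Rule-based step: enforce gender and number agreement.
        matching = [
            c for c in candidates
            if c["gender"] == addressee["gender"]
            and c["number"] == addressee["number"]
        ]
        if matching:  # fall back to all candidates if the rule rejects everything
            allowed = matching
    # Statistical step: the most probable remaining candidate wins.
    return max(allowed, key=lambda c: c["prob"])["text"]


# Placeholder forms of one sentence; the probabilities are invented.
candidates = [
    {"text": "FORM_MASC_SG", "prob": 0.6, "gender": "m", "number": "sg"},
    {"text": "FORM_FEM_PL", "prob": 0.1, "gender": "f", "number": "pl"},
    {"text": "FORM_FEM_SG", "prob": 0.3, "gender": "f", "number": "sg"},
]
```

With no information about the addressee, hybrid_pick(candidates) returns the statistically most common form; once the addressee is known to be a group of women, hybrid_pick(candidates, {"gender": "f", "number": "pl"}) returns the feminine plural form even though it is rare in the data.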

With ASR technology advancing rapidly and becoming fully integrated into MT, its impact on businesses competing in an increasingly global marketplace promises to be immense. ASR is not likely to be used for formal or important cross-lingual business matters such as speeches or board meetings. However, it will enable informal cross-lingual communication on a scale where human interpreter involvement is impractical. This is not an insignificant achievement. Human interaction and communication form the core of business and real-time meetings, which are often where real work is accomplished.

Over the years, technology has become easier to use, less cumbersome and more personalized, which will further extend ASR’s impact on the business world. For example, the early days of video conferencing featured a plethora of bulky equipment including heavy monitors and large cameras. At the time, it seemed unlikely that this equipment would be set up for anything but large, important meetings.

Today, most desktop computers, laptops and even phones have video conferencing capabilities, enabling small, informal and impromptu meetings to be held on an unprecedented scale. ASR and MT will allow these meetings to be held across borders and languages so that businesses can be better coordinated and more efficient. Video conferencing became the industry standard as it grew more user-friendly, and ASR and MT technology are on the same trajectory.

In addition to the vital role ASR and MT will likely play in internal communications within the enterprise, they will have an external role as well. ASR and MT will be able to help businesses cross the language divide to communicate with their customers.

For example, the New Jersey Department of Health is currently issuing tablet devices with speech translation programs to health care workers in clinics across the state so they can better communicate with patients who do not speak English. While the program is in its early stages, so far it has been deemed a success, and it illustrates that the regular use of ASR for translation services in the workplace is not a distant prospect; it is here today.

Future consumerization

ASR technology has made and will continue to make waves in the consumer marketplace. Smart phones have utilized ASR for years, with Siri being the latest, most advanced example. While it remains to be seen if ASR will be used for lengthy tasks or processing documents, ASR holds tremendous promise to add productivity to people’s everyday lives by more quickly accomplishing small tasks throughout the day.

Integrated ASR and MT consumer applications are another story. The technology is still very advanced for the consumer market. While there are multiple consumer applications for speech translation available for smart phones, in reality they are just different user interfaces utilizing Google Translate as the main speech translation technology. There are even phone commercials urging people to rethink the possible by showing an English speaker using his phone to translate his sentence for an elderly Italian man, despite the company not having the technology in-house. In other words, the consumer technology is not as widespread as commercials or the smart phone app market would lead you to believe.

However, other technology companies are developing competing offerings for consumers. The next generation of consumer speech translation applications will be tailored for a specific purpose or industry, such as travel. This will provide greater accuracy by contextualizing the environment or purpose for which the technology will be deployed. Also, consumers will be able to utilize these applications when they do not have an internet connection, which is not always available when traveling.

ASR and MT will not likely be used in formal situations or for long communications such as lectures; rather, they will enable informal conversations at a level where the expense of a human translator doesn’t make sense. No longer will travelers have to flip through an English-to-Spanish dictionary to ask a Madrileño where they can find the nearest bathroom.

The automated interpreter will enable natural conversation in real time, but human translators and interpreters need not worry. MT has quickly developed from a novelty to a technology that has redefined the language service industry. Today, most language service providers (LSPs) use some form of MT to increase the efficiency of their human translators. Even freelance translators have adopted MT to a stunning degree. A survey presented at ProZ.com’s Great Translation Debate revealed that 25% of freelancers used MT three years ago, compared to the 60% who do so now.

ASR does not increase the efficiency of translation as MT does, so it is more likely to enable translation on a more informal level and meet unmet demands than to redirect business or modify translator work processes. ASR combined with MT will potentially increase business opportunities for translators and LSPs, even though it is unlikely to affect the interpretation business. The increase of multimedia content on the web increases the opportunity for translation if the speech is converted into transcripts. Many of these transcripts will need to be post-edited and will increase the demand for human post-editors. Sheer volume makes it impractical for the vast amounts of multimedia online to be translated by humans alone.

ASR may also reinforce a concept for translation services that MT has already introduced — the idea of “acceptable inaccuracy.” Businesses will continue to determine the level of accuracy needed for specific purposes and will seek the most cost-effective offering at that level of translation. Automated translation and inaccuracies may be acceptable in an internal chat, whereas human translation and full localization will be required for an ad campaign. In other words, businesses will want to control the level of translation they need and pay accordingly.

Advances in both ASR and MT will move the technology beyond word-for-word translation toward advancing the art and meaning of communication for businesses and individuals. More languages will be supported. It’s entirely possible and even plausible that the language barrier will disappear within the next decade. The translations will become more accurate as the technologies are able to gather additional context from inputs outside of speech, such as GPS or optical technologies.

The applications for the technology in the immediate future are numerous as well. The technology could be used to automate closed captioning across languages, and the implications for the traditional business supply chain are tremendous. It’s promising technology, even if it’s not quite Star Trek’s Universal Translator yet.