Where machine learning falls short in localization

The first conversations around machine learning date back as far as 1949, when the first books on the topic were published. However, it wasn’t until about ten years ago that machine learning really took off, as organizations clamored to understand the latest technology and integrate it into their software offerings. The general population started to take note after the fateful 2011 episode of Jeopardy! in which IBM Watson’s machine learning technology beat long-reigning champions Ken Jennings and Brad Rutter.

But what exactly is machine learning, and what are its capabilities today? In short, machine learning algorithms figure out how to accomplish important tasks by generalizing from previous examples. In the translation space, machine learning is proving extremely valuable for natural language processing (NLP), helping to identify regularly used words and phrases and automatically translate them into other languages. Gmail’s AI features are examples of NLP already easing routine work: Smart Reply suggests responses to emails based on previous messages, and Smart Compose offers predictive writing suggestions as you draft an email.

While NLP capabilities in translation sound great, issues arise from the complexity of natural languages, which are highly irregular, frequently evolving, and full of local dialects. Adding even a single word can drastically change the meaning of a phrase and force changes on all the surrounding words. These complexities make working with machine learning and language extremely expensive, in both time and computation.

One problem machine learning has already solved for the translation industry is poor segmentation of source content. Before machine learning, source content was translated one chunk at a time, often resulting in unusable material because each system views content differently. Now, machines build language and word alignment models between languages using statistical machine translation methods, which allow you to go back and make necessary edits to content after it has been translated. This sort of machine translation tool lets users input source-language text, and the engine produces a complete and mostly accurate (albeit perhaps too literal) translation in the desired language by leveraging previously translated content. While this saves time on commonly used content, it is only the first of many solutions machine learning could offer the translation process.
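The idea of leveraging previously translated content can be sketched as a simple translation-memory lookup: match a new segment against segments translated before, and reuse the stored translation when the match is close enough. This is a minimal illustration only, not any vendor’s actual implementation; the memory entries, the 0.8 threshold, and the character-level similarity measure are all assumptions for the sketch.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: previously translated segments (EN -> DE).
translation_memory = {
    "Click the button to continue.": "Klicken Sie auf die Schaltfläche, um fortzufahren.",
    "Your password has been updated.": "Ihr Passwort wurde aktualisiert.",
}

def lookup(segment: str, threshold: float = 0.8):
    """Return the stored translation of the closest matching source segment,
    or None if nothing in the memory is similar enough to reuse."""
    best_score, best_target = 0.0, None
    for source, target in translation_memory.items():
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_target = score, target
    return best_target if best_score >= threshold else None

# A near-match reuses the earlier translation; unrelated text gets no match.
print(lookup("Click this button to continue."))
print(lookup("Completely unrelated sentence about the weather."))
```

Real systems use far more sophisticated matching, but the principle is the same: the engine saves time precisely because common segments recur.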

Despite being introduced 70 years ago, machine learning is still considered a new initiative and has not yet been fully optimized for the translation industry. This is especially true when it comes to localizing content rather than just translating it, and as a result the various methods still fall short in several ways.

Converted content

A significant amount of translated content contains information that needs to be converted, which risks throwing off the system. Recognizing different date formats is a relatively straightforward task, and machine learning can typically handle those conversions. Units of measurement, however, are harder to recognize and convert properly. Identifying the intent of a strong tag spanning a few words, and then applying it correctly once the target language changes the word order, makes translation more difficult. Adding numbers into the mix, sometimes without any indication that they are numeric (think of the word one versus the numeral 1), makes things even more challenging. Even human translators face these problems when dealing with very complicated structures in such content.
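The measurement problem shows up even in a short sketch. Assuming a hypothetical pre-processing step that converts miles to kilometers before translation, a pattern-based converter catches the numeral form but silently misses the spelled-out one, exactly the one-versus-1 gap described above:

```python
import re

# Hypothetical helper: convert explicit "<number> miles" mentions to kilometers.
MILES_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*miles?\b", re.IGNORECASE)

def convert_miles_to_km(text: str) -> str:
    """Replace numeric mile measurements with their metric equivalents."""
    def repl(match: re.Match) -> str:
        km = float(match.group(1)) * 1.609344
        return f"{km:.1f} km"
    return MILES_PATTERN.sub(repl, text)

print(convert_miles_to_km("The trail is 5 miles long."))
# -> The trail is 8.0 km long.

# The spelled-out measurement slips through unconverted:
print(convert_miles_to_km("The trail is five miles long."))
# -> The trail is five miles long.
```

Handling spelled-out numbers, mixed units, and locale-specific formats is where a simple rule like this stops scaling and a learned model (or a human) has to take over.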

Subtle word alignment

While machine learning helps with improved word alignment, there is still room to master the art of subtle word alignment. Word meaning varies heavily, and in incredibly nuanced ways, with the context of the surrounding sentence, paragraph and entire text. This makes even basic alignment extremely difficult; mastering it requires thoroughly built-out models that understand context across different linguistic and cultural lines.
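A toy example shows why alignment from statistics alone is fragile. This hypothetical sketch aligns words by raw co-occurrence counts over a three-sentence English–Spanish corpus (the corpus and method are illustrative assumptions, far simpler than real statistical alignment models). A content word like house comes out fine, but a function word like the, whose translation depends entirely on context, ties across candidates:

```python
from collections import Counter
from itertools import product

# Tiny hypothetical parallel corpus (English -> Spanish).
corpus = [
    ("the house", "la casa"),
    ("the cat", "el gato"),
    ("a house", "una casa"),
]

# Count how often each (source word, target word) pair appears together.
cooccur = Counter()
for src, tgt in corpus:
    for s, t in product(src.split(), tgt.split()):
        cooccur[(s, t)] += 1

def candidates(word: str) -> dict:
    """All target words seen alongside the source word, with their counts."""
    return {t: c for (s, t), c in cooccur.items() if s == word}

def best_alignment(word: str) -> str:
    """Pick the target word that co-occurs most often with the source word."""
    scores = candidates(word)
    return max(scores, key=scores.get)

print(best_alignment("house"))  # 'casa': it appears in both 'house' sentences
# 'the' co-occurs once each with 'la', 'casa', 'el' and 'gato', so counts
# alone cannot decide between the correct 'la' and 'el' without context.
print(candidates("the"))
```

Scaling this up does not remove the problem; it is why alignment models need context, not just counts.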

The problem is even worse for things like “romance copy,” where the meanings of words are stretched to their limits and have to be completely rewritten when crossing into a new language or locale. An example of the slightest change making a world of difference across cultural lines comes from the language spoken in Pormpuraaw, a small Aboriginal community in northern Australia, which describes space using cardinal direction terms. Speakers don’t say “There is an ant on your left leg”; they say “There is an ant on your southeast leg.” A machine would never be able to know which way the speaker was facing when making that statement. Such languages may be incapable of direct literal translation.

Content categories

Each content category brings a whole new set of material to be translated, each distinct enough to almost be considered its own dialect. Whether the vertical is legal, healthcare or academic content, each category comes with specific jargon that does not exist the same way anywhere else in the greater language. Some companies’ offerings, such as IBM Watson Language Translator, include neural machine translation models that have been optimized for specific verticals, so those industries can begin to meet content halfway with machine learning.

Another category to consider is content from popular culture, which evolves much faster than established industry content. Think of slang or terms coined in movies, television or social media: these constantly change in meaning, making the use of machine learning nearly impossible. Even the way we speak to each other at breakfast, text casually and write papers for school can differ greatly. Each new content category increases the size and complexity of the model needed to properly understand the context in which the language is being used.

The numerous ways in which machine learning still falls short with localization demonstrate just how many uses for machine learning exist beyond word-for-word or phrase-for-phrase machine translation. Many of these challenges will be solved over time, and perhaps machines will learn to better convert numeric content. Some translations, however, are much more difficult to teach, like constantly changing pop culture references. The solution here is a combination of machines and humans. Human translators have the personal knowledge and cultural insight needed to truly localize content in ways a machine currently cannot, and perhaps never will.