Inter-Language Vector Space
The next frontier for AI in localization
Andrzej Zydroń
Andrzej Zydroń is the chief technical officer at XTM International and the technical architect of XTM Cloud. With over 30 years’ experience in the industry, he sits and has sat on many open standard technical committees.
Have you ever overheard a conversation in a language you don’t really know, caught a familiar-sounding word, and intuitively felt that you more or less knew what people were saying based on context?
Maybe you hear the word limón, and you think that sounds like lemon, and figure they are talking about food produce. Then you pick out the word lechuga and it kind of sounds like lettuce, then atún which sounds like tuna. So you smile, knowing that your initial guess about food is probably (and magically) right. Then you get curious about what they were talking about, perhaps a nice tuna salad recipe.
Your AI technology (your brain) has a dataset of words and connections it has gathered, and it can use this map to make smart assumptions even across languages it doesn’t necessarily understand.
We have developed an NLP AI technology framework called inter-language vector space (ILVS) that does something similar within and across languages and seems almost magical in its operation.
First, though, let’s take a look at the AI landscape generally and within our industry, and at why AI is gaining both traction and deployment across the language industry as a whole. Artificial intelligence (AI) has become a generic placeholder for smart technology and it is embedded in every aspect of our lives. From autonomous vehicles and ride-sharing apps, through AI-powered disease diagnosis and treatment and intelligent incident detection, to intelligent search recommendations and ad personalization, AI is a constant.
According to a recent Forbes magazine article, it is the ability to learn from and act upon data that is critical to AI’s exponential adoption; some are calling this the intelligence revolution. AI needs data, and lots of it, in order to learn and make smart decisions. When you consider the sheer volume of data that surrounds us today, this gives us a clue as to why the intelligence revolution is happening now.
Simply put, we’re in the middle of the fourth Industrial Revolution, driven by AI and big data, and the localization industry is not immune to its impact.
The evolution of AI in localization
AI has started making inroads into localization relatively recently. The biggest impact beyond a doubt has been machine translation (MT), specifically when it evolved into neural machine translation (NMT) and became standardized as the industry AI of choice. NMT has now reached a saturation point and has lost its disruptive potential. As Jean Senellart (CEO of Systran, a leading machine translation provider) recently claimed, “NMT as a tool is done.” Senellart said that in a few months, NMT totally changed the research on MT, but that its cutting-edge impact subsequently waned. “Since then there has been improvement, quite a lot the first year, and less and less over the following years.”
Simply put, traditional MT and NMT are where AI’s impact has been the most profound in the localization industry. Both technologies have evolved, made substantial progress, and reached a point where further innovation is limited.
AI beyond NMT
In the midst of the intelligence revolution, driven by data and AI, natural language processing (NLP) has made advances. Organizations have more data than ever before, and current computing power enables the storage, processing, and analysis of data to be smarter, faster, and performed in a more controlled way. The advances in AI, data processing, and computing power lead to what Markus Meisl of SAP SE recently referred to as the “intelligent enterprise” with a “new, more distributed type of intelligence across all areas of a company.” Intelligent companies develop quickly and deploy faster within an agile, integrated technology environment.
Current use of NLP in localization
It is surprising that up to this point, most localization technologies have used little or no linguistic NLP technology such as morphological reduction or speech analysis. This is all about to change. Many of the required software libraries are now available, as are the big data repositories needed for the task. However, as with many innovative technologies we have seen, the key is not merely using them, but knowing how to use them effectively.
One such technological breakthrough worth looking at is ILVS, something that Google, Facebook and Babylon Health are currently involved with.
ILVS arrives on the localization scene
ILVS is a neural network-based technology framework that is able to work out relationships between words and how close their meanings are to one another. Each word is associated with a mathematical vector of 300 values which uniquely describes the word within the corpus and its relationship with other words. It’s as if every word is represented with its own unique fingerprint. The resultant word-based data structures for the corpus are known as vector spaces.
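To make "how close their meanings are" concrete, here is a minimal sketch of one common way to compare word vectors: cosine similarity. The article does not say which measure ILVS uses, so treat this as an assumption. The vectors below are invented and have 4 values for readability, where real ones would have 300.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors for illustration only; not taken from any real model.
vectors = {
    "dog":   [0.9, 0.8, 0.1, 0.0],
    "puppy": [0.8, 0.9, 0.2, 0.1],
    "lemon": [0.0, 0.1, 0.9, 0.8],
}

def nearest(word):
    """Return the other word in the toy vocabulary whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(nearest("dog"))
```

With these toy values, "dog" sits much closer to "puppy" than to "lemon", which is the kind of relationship a vector space is meant to capture.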
ILVS is a technology framework distinct from NMT. Both are complementary and when combined give users a very powerful toolset.
If you were to compare NMT to a map intended to get you somewhere — a top-down approach — ILVS instead allows for a bottom-up approach focusing on specific landmarks on a route. It zooms in on the exact location of mountains or lakes in the area, for example.
Adding a multilingual layer to vector space produces accurate results, aiding the human translation process. ILVS is able to analyze correspondence between words in the source and target sentence (Figure 1). This specific piece of information, not available in NMT, can be used to identify potential translation errors, transfer technical tags, and much more.
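The following sketch shows one way correspondence between source and target words could be turned into probabilities. It assumes both languages have already been mapped into a single shared vector space, and it uses a softmax over cosine similarities. The vectors, the `temperature` parameter, and the function names are all invented for illustration; the article does not describe the actual calculation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented toy vectors, assumed to live in one shared cross-lingual space.
en = {"tuna": [0.9, 0.1, 0.1], "lettuce": [0.1, 0.9, 0.1], "lemon": [0.1, 0.1, 0.9]}
es = {"limón": [0.2, 0.1, 0.8], "atún": [0.8, 0.2, 0.1], "lechuga": [0.1, 0.8, 0.2]}

def alignment_probabilities(source, target, temperature=0.1):
    """For each source word, return a probability distribution over target words.

    Probabilities come from a softmax over cosine similarities (an assumption);
    a lower temperature makes the distribution sharper.
    """
    table = {}
    for s_word, s_vec in source.items():
        scores = {t: math.exp(cosine(s_vec, t_vec) / temperature)
                  for t, t_vec in target.items()}
        total = sum(scores.values())
        table[s_word] = {t: score / total for t, score in scores.items()}
    return table

probs = alignment_probabilities(en, es)
# Most probable target word for each source word.
best = {s: max(row, key=row.get) for s, row in probs.items()}
print(best)
```

Each row of `probs` sums to 1, so a row can be read as "how likely is each target word to correspond to this source word".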
The concept of vector space in NLP terms came into being thanks to research conducted by Google and Facebook. In 2013, Google showed that by using their own news corpora, algorithms, and a vast neural network, you can predict the current word based on the context, and the surrounding words given the current word. This is to say that when the algorithm is given a sentence with a gap such as “the _ continues to bark” it is able to predict that the word to fill the gap should be “dog.”
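The two prediction tasks described above are usually called continuous bag-of-words (predict the current word from its context) and skip-gram (predict the surrounding words from the current word). The sketch below only shows how training examples for each task could be cut from a sentence; it does not train a network, and the function names and window size are illustrative assumptions.

```python
def cbow_examples(tokens, window=2):
    """Build (context words, current word) pairs: the context predicts the word."""
    examples = []
    for i, word in enumerate(tokens):
        # Up to `window` words on each side, excluding the current word itself.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, word))
    return examples

def skipgram_examples(tokens, window=2):
    """Build (current word, surrounding word) pairs: the word predicts its context."""
    return [(word, ctx)
            for context, word in cbow_examples(tokens, window)
            for ctx in context]

sentence = "the dog continues to bark".split()

# The "gap" example from the text: the context around position 1 should predict "dog".
print(cbow_examples(sentence)[1])
print(skipgram_examples(sentence)[:3])
```

A model trained on millions of such examples ends up placing words that appear in similar contexts close together in the vector space.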
Figure 1: Vector space probabilities across two languages.
This new framework utilizes advanced AI algorithms and well more than 200 terabytes of textual data (the size of the Bible multiplied over 40 million times) — massive bilingual dictionaries and the crawl of all of the internet across 250 languages (upwards of 31,000 language pairs) to identify connections between source and target words. Importantly, all of the data is accessible from publicly available resources, completely eliminating the risk of customer or private data breaches. In fact, the only information that can be retrieved from ILVS is the value for the probability calculation, so the actual data is not stored. Not a single sentence of private data was used to train ILVS; all its resources had been previously published on the internet.
Using NMT without ILVS is a bit like driving on a road with no signs: NMT is the road, and ILVS supplies the road markings and the distance markers.
Figure 2: Matching words are highlighted in a translated sentence.
Why is ILVS disruptive?
Let’s delve into what makes ILVS such an important AI technology. In contrast to current mechanisms for performing bilingual word alignments, ILVS ensures that the probability of word alignments is calculated instantly; in most cases the whole process takes well under a second. Take, for instance, the translated sentence in Figure 2. Matching words are highlighted by black rectangles.
The results of ILVS word alignment enable new opportunities, such as the identification of potential translation errors (for example, words with low matching probability to source words), translation suggestions, and many more applications not yet identified. ILVS is also able to identify potential translation candidates even if they have never appeared in the available resources. This is very useful when creating multilingual glossaries: ILVS can detect even highly specialized narrow-domain terms.
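As a rough illustration of the error-identification idea, the sketch below flags target words whose best matching probability against the source sentence falls under a threshold. The probabilities, the threshold of 0.3, and the function name are hypothetical; in practice the scores would come from the alignment step, and the cutoff would need tuning.

```python
def flag_suspect_words(target_tokens, match_probability, threshold=0.3):
    """Return target words whose best match probability is below `threshold`.

    `match_probability` maps each target word to its highest alignment
    probability against any source word; missing words count as 0.0.
    """
    return [t for t in target_tokens if match_probability.get(t, 0.0) < threshold]

# Source: "I ordered a tuna salad". The target mistranslates "tuna" as "limón".
target = ["Pedí", "una", "ensalada", "de", "limón"]

# Hypothetical scores for illustration only.
scores = {"Pedí": 0.82, "una": 0.74, "ensalada": 0.91, "de": 0.40, "limón": 0.08}

print(flag_suspect_words(target, scores))
```

Only "limón" is flagged here, which is the word a reviewer would want to look at first.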
Use cases of ILVS
ILVS technology opens up a new frontier for research and feature development in translation management and related technology. It can assist localization stakeholders with tedious, repetitive operations such as bilingual term extraction, inline element placement, and automatic corpus alignment. Let’s explain a few of these use cases.
Building terminology from translated documents or a corpus of documents is a common task for localization teams. Performed manually, it is arduous and takes time. By using ILVS, it is possible to create functionality that makes better judgements on what entries are actually terms and should be extracted. Using ILVS, you can mimic what the human brain would do to extract accurate term candidates. The net impact of doing this is that the heavy lifting part of term extraction is done, and linguists can focus on higher value activity and reviewing the final output.
We have seen quite amazing results with this type of functionality; it takes on average 85% less time to create glossaries, and there is a knock-on reduction in cost as a more accurate and consistent terminology database is built. We see 90% accuracy levels for term candidates across 50 languages.
Linguists perform dozens of routine operations in addition to actual translation. Transferring inline elements, which entails changing fonts or inserting hyperlinks from a source to target segment, requires considerable legwork by the linguist, and this is where vector space enabled functionality can offer a solution. Whenever a linguist comes across an inline element in a source segment, algorithms automatically position it in the correct location in the target segment, like magic. There’s no need for linguists to spend time moving functional elements around segments. Auto-inline placement frees up the translator’s time so they can concentrate on more creative aspects of their work.
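A minimal sketch of how word alignment could drive inline placement: given which source words a tag wraps, and which target word each source word aligns to, the tag is wrapped around the corresponding target words. The function name, the index-based alignment format, and the fallback behavior are assumptions made for illustration, not a description of the actual feature.

```python
def transfer_inline_tag(target_tokens, alignment, tag_span, open_tag, close_tag):
    """Wrap the target words aligned to a tagged source span with the same tag.

    alignment: dict mapping source token index -> target token index.
    tag_span:  (start, end) source token indices covered by the tag, inclusive.
    """
    start, end = tag_span
    aligned = [alignment[i] for i in range(start, end + 1) if i in alignment]
    if not aligned:
        # No aligned words to anchor the tag on: leave the segment untouched
        # so the linguist can place the tag manually.
        return " ".join(target_tokens)
    first, last = min(aligned), max(aligned)
    out = []
    for j, token in enumerate(target_tokens):
        if j == first:
            token = open_tag + token
        if j == last:
            token = token + close_tag
        out.append(token)
    return " ".join(out)

# Source: "Click the <b>red button</b>" -> tag covers source tokens 2 and 3.
target = ["Haga", "clic", "en", "el", "botón", "rojo"]
alignment = {0: 1, 1: 3, 2: 5, 3: 4}  # hypothetical word alignment

print(transfer_inline_tag(target, alignment, (2, 3), "<b>", "</b>"))
```

Note that the word order is reversed in the target ("botón rojo" for "red button"), which is why the sketch wraps the span from the leftmost to the rightmost aligned word rather than copying positions directly.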
Depending on the subject matter, auto-inlines can provide up to 20% improvement in translator productivity. This increases up to 80% for MT post-editing.
Aligning a previously translated document or corpus of documents in order to create translation memory (TM) is another cause of productivity loss for project managers and linguists. Quite often the source and target documents available are not from the same version, and the text may be segmented differently in the source and target versions. The application of auto-alignment functionality can yield effort and cost savings of as much as 90% or more for new translation projects where no previous TM exists.
ILVS technology provides a baseline for building further functionality. Here are some of the types of future functionality that ILVS makes possible:
- Post-edit verification tool allowing users to flag MT, or even human translated segments, that contain errors.
- MT evaluation feature to select the best performing MT engine for content.
- Word and phrase suggestions as linguists translate.
- Automatically learning and implementing MT post-edit corrections.
- Other features based on learning from linguists as they translate.
Conclusion
AI in localization has gone through a first wave of maturity, moving from an emerging technology to an essential asset. In the era of the intelligence revolution, NLP is a natural successor to NMT and will continue to be a fertile ground for innovation. The positive effects of ILVS as a subset of NLP will spill over to people, resources, and processes, causing localization to be faster and more cost-effective. Future innovations in localization will be a mix of new technological breakthroughs, such as the development and deployment of new frameworks like ILVS, and innovations in the use and curation of already-existing platforms such as NMT. The next time you find yourself in a foreign destination (hopefully in the not too distant future), try your own in-built ILVS features when you order that cup of café or a glass of vino.