Neural machine translation (MT) is a topic that is on everyone’s lips these days. The technology has its origins in academia, and there has been some great in-depth coverage of how it actually works, ranging from broader expositions of how neural networks and deep learning can be applied to the challenge of translation to deep dives into the nuts and bolts (read: mathematics) of neural MT and its variants.
While these technical details are comprehensive and easily found (if not as easily understood), there hasn’t been much discussion of the practical impact of this technology on the translation and localization industry, in either the short or the longer term. In the absence of such discussion, there has been a lot of marketing copy and a lot of hype surrounding neural MT, much of which has been taken out of context and made more palatable for mainstream consumption. There is no better example of this than what stemmed from the research paper published by Google in September 2016. This was a 23-page academic paper that detailed their approach, experimental setup and results, with full context. A line in the conclusions stated, in a fairly conservative manner, that “In some cases human and GNMT [Google neural MT] translations are nearly indistinguishable on the relatively simplistic and isolated sentences sampled from Wikipedia and news articles for this experiment.”
In the mainstream coverage, this was sensationalized in articles with headlines such as “New service translates almost as well as humans can” and “AI system is approaching human-level accuracy,” which was not the intent of the authors. There was a collective groan from MT developers around the world who have heard such grand proclamations in the past and have been working feverishly to manage expectations when it comes to the strengths and weaknesses of the technology.
That being said, while it is clear that we are at the peak of a hype cycle when it comes to neural MT, there is justifiable cause for optimism. To understand what exactly the fuss is all about, we need to take a brief look back at the history of MT research and development to see where neural MT fits in context.
A rollercoaster history ride
Things kicked off as early as the mid-1950s, when a collaboration between IBM and Georgetown University produced the first rule-based MT system, translating between Russian and English (Figure 1). The results were so positive that it was declared that the machine translation problem would be solved “within five, perhaps three years.” Obviously, this turned out not to be true, and it took quite a while for MT to come out of the doldrums, helped along by some well-funded projects in Europe (Eurotra) and North America (METEO).
In the early 1990s we had our first paradigm shift from rule-based to statistical MT when researchers applied purely statistical, data-driven approaches to the task of translation. While initially met with skepticism, this quickly became the state of the art and has been the approach that researchers and developers have been building upon for the last 20 years — until now.
In 2014, neural MT was a fringe research topic. While neural networks are not a new concept, applying them to machine translation is, and doing so requires significantly more compute power than commodity processors could provide. Once this changed with the advent of GPUs (graphics processing units, which can be up to 100 times more powerful than regular processors for this kind of workload), we very quickly started to see impressive results with neural MT.
The excitement comes from the fact that this is another paradigm shift in how we do MT. Developers have spent the last 20+ years refining statistical MT for specific languages and use cases, incorporating linguistic information, terminology and many other approaches. Neural MT, as an out-of-the-box technology with limited refinement, has in a large number of cases achieved results comparable to, and sometimes better than, the existing state of the art. We are still in the very early days of this paradigm and need to realize that we don’t have a silver bullet on our hands, but that in itself is where the excitement is coming from: there are so many things that we simply have not yet had the time to try from a development perspective. With statistical MT, the last number of years have involved iterative improvements, focusing very closely on specific usage scenarios and addressing detailed language-specific issues. With neural MT, we have a blank canvas, and our starting point is already quite strong in terms of quality. That’s what is exciting.
Technology in its infancy
Considering that it takes a number of months from start to finish to come up with new research ideas, implement them, and design and carry out evaluations, neural MT is very much in its infancy given that people have only been working on it for a little over two years. Because of this, there are still a lot of unanswered questions and unknowns.
In academic circles, almost all research has been on the general case. In statistical MT, people have written whole dissertations addressing specific grammatical issues (such as noun phrases) in particular language pairs (such as the translation of prepositions from English to Chinese). With neural MT, work has been almost completely language- and use-case-independent. There’s an argument that we might not need to go so deep on a particular language because the neural network will “take care of it,” but again, this has not yet been verified.
Even though there is already research and development interest in the industry in such a new concept, we are still in such early days that many practical considerations and issues that have long been addressed in existing approaches to MT have not yet been comprehensively covered in neural MT.
This includes the ability to apply customer-specific terminology and to generally guide the translation of certain terms. Neural MT systems are notoriously poor at handling unknown words, which are commonplace when translating real-world content. Relatedly, research to date has been carried out using standardized data for MT system training and testing, which is generally quite clean and sanitized. As we all know only too well, we rarely have the luxury of working with content like this in practice. Neural MT has yet to be put to the test on tagged or marked-up content, strings and other “nasties” that will need to be addressed in due course.
Despite the fact that we are in the early days of neural MT, there is clearly cause for optimism based on initial performance. However, much of the evaluation that we’ve been exposed to thus far has been anecdotal: people translating chunks of text and noting, “this looks better than I’ve seen before.” Fortunately, there have been more comprehensive quality assessments carried out across a number of universities working in the field of language technology.
These assessments have broadly shown the same trend: neural MT can be very good, with results comparable to, if not better than, existing approaches to MT in many cases. They have also shone a little more light on exactly where neural MT is doing better. For instance, with languages that have traditionally proved harder for MT, such as Japanese, Korean and Arabic, neural MT is showing very promising results. These languages share common traits, among them grammatical complexity and rich inflection, and the neural networks are doing a good job of generalizing over these issues to generate more accurate, fluent output.
On the other hand, for languages that are “easier” for MT, where existing approaches can already perform to a very high level, the improvements from neural MT are much less stark, if they’re there at all. Where there is a lot of room for improvement, neural MT is delivering it; but the higher the initial quality bar, the less impact neural MT is having at this point in time.
These assessments have also highlighted areas where neural MT still needs work, and perhaps falls down where existing MT is strong, particularly as relates to the handling of unknown words and the application of terminology.
As we pointed out earlier, though, these assessments have been academic in nature, meaning they’ve focused on gathering broad findings from general data. Whether these results will hold for specific industry use cases or on “real” data remains to be seen.
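For context, academic assessments like these typically combine human judgments with automatic metrics such as BLEU, which scores MT output by its n-gram overlap with a reference translation. The following is a minimal, smoothed sentence-level sketch for illustration only; real evaluations use established toolkits and corpus-level statistics rather than this simplified version:

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Smoothed sentence-level BLEU: geometric mean of clipped n-gram
    precisions (with add-one smoothing) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_grams = ngram_counts(cand, n)
        ref_grams = ngram_counts(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_grams[g]) for g, c in cand_grams.items())
        total = sum(cand_grams.values())
        log_precision_sum += math.log((matches + 1) / (total + 1))
    # The brevity penalty discourages translations shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_precision_sum / max_n)

# An identical sentence scores 1.0; a poor candidate scores much lower.
perfect = bleu("the cat sat on the mat", "the cat sat on the mat")
poor = bleu("a dog stood", "the cat sat on the mat")
```

Metrics of this kind are cheap to run at scale, which is why they dominate academic comparisons, but they only measure surface overlap with a single reference, which is exactly why the human evaluations described above remain essential.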
Testing and comparing
In order to try to better understand where neural MT might potentially have an impact in some of our core areas of business in the short term, our company partnered with the ADAPT Research Centre at Dublin City University to build neural MT engines and carry out a more practical comparative assessment.
We took one of our more mature production engines, a Chinese to English MT engine tuned for chemical patent translation, and compared its output to that of a neural MT engine built using exactly the same training data, so that we had an apples to apples comparison. The results were quite interesting.
At first glance, the neural MT engine performed quite well out of the box, producing equal or better translations than the existing production system about 40% of the time, according to human judgments. When we dug a little deeper, and carried out a qualitative evaluation of the output, we learned a lot more.
When translating the patent titles, which are on average eight words in length, the neural MT engine was better in 53% of the cases. However, as the sentences grew longer in other parts of the document, performance dropped rapidly and the existing production engine was better in almost 70% of the cases (Figure 2). Looking at the relative strengths and weaknesses of both engines in terms of output quality, we saw some strong trends emerging.
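Breaking human judgments down by sentence length, as in Figure 2, is straightforward once the evaluation records are collected. A minimal sketch follows; the judgment data here is invented for illustration and does not reproduce our actual evaluation records:

```python
# Hypothetical human-evaluation records: (source sentence length in words,
# winning engine). The numbers are invented for illustration.
judgments = [
    (8, "neural"), (7, "neural"), (9, "production"),
    (18, "production"), (22, "neural"), (25, "production"),
    (34, "production"), (41, "production"), (45, "production"),
]

def win_rate_by_length(judgments, edges=(0, 10, 30, 100)):
    """Group judgments into sentence-length buckets and compute each
    engine's share of wins within every bucket."""
    rates = {}
    for lo, hi in zip(edges, edges[1:]):
        winners = [w for length, w in judgments if lo <= length < hi]
        if winners:
            rates[f"{lo}-{hi} words"] = {
                engine: winners.count(engine) / len(winners)
                for engine in set(winners)
            }
    return rates

rates = win_rate_by_length(judgments)
```

Bucketing by length is what exposed the trend described above: the neural engine wins on short segments like patent titles, while the production engine dominates as sentences grow longer.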
The production engine was better at producing perfect output due to its refinement over time. The accuracy of terminology was generally better, which can be attributed to the fact that we are able to directly apply glossaries when needed. However, it suffered in the same places MT always suffers, in that sometimes the output was clumsy and the sentence structure was awkward.
As for the neural MT engine, it was better in the areas we’ve seen validated in the academic experiments: fluency was good, and agreement between nouns and adjectives, for example, was more accurate. However, it failed in peculiar ways, such as randomly leaving out large chunks of sentences in the translation (Figure 3).
What these findings serve to show is that, while neural MT is not a silver bullet today in terms of all-round performance and production, there’s a lot of promise and we have a clearer picture as to where to focus development and improvements. Part of the challenge facing us now is the why and the how. Why is it making these mistakes, and how do we fix it?
Old and new challenges
As strange as it sounds, to a large extent we don’t know exactly why neural MT is as good as it is. For the same reasons, we don’t quite know why it’s bad in some cases, or how to go about addressing those cases directly. Things will certainly become clearer over time as researchers continue to experiment, but for now we’re essentially dealing with a barely transparent black box.
This makes leveraging neural MT in the industry more of a challenge. The most effective MT is adaptable to specific business needs, such as in-domain data and terminology, and customization for specific use cases and languages will always lead to better performance. We must also be able to directly address specific issues raised with MT output. Production performance must be addressed too, given the increased time and cost needed to build neural MT engines at scale.
Many of these challenges were already solved with existing approaches to MT, and their case now needs to be reopened. Generally speaking, these questions are answerable and the problems solvable; it is just a matter of when.
While there are new challenges presented or reintroduced by neural MT, there are still a number of challenges related to machine translation in general that haven’t gone away:
1 Data. We still need data to train neural MT engines, and arguably they are more data-hungry than statistical MT. There may be approaches down the line that will be able to address this (“zero-shot” translation) but for now, lesser-resourced languages will still be left on the back burner.
2 Evaluation. We’re still dealing with MT output so we still need to evaluate the output in terms of how fit for purpose it is for various use cases.
3 Pricing. How do we charge or pay for post-editing in such a way that fairly compensates all stakeholders in the supply chain? Even with neural MT, that’s still the same question as before.
The neural frontier
What does this all mean for you — the buyer, the vendor, the consumer, the developer — today? In three years’ time?
The biggest impact that neural MT will have in the short term, on the languages for which it has been developed and where it has been shown to perform well, is that it’s going to raise the bar for the effectiveness of general purpose MT. Currently, there are certain language and use case combinations for which general purpose MT is fit for purpose. Beyond those, end users need customized solutions to meet their specific needs, be it terminology, style or simply more adequate output. Neural MT can address the last of these, which means that general purpose MT will be fit for purpose in more use cases than it has been to date.
Things will get really interesting over the next two to five years as we gain much more clarity into the how and why of neural MT. We will begin to see new types of hybrid MT that include neural approaches. Remember, when statistical MT came about, rule-based MT didn’t go away. The technologies ran in parallel, and still do in many cases. The same will happen with neural MT. We have already seen researchers working on neural post-editing of statistical MT output, and this trend of hybrid engines and system combination will continue.
MT and the legal industry
We’re already seeing trends toward new use cases for machine translation, aside from the “traditional” post-editing workflow. These frequently include cases where we have large volumes of content that need to be turned around quickly, such as ediscovery and cross-border litigation.
In such cases, a company and/or law firm involved in a legal dispute will need to review documents retrieved from the opposition in order to find a “smoking gun” to support their legal case. The problem is that if the opposition is based in a foreign country or is a multinational operating in multiple regions, these documents are likely to be in a different language. Moreover, in ediscovery we’re typically dealing with hundreds of thousands of documents — often mixed language and in a variety of different formats and styles — which adds a further layer of complexity.
The documents can be reviewed in the native language(s), but this option is very expensive, as it requires hiring attorneys who are fluent across multiple languages. This leaves translation as the only way to allow for native-language review, but human translation will also be very expensive and time-consuming given the volumes in question, if it is even practical in the first place.
As the review typically needs to be carried out during particular windows of time based on court dates, MT quickly becomes the only sensible option. It makes even more sense for this use case because the translation doesn’t need to be perfect — it just needs to be adequate enough to allow the attorney to determine whether the document contains information relevant to the case (mentions a specific person, product or place) and whether a “full” translation is required.
Similarly, use cases that require real-time translation and translation that is fit for a particular purpose are emerging as prime candidates for MT, including multilingual customer support, ecommerce and content that is created in a continuous delivery environment. These trends will continue upward as neural MT comes on-stream in the medium term.
Longer term, it’s obviously difficult to predict the future but there is one thing we can say for certain — neural MT will not be a replacement for human translation. This is a conversation that needs to be reframed. It’s not a case of human vs. machine, or which is better. On one hand, they are complementary approaches whereby MT can be used to aid the overall translation process, be it as segments for post-editing, to provide terminology suggestions, or to help determine which specific documents from large batches require full translation.
On the other hand, there are cases where machine translation is the most appropriate solution, or the only option. As mentioned above, these include cases where instant translation is required in real-time, or massive volumes of content need to be translated in a short space of time.
Similarly, human translation is not going away. In a sense, it feels very obvious to spell it out, but given some of the reporting on neural MT it also feels very necessary. When you’re dealing with mission-critical tasks, particularly challenging languages or content types, and importantly, cases where fully fluent and adequate translations need to be guaranteed, we will always take the human translation approach. Going back to the ediscovery use case, by the time the attorney is going to court in the United States to argue a case based on the German-language email exchange they found via MT, you can bet your bottom dollar that they’ll be going in with a certified translation.
This first generation of neural MT solutions consists of general-purpose systems, but they are clearly showing great promise and, in many cases, improvements over existing technology for the general use case. While we need to be cautious about overblowing the potential of neural MT, and particularly the timeframe in which that potential might be realized, there is without doubt great cause for optimism.
As we continue to better understand exactly how neural MT works, and improve and build upon existing approaches, we can only expect further quality improvements. Just give it time.