A report from the front line of NMT

Before you read another word, I already know what you’re thinking: “Not another article on neural machine translation!” Don’t worry — a lot of background information on the technology and its status in the industry was covered extensively in previous articles. Instead, this article is going to cover some different ground by deep-diving into specific cases where neural machine translation (NMT) has been implemented in practice, giving the clearest picture yet as to where and when it is proving to be most effective from a practical perspective.

Early but usable

It is worth briefly reiterating where NMT is in terms of its stage of development. Using the analogy of the ripeness of a banana, we can say that NMT is edible but still quite green. That is to say it certainly can be used right now, but it might not be to everyone’s taste, nor will it necessarily be appropriate for all possible use cases.

The biggest impact NMT has had in its relatively short existence is that it has raised the bar for the effectiveness of general purpose MT. In the universe of use cases for general purpose MT, there have been certain languages, industries and content types for which it has worked relatively well, such as Spanish news content online. On the flip side, there are other cases, such as Korean patent information, where even highly customized MT has struggled, never mind general purpose MT.

What NMT has done in the short term is raise the bar for what general purpose MT is suitable for. The experience of many casual users of MT has been that it has improved to the extent that it now works in instances where it didn’t work before. This has extended to certain commercial cases too. It is quite the achievement for an advance in technology to move the bar so significantly in one fell swoop!

That being said, there are still a large number of use cases that remain above the bar. In these cases, what are the options? In the past, the answer was customized MT, which essentially meant statistical or hybrid approaches. Now, however, we have reached the stage where customized NMT can be added into the mix. This — despite being relatively uncharted territory — has become an intriguing new tool in the arsenal of MT developers.

Before diving into the detail of practical case studies involving NMT, let us ask the question, what exactly would we like to learn? We already know that machine translation is not a one-size-fits-all technology, so we would like to understand in what situations the new kid on the block — NMT — is going to be the best option. Independent of the flavor of MT, given that the bar for general purpose MT has been raised, it is also very informative to find out when customized solutions are required rather than being able to revert to more readily available general solutions.

NMT in the wild

The following case studies introduce a variety of practical deployments of NMT across three distinct projects, covering seven different translation directions, three different content types and two different use cases. The results were enlightening.

Case Study 1: Chemical patents, Asian languages

This project involved the translation of chemical patent titles and abstracts from Chinese, Japanese and Korean into English for direct publishing of the MT output for information purposes. The evaluation criteria were strictly enforced: all translations were ranked for “Adequacy” on a scale of 1-5, defined as follows:

1. Unusable

2. Poor

3. Adequate

4. Good

5. Excellent

In terms of the acceptance criteria, a maximum of 10% of translations could be rated as poor, and there could be no unusable translations.
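The acceptance criteria above amount to a simple check over a batch of adequacy ratings. As a minimal sketch (the function name and its interface are illustrative, not from the project itself):

```python
def meets_acceptance(ratings, max_poor_pct=10.0):
    """Check a batch of 1-5 adequacy ratings against the acceptance
    criteria described above: no unusable (1) translations, and at
    most max_poor_pct percent of translations rated poor (2)."""
    if not ratings:
        return False
    has_unusable = 1 in ratings
    poor_pct = 100.0 * ratings.count(2) / len(ratings)
    return not has_unusable and poor_pct <= max_poor_pct

print(meets_acceptance([5, 4, 3, 5, 4]))        # no poor or unusable -> True
print(meets_acceptance([5, 4, 2, 2, 2, 5, 4]))  # ~43% poor -> False
print(meets_acceptance([5, 5, 1, 4]))           # one unusable -> False
```

Note that a single unusable translation fails the whole batch, which is exactly what trips up the Korean NMT engine later in this case study.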

From a practical perspective, these languages are particularly interesting for a variety of reasons. Before NMT was an option in the project, Chinese was already in production with a relatively mature hybrid engine. Japanese was in development, although meeting the acceptance criteria was a struggle, particularly for abstracts. Korean had been tested on multiple occasions and failed to meet the acceptance criteria. Therefore, the opportunity existed to see whether NMT could a) improve on existing high-quality solutions, and b) solve a previously insurmountable problem.

Because of the nature of the content, distinct systems were used to translate the titles and the abstracts. This is because the titles are typically quite short and written in a telegraphic style, whereas the abstracts tend to have very long, technically complex sentences. For each language pair, NMT and non-NMT engines (referred to as hybrid from now on for simplicity) were developed for both titles and abstracts under the same conditions, using the same training data, which allowed for an apples-to-apples comparison. Bilingual chemists were employed as linguists to determine whether the translations were fit for purpose, judging the titles and abstracts independently.

For Chinese (Figure 1), both technologies were ranked as acceptable for titles, but only the hybrid MT met the criteria for abstracts. On the basis of these results, the hybrid MT was chosen for production. This is still common practice as the technology is more of a known quantity and developers have greater ability to effect change where needed.

In the case of Japanese (Figure 2), the inverse was found in the evaluation results. Again, both technologies were ranked as acceptable for titles, but this time only the NMT met the criteria for abstracts. From a production perspective, it was decided to use the hybrid MT for titles for the same reason it was used for Chinese. However, for the abstract translation, the NMT engine was used, representing a significant win for NMT with a complex language and content type. We will comment on the nature of this dual approach later in the article.

Korean was a very interesting case as it was known in advance that the hybrid MT solution did not meet the acceptance criteria, so this case represented a real opportunity for NMT to have a big impact. When assessed by the linguists, the NMT engine met all of the criteria except one: the presence of unusable translations (Figure 3).

Hybrid MT had only 21% of translations ranked as excellent, with 35% ranked as poor, missing the target by a wide margin. On the other hand, NMT showed significant improvement over these results with 55% of translations ranked as excellent and only 8% as poor. However, 5% of the translations were rated as unusable.

This scenario of extremes is encountered frequently by practitioners of NMT. When the translations are good, they are often better than before, but there is a higher frequency of peculiar translations which, in this particular case, rendered the NMT unfit for production.

This is a shortcoming of the young technology that needs to be overcome before it is universally viable. Later in this article, we will comment further on the nature of these issues and the steps that can be taken to overcome them in the short term.

Case Study 2: Post-editing Japanese patents

In contrast to Case Study 1, this project involved the translation of general patents from Japanese into English for post-editing. The goal here was to see the impact that customizing NMT could have over other approaches, in particular general purpose NMT.

The “customization” comes from the fact that an NMT engine, trained on general patent data, was supplemented with additional training data from the client, which represented around 5% of the total training set.

Four different engines were compared based on automated BLEU scores:

1. a customized SMT engine, built with general and client data

2. an online NMT engine (GNMT)

3. a general NMT engine (no client data)

4. a customized NMT engine, built with general and client data

The first thing we see from the results (Figure 4) is that NMT again outperforms SMT in all cases for Japanese to English, validating our findings for this language combination from Case Study 1. Next, we see an illustration of how high the bar has been raised by general purpose MT with GNMT significantly outperforming an open source general NMT engine. However, the critical finding here is that customization — even with just 5% of client-specific data — leads to a 44% increase in BLEU score over the general NMT engine (green vs. yellow) and an 18% increase over the GNMT engine (green vs. blue).
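The 44% and 18% figures above are relative increases in BLEU score. As a quick illustration of how such gains are computed (the BLEU values below are invented for the example and are not the actual Figure 4 scores):

```python
def relative_gain(new_score, baseline_score):
    """Percentage improvement of one BLEU score over another."""
    return (new_score - baseline_score) / baseline_score * 100

# Illustrative values only: a customized engine at 36.0 BLEU shows a
# 44% gain over a general engine at 25.0, and an 18% gain over an
# online engine at 30.5.
print(round(relative_gain(36.0, 25.0)))  # -> 44
print(round(relative_gain(36.0, 30.5)))  # -> 18
```

This also shows why a relative gain can look dramatic when the baseline is low: the same absolute improvement yields a much larger percentage against the weaker engine.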

Human assessments validated these findings, but highlighted an even higher presence of “unusable” translations in the NMT output — in the region of 10-12% of the segments.

However, one implication of the post-editing use case is that, unlike the unusable Korean segments in Case Study 1, which caused the output to fail the acceptance criteria, unusable translations do not necessarily fail the output here. Instead, the bad segments can be easily identified by the translator (because it is very evident they are bad), who can then simply ignore the MT and translate as normal. Another option is for the MT provider to flag them in advance or choose not to present them as options for post-editing. Of course, the medium-term goal would still be to eliminate them totally.

Case Study 3: Technical content, various languages

In this final case study, the project involved the translation of a range of IT-related content from English into French, German and Hindi. The goal here was to ask the broader question of how customized NMT compares to existing approaches across a variety of quite distinct languages. We know that SMT has already set a high bar for practical performance with languages like French, but for German and Hindi there is perhaps more room for improvement.

For each language pair, both SMT and NMT engines were compared, and these were built using exactly the same training data, again to ensure an apples-to-apples comparison. GNMT was also included in the assessment as a sanity check, and as a baseline for general purpose MT. Figure 5 illustrates the automatic BLEU scores, and extensive human evaluations were also carried out.

For French, customized NMT significantly outperformed the general purpose MT, again confirming the findings from the previous case studies. However, in this case the SMT engine was the best performing engine by some margin. This was further validated by the human evaluations, where the SMT engine was chosen as best in 78% of cases. French is a very well-resourced language in terms of linguistic resources and technology, and the bar for performance is already set quite high, so this was not a particularly surprising finding.

For German, the trend in the automatic scores was similar. Again, customized NMT outperformed the general purpose MT, but overall the SMT engine achieved the best BLEU score. This finding was somewhat of a surprise, as German is one of the languages for which NMT has been reaching new levels of performance, at least in the academic literature. The surprise was justified: the human evaluations preferred the NMT output in more than 80% of cases. This only goes to emphasize how cautious we need to be with automatic evaluation metrics — particularly with a new technology like NMT — and how important it is to incorporate human assessments when judging whether translations are fit for purpose.

Finally, for Hindi, the results were quite different. Customized NMT was by far the preferred approach according to both automatic and human assessments, where it was selected as the best translation in 75% of cases. This is a very welcome finding, as Hindi has traditionally been a challenging language for MT, and the advent of neural technology opens a new door for use cases involving Hindi, a language spoken by more than 180 million people globally.

The good, the bad and the ugly

Based on the outcome of these case studies, coupled with other findings in the field, a clearer picture is finally starting to emerge. Simply put, we are frequently seeing more fluent translations with (customized) NMT. Even when it might not be the best technology overall, there are certain phrases or constructs (such as number and gender agreement) that can now be translated well more often than not.

From a developer and end-user perspective, probably the most encouraging finding is that we now have solutions for certain use cases that were simply not practicable in the past. We have seen repeated instances of this for Japanese and Korean, and now for Hindi and German as well. We might hope that this can be extended to longer-tail languages that do not typically get as much attention. It might soon be their time to come to the forefront.

Conversely, in an effort to manage expectations, it is also clear that we don’t yet have a silver bullet in our hands. We have seen in these case studies that existing approaches to MT were preferred for Chinese and French. Additionally, when things are close between the various technologies, there will still be a tendency to revert to the tried and trusted approaches where, from a developer perspective, there is still a lot more control over effecting change to the translation output.

There are then two other issues with NMT that hark back to the “unusable” translations from Case Studies 1 and 2. In the first instance, we have “over-generation”, the peculiar repetition of certain phrases in the MT output. In the second, we have “under-generation”, essentially incomplete translations that are missing chunks of the source. It is not yet 100% apparent what the cause of these issues is, but that is the nature of bleeding-edge technology!
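Both failure modes can be caught with crude surface heuristics, which is one way the "flag them in advance" option from Case Study 2 could work in practice. The thresholds below are assumptions for illustration, not values from any of the projects described:

```python
from collections import Counter

def looks_overgenerated(tokens, n=3, max_count=2):
    """Crude over-generation check: flag output in which some n-gram
    repeats suspiciously often (e.g. the same phrase looping)."""
    if len(tokens) < n:
        return False
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(ngrams.values()) > max_count

def looks_undergenerated(src_tokens, mt_tokens, min_ratio=0.5):
    """Crude under-generation check: flag translations far shorter than
    the source. The 0.5 length ratio is an assumed threshold and would
    need tuning per language pair."""
    return len(mt_tokens) < min_ratio * len(src_tokens)

print(looks_overgenerated("the method of the method of the method of".split()))  # True
print(looks_overgenerated("a perfectly normal sentence with no repeats".split()))  # False
```

Heuristics like these cannot explain why the engine misbehaved, but they are cheap enough to run on every segment before deciding whether to show it to a post-editor.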

What can we conclude from this?

Where do these findings leave us? Another useful analogy here is to equate NMT to the electric car. No one is going to dispute the advantages of electric cars. They are very clean and eco-friendly, but they have issues, namely cost and range. Similarly, NMT is delivering extremely encouraging and useful results for a variety of use cases, new and old, but suffers from seemingly random, unpredictable issues at times.

Therefore, the key to delivering practical applications of NMT in the short term is similar to that of the electric car: hybrid solutions.

In Case Study 1 we saw an example of this where, for Japanese translation, a mix of MT technologies was used for different sections of the documents (titles and abstracts). A similar model can be deployed for engines producing unusable segments — these segments can be identified and translated using an alternative MT engine thus allowing the user to reap the benefits of NMT without the known drawbacks.
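The dual-engine model described above can be sketched as a simple routing function. Everything here is hypothetical (the engine interfaces, the `is_unusable` detector and the function name are illustrative): titles go to the hybrid engine, abstracts go to NMT, and flagged segments fall back to the hybrid engine.

```python
def translate_document(title, abstract_segments, hybrid_engine, nmt_engine,
                       is_unusable):
    """Route titles to the hybrid engine and abstract segments to NMT,
    falling back to the hybrid engine for segments flagged as unusable."""
    translated_title = hybrid_engine(title)
    translated_abstract = []
    for segment in abstract_segments:
        candidate = nmt_engine(segment)
        if is_unusable(segment, candidate):
            candidate = hybrid_engine(segment)  # fall back to the known quantity
        translated_abstract.append(candidate)
    return translated_title, translated_abstract
```

The design choice mirrors the Japanese patent deployment: the more predictable hybrid engine handles the cases where it is known to be adequate, while NMT is used where it demonstrably wins, with a safety net for its failure modes.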

It was predicted earlier in the year that the short term for NMT — rather than delivering on the overblown promise of “human-like” quality — would bring clarity on where and when it would start to make an impact. The “where” is emerging, but the “when” is certainly now.