MT application to localization of life sciences

Ireland, my country of adoption, has a large pharmaceutical industry dating from the 1960s. Originally created to produce active ingredients that would then be exported to other countries for packaging and introduction to the market, Irish pharmaceutical companies have evolved to offer expertise in areas like production of finished products, research and technical support.

The fact that Ireland offers tax incentives and a highly educated workforce has attracted a large number of international players. According to the IPHA (Irish Pharmaceutical Healthcare Association), “approximately 120 overseas companies have plants in Ireland including 9 of the 10 largest pharmaceutical companies in the world.”

To me this means endless opportunities for localization and, dare I say, machine translation (MT).

Getting close to convergence

I have been working in the MT industry for over 15 years, and I have seen reticence become mild curiosity, and that curiosity turn into enthusiastic embrace or mild acceptance, in almost equal measure. One way or another, it is quite clear to me (from where I sit at this very moment) that this technology is here to stay and can be used in many different ways in the localization industry. Paraphrasing TAUS, we are approaching the age of convergence.

One of the main drivers of MT adoption is the amount of data shared across the globe in digital format, as digital usually means easily accessible and localizable.

Peers and buyers in the localization industry have begun to realize that MT is good to have; sometimes because their clients require it, sometimes because their competitors offer it and are able to lower prices as a result. Often, the deadlines are simply too tight: there are not enough well-prepared translators to deal with the volume of words being produced and published today.

Specialized publications are publishing abundant articles about the technology (this very article being an example). Still, how easy is it to discern what type of content is suitable for an MT program? What about language pairs: do they all produce the same results? What are the real benefits of this technology? Is this for me? How can I make sure that I see a return on my technological investment? And the latest dilemma: what is neural MT (NMT)? Is it more expensive? Do I really need it?

The road to MT

These are some of the questions I encounter daily from localization clients. MT is an investment, not only in monetary terms but also from the point of view of logistics (think of connectors to the various computer-aided translation tools, translation memory systems and content management tools). And above all, it is an emotional investment.

When a localization company embarks on the adoption of MT technology, buy-in from all departments is required if deployment is to be smooth and successful. Production teams need to understand at which step in the workflow MT is incorporated, vendor managers need to know how to pay for the post-edited word, quality managers want reassurance that quality levels are not going to drop, and finally, translators (future post-editors) need to be shown that MT can be a productive tool, just like translation memory (TM) is.

For experienced translators, the change to post-editing can be emotional and sometimes disruptive; the cognitive requirements are different. Unsettling the linguistic workforce is one of the things MT buyers are most concerned about, as specialized linguists with good subject matter knowledge are difficult to come by.

When success occurs

Success occurs when the mental paradigm shifts. After a learning curve that can take a few weeks, translators begin to learn how to deal with MT output, and to assess at a glance what needs to be edited in a segment. Information about how the MT system works, and what type of issues are likely to be present in the MT output, is usually welcomed by post-editors. In my opinion this is the most difficult part of MT deployment. Once all involved parties are on board, the next steps are to analyze content types, assess the pros and cons of different technology providers, readjust rates and set up connections with the required workflows.

Going back to life sciences, and taking the example of the pharmaceutical industry, I would like to go through some cases where MT can be beneficial. To that end, we can associate some content types with localization for the pharma industry: documentation related to clinical trials, research papers, patient brochures, packaging and labeling of medicines or medical devices, marketing materials, online help and software.

There is already reuse of this kind of translated content in the form of TMs and glossaries, which means that the in-domain bilingual data required for the creation of MT systems is already available.

Given the fact that some of the pharmaceutical processes from beginning to end are lengthy — for example, testing and releasing a new drug from clinical trials to labeling, packaging and marketing — agile workflows are a must. Speed is a requirement that can be addressed by deploying an MT workflow.

There are two other important requirements. Readability is one of them, and it is very closely associated with patient information. Two examples of content that requires good readability are pharmaceutical documentation and consent forms. It is essential that this information is understood by most patients, and for that reason it needs to be easy to read.

Finally, quality is the most important requirement in this industry; understandably so, considering that a patient’s wellbeing is at stake. Quality in this context refers to the level of accuracy with which the source is reflected in the target. Related to this point, translated content that originated in machine translation is often viewed as lacking in quality. Clients have raised this concern with me on numerous occasions. Originally there was a latent mistrust in the reliability of MT output, but that mistrust seems to be slowly easing off.

The reality is that post-edited content does not necessarily contain any more errors than content that has been translated from scratch or translated with the aid of a TM. Studies put forth by Ana Guerberof Arenas and Elaine O’Curran show evidence of this.

The fact is that machine-translated content goes through the same stringent quality processes as any other translated content for this industry. In fact, the reviewers who assess the final quality of the translated content should not know whether it originated in an MT workflow or not.

Highly customized statistical machine translation (SMT) will deal with terminology requirements in an efficient way; it will also handle figures and measurements accurately, if it has been trained properly and is supported with the right automatic rules. For that reason, all content related to documentation, stock keeping unit codes, user manuals of technical devices and so on can be successfully translated using SMT systems, provided they have been trained with the relevant data and there is a post-editing process to repair the MT output afterward.
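As a minimal illustration of what such an automatic rule might look like, the sketch below implements a hypothetical quality check (not part of any particular MT system) that flags segments where the figures in the MT output do not match the figures in the source, normalizing decimal commas so that "2,5 mg" matches "2.5 mg":

```python
import re

# Matches integers and decimals, e.g. "5", "2.5", "2,5".
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def numbers_match(source: str, target: str) -> bool:
    """Return True if source and target contain the same multiset of numbers.

    Decimal commas are normalized to points so that European-locale
    figures such as "2,5 mg" match "2.5 mg".
    """
    def extract(text):
        return sorted(n.replace(",", ".") for n in NUMBER.findall(text))
    return extract(source) == extract(target)

def flag_segments(pairs):
    """Yield the index of every (source, target) pair whose figures disagree."""
    for i, (src, tgt) in enumerate(pairs):
        if not numbers_match(src, tgt):
            yield i
```

For example, a pair such as ("Store below 25 °C.", "Conservar por debajo de 52 °C.") would be flagged for the post-editor's attention, while a correct dosage translation passes silently.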

The wonders of neural

When dealing with critical content that needs to be fully understood by most people who read it, being able to grasp the meaning without difficulty becomes a priority. In this case, translation must be accurate and cannot contain any factual errors or misinformation. For this type of content, translation needs to be easily comprehensible as well, providing fluent language and avoiding unidiomatic structures that might make it difficult to be understood. And here is where NMT can make a difference.

I have been lucky enough to coordinate an evaluation study that compared NMT with SMT, as part of a research initiative led by my colleague Dimitar Shterionov, head of KantanLabs. We evaluated content translated by SMT and NMT engines trained on the same data. This was done for five different language pairs and performed by three different testers per language. The results, which were presented at the European Association for Machine Translation Conference (EAMT 2017) last May in Prague, were conclusive: all 15 testers, regardless of the target language they evaluated, scored the NMT output as the better of the two.
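A blind pairwise evaluation of this kind can be tallied very simply. The sketch below is a generic illustration (not the scoring code used in the study): it counts, per language pair, how often testers preferred each system's output, with the system labels resolved only after the test is unblinded:

```python
from collections import Counter

def tally_preferences(judgments):
    """Count per-language preferences from blind pairwise judgments.

    `judgments` is an iterable of (language, preferred_system) tuples,
    where preferred_system is e.g. "SMT" or "NMT".
    """
    totals = {}
    for language, preferred in judgments:
        totals.setdefault(language, Counter())[preferred] += 1
    return totals

def overall_winner(totals):
    """Return the system preferred most often across all languages."""
    combined = Counter()
    for counts in totals.values():
        combined.update(counts)
    return combined.most_common(1)[0][0]
```

With judgments from several testers per language pair, `overall_winner` gives the aggregate preference across the whole test.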

I was one of those testers. It was a great experience for me to see firsthand the differences between the two outputs, and even though this was a blind test and the testers could not see the origin of the segments at any given time, sometimes it was clear which translation came from SMT and which came from neural. In my case (the Spanish test), the quality of the SMT engine was very good; had I been doing a straightforward human evaluation, I would possibly have given it a 3.5 or 4 out of 5. But the quality of the NMT output was better in a subtle way. As well as being accurate, it conveyed the nuances of the language. Sometimes it was only a matter of choosing an indefinite article instead of a definite one, or a change in the structure of the sentence. In my experience, fluency greatly increased when compared with the output originating from the SMT engine.

Of course, more evaluations need to be carried out. We are already researching and looking to see more specifically where the strengths and weaknesses of each system lie.

The joy of gisting

There are other uses of MT, and one that is gaining prevalence is gisting. Gisting, in MT terms, is the act of reading raw MT output (output that has not been post-edited or altered by a person) with the purpose of getting the central idea of a piece of content.

Raw MT is already being used by large ecommerce platforms to translate product descriptions, and by travel sites to translate user reviews. It is a good solution for making user-generated content available in other languages, as, due to the large volume and highly perishable nature of the content, it would not have been translated otherwise.

If we go back to our subject matter, life sciences, MT becomes highly useful for the dissemination of research papers. English is the lingua franca when it comes to writing and publishing research papers. Such papers are stored in large searchable medical databases like PubMed. The abstracts of those papers are usually written in English, but sometimes the main text is not available in that language. The use of MT for gisting can be of great help when medical professionals search for relevant research in a subject of interest.
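As a sketch of how gisting could be wired into such a search workflow: the record format and the `translate` callable below are hypothetical stand-ins for whatever database export and raw MT service are actually used, and the output is intended for gisting only, with no post-editing applied.

```python
def gist(records, query, translate, target_lang="en"):
    """Return abstracts matching `query`, machine-translating any that are
    not already in the target language.

    `records` are dicts with "abstract" and "lang" keys; `translate` is any
    callable (text, src_lang, tgt_lang) -> str backed by a raw MT engine.
    """
    results = []
    for rec in records:
        text = rec["abstract"]
        if rec["lang"] != target_lang:
            # Raw MT output: good enough to grasp the central idea.
            text = translate(text, rec["lang"], target_lang)
        if query.lower() in text.lower():
            results.append(text)
    return results
```

A researcher searching for "clinical trial" would then see matches from abstracts in any language, with the non-English ones rendered as unedited MT output.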

At the end of the day, MT is a productivity tool, used to make more content available in more languages.