The Unending Quest for Human Parity MT

by Kirti Vashee

The subject of “translation quality” has always been a challenging communication issue for the translation industry. It is particularly difficult to explain this in a straightforward way to an industry outsider or a customer whose primary focus is building business momentum in international markets, and who is not familiar with localization industry translation-quality-speak. 

Since every LSP claims to deliver the “best quality” or “high quality” translations, it is difficult for customers to tell the difference in this respect from one service provider to another. The quality claims of competing vendors thus essentially cancel out.

Comparing different human translations of the same source material is often an exercise in frustration, or subjective preference at best. Every sentence can have multiple correct, accurate translations, so how do we determine which is the best translation?

The industry response to this problem is colored by a localization mindset, visible in approaches like the Dynamic Quality Framework (DQF). Many consider DQF too cumbersome and overly detailed to apply to modern fast-flowing content streams; cumbersome here also means that the cost of measuring exceeds the perceived benefit of the added accuracy. While DQF can be useful in some limited localization use-case scenarios, the ability to rapidly and cost-effectively translate large volumes of DX-relevant content is increasingly a higher priority and needs a new and different view of quality monitoring. The linguistic quality of the translation does matter, but it has a lower priority than speed, cost, and digital agility.

Today, MT solutions are essential to the global enterprise mission. Increasingly, dynamic content is translated for target customers without EVER going through any post-editing modification. The business value of a translation is often defined by its utility to the consumer in a digital journey, its basic understandability, its availability on demand, and its overall CX impact, rather than by linguistic perfection. Generally, usable accuracy delivered in time matters more than perfect grammar and fluency. The phrase “good enough” is used both disparagingly and as a positive attribute for translation output that is useful to a customer even in a less than “perfect” state.

While machines do most of the translation today, this does not mean there is no role for higher value-added human translation (increasingly supported by CAT tools). If the content is a critical and high-impact communication, most of us understand that human oversight is critical to the success of the business mission. And if the translation involves finesse, nuance, and high art, it is probably best to leave the “translating” computers completely out of the picture.

Defining Machine Translation Output Quality

The MT development community has had less difficulty establishing a framework for useful comparative measurement of translation quality. Fortunately, it had assistance from NIST, which developed a methodology to compare the translation quality of multiple competing MT systems under carefully controlled evaluation protocols. NIST used a variant of the BLEU score, along with other measures of precision, recall, adequacy, and fluency, to compare different MT systems rapidly in a standardized and transparent manner.

The competitive evaluation approach works when multiple systems are compared under carefully monitored test protocols, but it becomes less useful when an individual developer announces “huge improvements” in BLEU scores, since extravagant improvement claims are easy to make and hard to validate. Independent evaluations widely cited today compare systems that may actually have been trained on the test sets, which is the equivalent of giving a student the exam answers before the formal test. Other reference-based measurements like hLepor, Meteor, and chrF are plagued by similar problems. These automated measurements are all useful, but they are unreliable indicators of absolute quality.
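
For readers who want to see what these reference-based metrics look like in practice, here is a minimal sketch using the open-source sacrebleu package; the sentence pairs are invented for illustration, and the numbers it prints are not from any real evaluation. It computes corpus-level BLEU and chrF against a single reference set.

```python
# A minimal sketch of reference-based automatic scoring, assuming the
# open-source sacrebleu package is installed (pip install sacrebleu).
# The sentence pairs below are invented purely for illustration.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = [                      # candidate MT output
    "The cat sat on the mat.",
    "He did not attend the meeting yesterday.",
]
references = [[                     # one reference translation per sentence
    "The cat was sitting on the mat.",
    "He didn't attend yesterday's meeting.",
]]

bleu = BLEU()
chrf = CHRF()

# Corpus-level scores against a single reference set; acceptable
# alternative wordings are penalized simply for differing from
# whichever reference happened to be chosen.
print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
```

A perfectly acceptable alternative phrasing scores lower simply because it differs from the reference, which is exactly why these numbers are useful for tracking a system’s progress but unreliable as measures of absolute quality.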

Best practices today suggest that a combination of multiple automated measures needs to be used together with human assessments of MT output to get a handle on the relative quality of different MT systems. This becomes difficult as soon as we start asking basic questions like:

  • What are we testing on?
  • Are we sure that these MT systems have not trained on the test data? 
  • What kinds of translators are evaluating the output?  
  • How do these evaluators determine what is better and worse when comparing multiple correct translations?

Conducting an accurate evaluation is difficult, and it is easy to draw wrong conclusions from errors in the evaluation process. However, in the last few years, several MT developers have claimed to produce human-parity MT systems, especially since the advent of neural MT. These claims are useful for creating a publicity buzz among ignorant journalists and fear amongst some translators, but they usually disappoint anybody who looks more closely.

This is the unfortunate history of MT: over-promising and underdelivering. MT promises are so often empty promises.

I challenged the first of these broad human parity claims in this blog post: The Google Neural Machine Translation Marketing Deception. A few years later, Microsoft claimed to have reached human parity with a much narrower focus, its Chinese-to-English news system, but was more restrained in its claim.

Many who are less skeptical than I am will take such a claim to mean that the MT engine can ostensibly produce translations of equal quality to those produced by a human translator. This can indeed be true on a small subset of carefully selected test material, but alas, we find it is usually not true for a broader test.

We should understand that among some MT experts, there is no deliberate intent to deceive, and it is possible to do these evaluations with enough rigor and competence to make a reasonable claim of breakthrough progress, even if it falls short of the blessed state of human parity. 

There are basically two definitions of human parity generally used to make such a claim.

  • Definition 1: If a bilingual human judges the quality of a candidate translation produced by a human to be equivalent to one produced by a machine, then the machine has achieved human parity.
  • Definition 2: If there is no statistically significant difference between human quality scores for a test set of candidate translations from a machine translation system and the scores for the corresponding human translations, then the machine has achieved human parity.
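
Definition 2 can be made concrete with a small statistical sketch. Everything below is hypothetical: the adequacy scores are simulated stand-ins for the ratings human judges might assign to the same test sentences translated once by a human and once by an MT system.

```python
# A minimal sketch of Definition 2 as a paired significance test,
# assuming SciPy is available. The ratings are hypothetical adequacy
# scores (0-100) invented for illustration, not real evaluation data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sentences = 200                   # typical small evaluation set

# Simulated scores the judges might assign to the same test sentences,
# once for the human translation and once for the MT output.
human_scores = rng.normal(loc=82, scale=10, size=n_sentences).clip(0, 100)
mt_scores = rng.normal(loc=80, scale=12, size=n_sentences).clip(0, 100)

# Paired test: each sentence is scored under both conditions.
t_stat, p_value = stats.ttest_rel(human_scores, mt_scores)

if p_value >= 0.05:
    print(f"p = {p_value:.3f}: no significant difference detected, "
          "i.e. 'parity' under Definition 2 for this test set and these judges.")
else:
    print(f"p = {p_value:.3f}: human and MT scores differ significantly.")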

Again, the devil is in the details, as the data and the people used in making the determination can vary quite dramatically. The most challenging issue is that human judges and evaluators are at the heart of the assessment process. These evaluators can vary in competence and expertise and can range from bilingual subject matter experts and professionals to low-cost crowdsourced workers who earn pennies per evaluation. The other problem is the messy, inconsistent, irrelevant, biased data underlying the assessments.

Objective, consistent human evaluation is necessary but difficult to do on a required continuous and ongoing basis. Additionally, if the underlying data used in an evaluation are fuzzy and unclear, we actually move to obfuscation and confusion rather than clarity. 

Data from The Hustle shows that only 4% of MTurk workers earn the US minimum wage.

Useful Issues to Understand 

While the parity claims can be true for a small, carefully selected sample of evaluated sentences, it is difficult to extrapolate parity to a broader range of content because it is simply not possible to evaluate machine translation output at MT scale (millions of sentences). If we cannot define what a “good translation” is for a human, how is it possible to do this for a mindless, common-sense-free machine, where instruction and direction need to be explicit and clear?

Here are some questions that help an observer understand the extent to which parity has actually been reached, or expose the deceptive marketing spin that may motivate the claims:

What was the test data used in the assessments? 

MT systems are often tested and scored on news domain data, which is the most plentiful. A broad range of different content types should be included before making a claim as extravagant as having reached human parity.

What is the quality of the reference test set?

Ideally, only expert human-created test sets should be used.

Who produced the human translations being used and compared?

The reference translations against which all judgments will be made should be “good” translations. Easily said but not so easily done.

How much data was used in the test to make the claim? 

Human assessments are often done with as few as 50 sentences, and automated scoring is rarely done with more than 2,000 sentences. Thus, drawing conclusions on how any MT system will handle the next million sentences it will process is risky, and likely to be overly optimistic.
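
To see why such small samples are risky, here is a minimal sketch, using purely simulated sentence-level scores rather than real evaluation data, of how the uncertainty around an average quality score shrinks as the evaluated sample grows.

```python
# A minimal sketch, using purely simulated sentence-level quality scores,
# of how the uncertainty around an average score shrinks with sample size.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci_width(sample, n_boot=1000, alpha=0.05):
    """Width of a percentile-bootstrap 95% confidence interval for the mean."""
    means = [rng.choice(sample, size=len(sample), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return hi - lo

# Hypothetical adequacy scores on a 0-100 scale (invented for illustration).
for n in (50, 2_000, 100_000):
    sample = rng.normal(loc=70, scale=15, size=n).clip(0, 100)
    print(f"n={n:>7}: mean={sample.mean():5.1f}, "
          f"95% CI width = {bootstrap_ci_width(sample):.1f} points")
```

With the simulated spread used here, the interval around a 50-sentence average spans several score points, often more than the improvement being claimed, while the behavior of the system on the next million sentences remains essentially unmeasured.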

Who is making the judgments and what are their credentials?

It is usually cost-prohibitive to use expert professional translators to make the judgments, and thus evaluators are often acquired on crowdsourcing platforms where evaluator and translator competence is not easily ascertained. 

Doing an evaluation properly is a significant and expensive task, but MT developers have to do this continuously while building the system. This is why BLEU and other “imperfect” automated quality scores are so widely used. These scores provide the developers with “roughly accurate” continuous feedback in a fast and cost-efficient manner, especially if they are done with care and rigor.

The Academic Response

Recently, several academic researchers provided feedback on their examination of these MT-at-human-parity claims. Their study, “A Set of Recommendations for Assessing Human–Machine Parity in Language Translation,” shows the many ways in which evaluations can go wrong and disputes the claims of Google, Microsoft, and others.

What Would Human Parity MT Look Like?

MT developers should refrain from making claims of achieving human parity until there is clear evidence that this is happening at scale; most current claims are based on laughably small samples of 100 or 200 sentences. It would be useful if developers held off on such claims until they can show all of the following:

  • A large sample (>100,000 or even 1M sentences) in which 90% or more of the output is accurate and fluent and truly looks like it was translated by a competent human.
  • The ability to catch obvious errors in the source, and possibly even correct them, before attempting to translate.
  • The ability to handle source variations with consistency and dexterity.
  • At least some nominal contextual and referential capability.

Note that these are things we would expect from an average translator. So why not from the super-duper AI machine?

Until we reach the point where all of the above is true, claims that explicitly state the key parameters below would certainly create less drama and more clarity about the true extent of the accomplishment:

  • Test set size.
  • Descriptions of source material.  
  • Who judged, scored, and compared the translations.

I am skeptical that we will achieve human parity by 2029 as some “singularity” enthusiasts have been saying for over a decade. The issues to be resolved are many and will likely take longer than most best guesses made today. Machines are unlikely to render humans obsolete any time soon. 

Recently, some in the singularity community have admitted that “language is hard,” as you can see in attempted explanations of why AI has not yet mastered translation. Language does not have binary outcomes, and there are few clear-cut, defined rules.

Michael Housman, a faculty member of Singularity University, states the need for more labeled data but notes that it is inherently difficult to assign these informative labels. “Two translators won’t even agree on whether it was translated properly or not,” he said. “Language is kind of the wild west, in terms of data.” And, as we know, data is what AI is built on.

Perhaps we need to finally admit that human-parity MT at scale is not a meaningful or achievable goal. If it is not possible to have a super-competent human translator capable of translating anything and everything with equal ease, why do we presume a machine could?

Perhaps what we really need is an MT platform that can rapidly evolve in quality with specialized human feedback. Post-editing (MTPE) today is generally NOT a positive experience for most translators, but human interaction with the machine can be a significantly better and more positive experience. Developing interactive and highly responsive MT systems that can assist, learn, and instantly improve the humdrum elements of the translation task might be a better research focus. This may be a more worthwhile goal than aiming for a God-like machine that can translate anything and everything at human parity.

Maybe we need more focus on improving man-machine interaction and finding more elegant and natural collaborative models. Getting to a point where the large majority of translators always want to use MT, because it simply makes the work easier and more efficient, is perhaps a better goal for the future of MT.

AI has had a huge impact on our lives, and it is likely we are only at the beginning of its potential future impact. However, as Rodney Brooks, the co-founder of iRobot, said in a post entitled “An Inconvenient Truth About AI”: “Just about every successful deployment of AI has either one of two expedients: It has a person somewhere in the loop, or the cost of failure, should the system blunder, is very low.”

There are many reasons to question whether bigger models, more data, and more computing will solve these challenges. Machines can only encode and decode data that is explicit and available. The unspoken, unwritten, but understood context that surrounds any communication is unlikely to be discernible even by crazy huge models like GPT-4. This also suggests that humans will remain at the center of complex, knowledge-based AI applications, even as the way humans work continues to change. The future is more likely to be about making AI a useful assistant than about replacing humans.

Kirti Vashee is the Language Technology Evangelist at Translated Srl, where he works in an MT marketing strategy and sales support role.
