Unpacking the Black Box

It’s my Parity and I’ll Cry if I Want To

A tour of the human parity party in machine translation


John Tinsley

John Tinsley is co-managing director of Iconic Translation Machines, which joined the RWS Group in 2020. He has worked with machine translation over the last 16 years in both academia and industry. In 2009, he received a PhD in Computing from Dublin City University, after which he co-founded Iconic. He now leads Iconic as a dedicated technology division of RWS Group.

When it comes to machine translation (MT), the question of quality and how to effectively carry out evaluations has always been near the top of the agenda. The topic came even further into focus with the advent of neural machine translation, and that’s when we were introduced to a new phrase: human parity.

An MT engine that has achieved human parity means, ostensibly, that it can produce translations of equal quality to those produced by a human translator.

These two simple words first raised eyebrows, and then raised the ire of (most) MT practitioners and translators. Those who claimed to have developed MT engines that achieved human parity gained headlines in the media, but were widely pilloried by parity poopers within the research and development community.

This fallout did not make the human parity claims go away, but it did inspire a lot of debate and research into what exactly it means to achieve human parity, and into how best to conduct these assessments to add more robustness, diligence, and consistency to the process.

What is human parity?

At this stage, you might rightly be asking, what does human parity actually mean? A few definitions have been proffered: a human judging the output from an MT engine to be equivalent to that produced by another human. Or perhaps a set of machine translations achieving the same score as a set of human translations using some scoring mechanism. More loosely, it means that MT output is statistically indistinguishable from human output. To summarize in simpler terms, it refers to machine translations that are as good as human translations.
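To make "statistically indistinguishable" a little more concrete, here is a minimal sketch of one way such a comparison could be run: a paired sign-flip permutation test on per-segment quality scores. The function name, the scoring scale, and the choice of test are all assumptions for illustration, not the method used in any published parity claim.

```python
import random


def paired_permutation_test(mt_scores, human_scores, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-segment scores.

    Returns the observed mean difference (MT minus human) and an
    approximate p-value for the null hypothesis of no difference.
    Hypothetical helper for illustration only.
    """
    if len(mt_scores) != len(human_scores):
        raise ValueError("score lists must be paired, one pair per segment")
    diffs = [m - h for m, h in zip(mt_scores, human_scores)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two labels in each pair are
        # exchangeable, so the sign of each difference is flipped at random.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(resampled) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up scores on an assumed 0-100 scale, one pair per source segment.
mt = [72, 85, 90, 64, 78, 88, 70, 81]
human = [75, 84, 93, 70, 80, 86, 77, 85]
difference, p = paired_permutation_test(mt, human)
print(f"mean difference: {difference:.2f}, p-value: {p:.3f}")
```

One caveat worth keeping in mind: a large p-value only means the test failed to detect a difference on this data. With a small or easy test set, that is weak evidence of parity.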

We’re already on very shaky ground when trying to discuss what constitutes a “good” translation, so let’s park this topic for the time being and take a look at the evaluation process.

Raising issues with the claims

Many questions were asked when the first human parity claims were made: what was the test data used in the assessments? Who produced the human translations being compared? Who is making the judgments and what are their credentials?

Researchers found that the test sets had already been translated, and then back-translated into the original language. This means the content of the test sets would be simplified from a linguistic perspective, and thus easier to machine translate.


Sentences from the test sets were also evaluated independently from one another. When assessing a document, one mistranslated sentence could render the whole translation unfit. Breaking down documents and evaluating at the sentence or segment level has been shown to somewhat level the playing field when it comes to comparing human and machine translations.
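The effect of the unit of evaluation can be illustrated with a small sketch. It assumes each sentence has simply been judged acceptable or not, and it applies a deliberately strict, hypothetical rule that a document passes only if every sentence in it passes. The data is invented purely to show the arithmetic.

```python
def sentence_level_accuracy(documents):
    """Share of acceptable sentences, ignoring document boundaries."""
    judgements = [ok for doc in documents for ok in doc]
    return sum(judgements) / len(judgements)


def document_level_accuracy(documents):
    """Share of documents in which every sentence was judged acceptable."""
    return sum(all(doc) for doc in documents) / len(documents)


# Invented judgements: three documents of ten sentences each.
# True means the sentence was judged acceptable.
human_docs = [[True] * 10, [True] * 10, [True] * 9 + [False]]
mt_docs = [[True] * 9 + [False] for _ in range(3)]

for name, docs in (("human", human_docs), ("MT", mt_docs)):
    print(
        f"{name}: sentence-level {sentence_level_accuracy(docs):.2f}, "
        f"document-level {document_level_accuracy(docs):.2f}"
    )
```

In this toy data the two sets of translations look close when sentences are pooled, but far apart once a single bad sentence is allowed to sink its document.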

The assessments were also crowdsourced, as opposed to necessarily being carried out by professional linguists. A lot of previous research has shown crowd assessments to be much more tolerant of issues in the translated output.

Therefore, while human parity may have been achieved in the particular assessment in question, the dice may have been somewhat loaded. However, what’s done is done. The question is, how can we do better in future?

Refining the evaluation of MT

The notion of human parity inherently pits machine translation against the human translator. I fundamentally believe that is the wrong way to frame the conversation, but that’s a topic for a future issue! Let’s step away from the idea of comparison, and look at the process of evaluation itself and how it can be improved to make it more consistent and less open to dispute.

Test data.

There’s an adage in AI that systems are only as good as the data used to train them. This applies to MT, and it also applies to the data used for evaluating translations. Ideally, test sets should contain original source material and not translated data from another language. Similarly, the reference translations (that is, the human translations against which all judgements will be made) should be good translations, and reflective of the source to the extent possible. This might seem like an obvious statement, but it is not always the case in practice.

While there will inevitably be some overhead in creating and validating test sets in this way, the ends should justify the means if we are interested in getting a true reflection of how good the translations actually are.

Test suites and context.

On the theme of test data, more recent work has proposed the creation of specialized test suites for evaluating translations.

That is to say, rather than creating test sets from a random selection of sentences, they should comprise text that exhibits a range of linguistic characteristics. These characteristics could be anything from punctuation, dates, and currency notation, to long-distance dependencies, compound nouns, and ambiguities. This is a more holistic approach to the task that can give us the comfort that, if an MT system does well in an evaluation, it has well and truly covered the bases.
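As a rough sketch of how results from such a test suite might be reported, the snippet below groups pass/fail judgements by linguistic category instead of collapsing everything into a single number. The category labels and the judgements are hypothetical, and a real suite would need many more items per category.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """Return the share of passed test items for each linguistic category.

    `results` is an iterable of (category, passed) pairs, where `passed`
    is True if the translation handled the phenomenon correctly.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


# Hypothetical judgements for a handful of test items.
results = [
    ("dates", True),
    ("dates", True),
    ("currency notation", True),
    ("currency notation", False),
    ("long-distance dependencies", False),
    ("long-distance dependencies", False),
    ("ambiguity", True),
    ("ambiguity", False),
]

for category, rate in sorted(pass_rate_by_category(results).items()):
    print(f"{category}: {rate:.0%}")
```

A per-category breakdown like this makes it harder for a strong average to hide a phenomenon the system consistently gets wrong.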

Related to this is the idea, mentioned earlier, of sentences being evaluated as part of a larger body of text rather than independently. In a perfect world, we might even have test suites of documents, but maybe we should learn to walk before we start running!


Lastly, we come to the question of who actually carries out the assessments — who are these arbiters of quality? Many of the larger scale assessments have been carried out via crowdsourcing out of necessity, though in that perfect world we talked about, assessors would be professionals — linguists, translators — who can capture those nuances inherent to translation.

Regardless of who carries out the evaluations, they should always have access to the source text and make judgements on the accuracy of the translations, as well as the fluency. Believe it or not, there are cases where assessors only have access to the target, and judgements are made based on the fluency of the output. Assessing translations using a combination of both fluency and accuracy (also known as adequacy) should be a minimum.

Looking beyond graphs

At the end of the day, there is a balance to strike between the efficacy of the process and what is practical. However, given that we are still relying heavily on flawed metrics like BLEU scores, there is clearly room for improvement.

If you’re planning on carrying out your own evaluations in the near future, make sure to consult the large body of recent work for some guidance and guidelines.

And the next time you’re reading an evaluation report, white paper, or blog post, remember to ask questions! Maybe it will help you to see beyond a pretty graph and identify cases that are built on less than solid foundations. As with MT itself, the evaluation process is never going to be perfect, but we should strive for it nonetheless!