Unpacking the Black Box
Metrics, what are they good for?
John Tinsley is co-managing director of Iconic Translation Machines, which joined the RWS Group in 2020. He has worked with machine translation over the last 16 years in both academia and industry. In 2009, he received a PhD in Computing from Dublin City University, after which he co-founded Iconic. He now leads Iconic as a dedicated technology division of RWS Group.
Edwin Starr’s 1970 Number 1 hit “War” repeatedly asks the question, “War, what is it good for?” and the response is always “nothing.” You could ask the same question of many things — garlic crushers, Roombas, automatic evaluation metrics for machine translation (MT) — apply the same answer, and have the majority of people agree with you.
There has been a lot written about evaluation metrics and BLEU scores, and indeed this column is now another to add to the pile. However, maybe this one can serve to draw something of a line under the topic. That would be a fine thing! What it will hopefully do, at least, is act as a reasonable go-to guide that explains what metrics are out there, what exactly they do, and how they differ, and that offers some light at the end of the tunnel in the form of alternatives.
You probably already have a general idea about how these metrics work, but let’s be explicit for the avoidance of doubt. Essentially, they take some output from an MT engine and compare it to a “correct” version of the translation. This correct version is known as a reference translation and has typically (ideally) been produced by a human translator.
The comparison is based on some algorithm that looks at the differences between the MT output and the reference, i.e. how well the MT correlates with a human translation. These algorithms can range from simple ones that look at how many words the two translations have in common, to more complex ones which look at character-based n-gram similarity.
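To make the two ends of that range concrete, here is a minimal Python sketch of a word-overlap score and a character n-gram score. Neither is a faithful implementation of any published metric such as BLEU or chrF. The function names, the lowercasing, the whitespace tokenization, and the default n-gram length of 3 are all assumptions made for illustration.

```python
from collections import Counter


def word_overlap_precision(hypothesis: str, reference: str) -> float:
    """Share of MT output words that also occur in the reference.

    Counts are clipped, so a word repeated in the MT output is only
    credited as often as it appears in the reference.
    """
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum((hyp_counts & ref_counts).values())
    return matched / total


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def char_ngram_f1(hypothesis: str, reference: str, n: int = 3) -> float:
    """Harmonic mean of character n-gram precision and recall."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    mt_output = "the cat sat on the mat"
    reference = "the cat is on the mat"
    print(word_overlap_precision(mt_output, reference))
    print(char_ngram_f1(mt_output, reference))
```

The character-based version gives partial credit for near misses such as a wrong inflection, which a whole-word comparison treats as a complete mismatch.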
The benefits of these metrics are clear — they are quick to run and allow us to carry out rapid, iterative assessments of multiple MT engines. The drawbacks of these metrics, however, require their own section.
The fundamental flaws
The single biggest flaw with these metrics is the concept that MT output is compared to and judged based on a “correct” reference translation. As I’m sure most readers will agree, there’s no such thing as a single correct translation. There are often very many ways to adequately translate a given source text. Therefore, even if it is correct, if the MT differs from the reference translation, it will be penalized. As we’ll see shortly, some metrics have tried to account for this in their algorithms with varying degrees of success. Another suggested workaround for this issue is to have multiple reference translations, but that is not very practical in most scenarios and is rarely used.
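The penalty for a perfectly good paraphrase, and the effect of adding a second reference, can be seen in a small Python sketch. It uses a simple unigram F1 score and takes the best score over all references. Both choices are assumptions for illustration, as are the function names and the example sentences; real metrics each handle multiple references in their own way.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall (clipped counts)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_reference_score(hypothesis: str, references: list[str]) -> float:
    """Score against every reference and keep the most favorable result."""
    if not references:
        raise ValueError("at least one reference translation is required")
    return max(unigram_f1(hypothesis, ref) for ref in references)


# Hypothetical example: an acceptable translation worded differently
# from the first reference.
mt_output = "the meeting was postponed"
single_reference = ["the session has been delayed"]
two_references = single_reference + ["the meeting was postponed until later"]

if __name__ == "__main__":
    print(best_reference_score(mt_output, single_reference))
    print(best_reference_score(mt_output, two_references))
```

Against the single reference, the output shares only the word "the" and scores poorly despite being a reasonable translation. The second reference happens to use the same wording, so the score rises sharply, which is the whole appeal of multiple references and also the reason they are costly to produce.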
The next biggest flaw with these metrics is that the scores they produce have very little meaning in absolute terms. So your Russian to English MT engine has a BLEU score of 45. What does that mean? Absolutely nothing, literally. Scores might be informative for a relative comparison: if another Russian to English MT engine tested on the exact same documents had a BLEU score of 41, we could say with some degree of certainty that the first engine is probably better than the second one. But we cannot say whether either of them is actually any good. Beware anyone trying to tell (sell) you otherwise!
Scores generally aren’t comparable across different languages or on different test documents either, so their application is quite narrow. Even when you try to use common sense, they can confound. For example, you might think the higher the score the better, and you’d be correct. However, experience tells us that a score that is too high is suspicious in itself, and may mean a mistake was made during testing!
These flaws are openly acknowledged by developers and users, and there has been a concerted effort to refine and improve the metrics over the past 15 years. This has resulted in a large number of different metrics in the field, with varying degrees of adoption.
What are all these metrics and what is the difference between them? Table 1 offers a non-exhaustive (but not far from it) attempt to explain; questionable acronyms included.
Other metrics that exist but are not detailed here include GTM, AMBER, BEER, GLEU, MP4IBM1, YiSi, MTeRater, and PER.
The present and future of automated metrics
Despite their known flaws, these metrics are still used extensively, particularly the BLEU score, which even its developers acknowledged to be limited. The reason for this is really a case of sticking with the familiar in the absence of a silver bullet alternative. Not to be hypocritical, I used them back in my own research days and we still use them at RWS Iconic. Typically, practitioners will tend not to rely on a single metric, but rather calculate scores from a variety of different metrics, as a kind of sanity check. However, important decisions, particularly when it comes to buyers choosing whether to use MT (and/or which MT), will additionally rely on human assessments.
Fortunately, all is not lost when it comes to automated metrics. There is a lot of R&D effort spent on trying to create new and better metrics. As part of the annual WMT Conference, there has been a shared task related to new metric development running since 2008, with the next edition in November this year. This task focuses on evaluating the evaluation metrics themselves, by seeing how well they correlate with human assessments.
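As a rough illustration of what "evaluating the evaluation metrics" involves, the Python sketch below computes a Pearson correlation between a metric's scores and human scores for the same set of translations. The choice of Pearson correlation is an assumption made here for simplicity, and the numbers are invented purely for demonstration. They are not results from any real evaluation.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / math.sqrt(var_x * var_y)


# Invented scores for five translations (illustrative only).
metric_scores = [0.31, 0.45, 0.52, 0.60, 0.72]
human_scores = [2.0, 3.5, 3.0, 4.0, 4.5]

if __name__ == "__main__":
    print(pearson(metric_scores, human_scores))
```

A value close to 1 would suggest the metric ranks translations much as people do, while a value near 0 would suggest its scores say little about human judgments of quality.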
The oracle of evaluation, however, is quality estimation. A perpetual hot topic in MT, quality estimation aims to predict the accuracy of machine translations without needing to compare them to reference translations. Not only is this very useful as a product feature, it would be ideal as an absolute evaluation metric.
This will no doubt be an upcoming topic in these pages, and research in this area is ongoing to make this a reality. Until then, use automated metrics in the wild with caution! Correctly interpreting what they are telling you takes a dollop of experience along with a serving of intuition. Without this, a little knowledge can be a dangerous thing.