Evaluating quality in translation

“The proof of the pudding is in the eating.” This old saying dates back to the seventeenth century and is widely attributed to the Spanish author Miguel de Cervantes in his world-famous novel The Ingenious Gentleman Don Quixote of la Mancha. It can be paraphrased as “you can only say that something is a success after it has been tried out.” Applying it to translations you could say: “the test of a translation is in its use.”

The changing perception of translation quality has received much attention in both academia and industry in recent years. In the past, human or publication quality was the only target for translation buyers and vendors. One translator said on a translation forum a couple of years ago: “When I translate, my main aim is to produce a document which reads as if it were originally written in English.” Though high quality is the target for most translators, some of today’s customers may want something else. Also, a translator’s work might be excellent in terms of fluency (meaning it sounds natural or intuitive), but how about the adequacy of the translation (and its fidelity to the source text) or errors made based on an error typology (such as terminology, country standards and formatting)? The translation can be of a very high quality according to some standards and be a bad translation according to others. In other words, fluency is just one side of a coin that comes with multiple sides.

Compliance versus acceptance

Today, there is an increasing appetite for a new approach to quality within the industry. Quality occurs when the customer is satisfied. As a result, translation quality evaluation needs to refocus on a number of cost-effective, practical issues. First of all, a translation is expected to fulfill certain basic criteria in order to satisfy the average user. For this reason, each evaluation project should measure the degree of compliance between translated content and a benchmark that is based on predefined (and, hopefully, in the future standardized) quality levels. These could range from publication quality, expert quality, human quality and transcreation to full post-editing, light post-editing and raw machine translation (MT) output. These quality levels, or quality types, if you will, should be specified beforehand by the customer. It adds to the confusion that many of these quality levels are undefined, vague and hard to measure.

Note that I’m now focusing only on compliance and not on acceptance. Compliance does not necessarily mean acceptance by the customer or the user. A pudding can be a perfect pudding according to certain standards, but if the person eating the pudding is not satisfied with the taste, the smell, the packaging, the price or anything else because of some personal preference, the product I deliver is simply not satisfactory. It’s a good pudding according to some criteria but not according to others. In order for it to be accepted, it also needs to fulfill additional user-specific requirements.

In the past, compliance wasn’t really a problem. Buyers paid for (and expected to receive) translations that “read as if they were originally written in the target language.” As a result, vendors could mainly focus on acceptance or special requests: deliver faster, use short sentences, use the client-specific glossary, avoid negations and so on. Today, even compliance has become an issue. How can we make sure we provide the right quality? A one-taste-fits-all approach to puddings or a one-quality-fits-all approach to translations is no longer a satisfactory model, due to changing user needs, purposes, technology and budgets.

One way of accounting for this is to ask the consumers what they like. Let them taste the pudding before we go on to produce it in large volumes. The question is: who are they? The answer could be the customer (customer feedback is certainly valuable), an undefined crowd (a large and ad hoc group of users and nonusers), a community (a more specific group of users) or a selected user group (even more specific and smaller). People, however, differ in their tastes, and you can only satisfy the majority. You need to find out about the taste of the “average” end users visiting your website, buying your product, reading your marketing text and using your software. We live in a personalized world. And while we are on the subject of puddings: through cookies, web content is tailored to our personal needs and preferences.

Living in the country of tulips, I’ve learned one thing. The bulb of a tulip is very important. It provides the storage where nutrients are retained and kept safe for the future. Just like the plants themselves, bulbs differ in form, size and quality: usually, the bigger the bulb, the more nutrients it contains and the larger the plant and the flower it produces. If you want to augment your chances of growing a nice tulip, get a big bulb. Size does matter — at least when it comes to tulips. And there is a minimum bulb size that can be sold on the market. This is measured in zift (Figure 1). This measurement is used internationally to determine the quality and the price of each bulb. And, of course, there is also an average size. An average zift size for tulips, for instance, is 11-12, which means that the circumference of the bulb is 11-12 centimeters.

There are two types of translation vendors today. The first type has a traditional view on quality. They advertise the quality level of their services with superlatives. They make sure that you know they always provide top quality translations and their services conform to all existing quality standards. They are like the flower shop selling only the biggest and most expensive bulbs. There is now a second, emerging group of translation service providers offering a progressive view on quality. They provide multiple levels or types of quality and related pricing. Very often, they use various processes and even different personnel to ensure that the targeted quality is reached in the most effective way and, consequently, within the budgetary restrictions of the client. Translations are just like tulip bulbs, aren’t they? The problem is, common standards for measuring translation quality in a reliable way are missing. There are no international metrics or benchmarks to compare different levels of quality. However, the Dynamic Quality Framework (DQF) industry standard, launched in 2011 by TAUS and co-created with over 50 companies and organizations, is definitely a candidate for becoming an international standard. Chances are that this initiative will be internationally recognized in the near future.

To draw a parallel with the zift method applied in the flower industry, we could define the minimum requirements for a translation (compliance) and add user preferences (acceptance). We could also specify an average score for translations by calculating industry averages for quality evaluation (QE) on translations passing the minimum threshold. This way, we would be able to deliver superb, average, good-enough quality and everything in between. With puddings, user preferences are expressed in the form of specifications, such as that it should be sweeter than the norm, darker or thicker. When speaking about translations, we can specify the necessary criteria for compliance by calling it an “average.” But a better way is to apply a metric (perhaps a mixture of adequacy, fluency and error typology) and specify a minimum threshold as well as an average score for that type of content. Everything above the average score is personal preference and should cost extra. Everything below qualifies for a discount. But how do we come up with the right metrics and the right benchmarking? And how do we provide thresholds for the different levels of quality, from raw MT output to publishable translation?
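
To make the idea of a minimum threshold and an average score more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the 1-to-4 scales for adequacy and fluency, the weights, the error cap and the names `Evaluation`, `composite_score` and `pricing_band` are hypothetical and are not taken from DQF or any other standard.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    adequacy: float         # 1 (none) to 4 (everything), assumed scale
    fluency: float          # 1 (incomprehensible) to 4 (flawless), assumed scale
    errors_per_1000: float  # weighted error count per 1,000 words


def composite_score(ev, w_adequacy=0.4, w_fluency=0.3, w_errors=0.3, max_errors=50.0):
    """Blend the three measures into one 0-100 score. Weights are illustrative."""
    adequacy = (ev.adequacy - 1) / 3
    fluency = (ev.fluency - 1) / 3
    # Fewer errors is better; anything at or above max_errors counts as zero.
    error_part = 1 - min(ev.errors_per_1000, max_errors) / max_errors
    return 100 * (w_adequacy * adequacy + w_fluency * fluency + w_errors * error_part)


def pricing_band(score, minimum, average):
    """Place a score relative to the minimum threshold and the average."""
    if score < minimum:
        return "below minimum"
    if score < average:
        return "discount"
    if score > average:
        return "premium"
    return "standard"
```

In this sketch, the minimum and the average would have to come from benchmark data for the content type in question; the hard part is collecting that data, not the arithmetic.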


Translation as a five-star hotel

Let’s forget about puddings and tulips and consider another example, now from the hospitality industry: hotel rating. A national body defines the requirements a hotel needs to fulfill in order to obtain a number of stars. Under one star, we talk about a bed and breakfast, a hostel or a camp ground, but not about a hotel. Why not do the same with translations? The budget or basic quality translation is one star. Comprehensibility would be a right criterion for this basic level. Gisting is another word that comes to mind in this context. There are different evaluation types that can be useful here such as readability or usability. Under one star, the translation is not a translation anymore since a large part of the text is incomprehensible. To be at least one star you need to score a minimum on some standard metric. For two stars you need a higher score. If customers want a five-star translation, they need to pay for it. Do they want it at record speed? They will pay more. They have a lower budget? You offer a three-star translation. This may sound absurd, but that’s where the industry is heading right now, and with a good reason.
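
As a rough illustration of how such a star system could work, the sketch below maps a quality score onto zero to five stars. The cut-off values in `STAR_THRESHOLDS` are invented for the example; real thresholds would have to be derived from industry benchmarks for a given metric and content type.

```python
import bisect

# Hypothetical minimum scores (on a 0-100 scale) for one to five stars.
STAR_THRESHOLDS = [40, 55, 70, 85, 95]


def stars(score, thresholds=STAR_THRESHOLDS):
    """Return the number of stars (0 to 5) that a quality score earns.

    A result of 0 means the text falls below the one-star minimum and,
    in the terms used above, no longer counts as a translation.
    """
    # bisect_right counts how many thresholds are less than or equal to score.
    return bisect.bisect_right(thresholds, score)
```

With these assumed cut-offs, a score of exactly 40 just earns one star, while a score of 39 earns none.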

One method of choosing the right type of evaluation or the right mix of metrics is profiling content based on utility, time and sentiment. Utility refers to the relative importance of the functionality of the translation; time is the speed with which the translation is required; and sentiment refers to the importance of impact on brand image, meaning how damaging low quality translation might be. According to Sharon O’Brien’s article published in the January 2012 issue of The Journal of Specialised Translation, “A dynamic Quality Evaluation model should cater for variability in content type, communicative function, end user requirements, context, perishability, or mode of translation generation.”

As mentioned previously, once we have specified a basic level, we need to come up with an average score for translation quality. The only way to create an objective benchmark is by collecting a large amount of evaluation data obtained from different evaluators, from different domains, genres, language pairs and so on. Human evaluation remains subjective unless you have gathered data from a large number of users. Is community evaluation the way to go?


Community evaluation

Last June, a number of participants from the TAUS QE Summit took up the challenge to define best practices for community evaluation of translations. Community evaluation involves capturing the target audience’s preferences by actually crowdsourcing the QE process itself. You want the quality of the content to meet the expectations of customers and users, and you also want to engage these groups in the production cycle to increase brand loyalty and help sharpen the focus of your offering or content.

This type of evaluation usually involves opening an online collaboration platform to volunteers who help review translated content. These volunteer evaluators build a collaborative community where they participate as reviewers (of the content and often of one another) and their work becomes visible immediately. By the way, the term community seems to be preferred: it suggests coherence. People like being part of a community. Even if you start with a “random” crowd, the objective should be to build a community.

But wait a second! Can I trust crowdsourcing? How do I make sure the evaluation provided is on par with professional evaluation? In a recent paper entitled “Crowdsourcing for Evaluating Machine Translation Quality,” Shinsuke Goto, Donghui Lin and Toru Ishida of Kyoto University undertook a pilot for crowdsourcing MT quality evaluation for Chinese to English translations. They compared crowdsourcing scores to professional scores and found that the average score of crowdsourced (or community) evaluators matched professional evaluation results.
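
If you want to run this kind of check on your own data, the comparison boils down to averaging the community scores per segment and seeing how closely they track the professional scores. The sketch below is one simple way to do that with a Pearson correlation; it is not the method used in the Kyoto University pilot, and the function names are made up for this example.

```python
from statistics import mean


def crowd_averages(ratings_by_segment):
    """Average the community ratings for each segment."""
    return {seg: mean(scores) for seg, scores in ratings_by_segment.items()}


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / (var_x * var_y) ** 0.5


def agreement(crowd_ratings, professional_scores):
    """Return (correlation, mean absolute difference) over the shared segments."""
    averages = crowd_averages(crowd_ratings)
    segments = sorted(set(averages) & set(professional_scores))
    crowd = [averages[s] for s in segments]
    pro = [professional_scores[s] for s in segments]
    gap = mean(abs(c - p) for c, p in zip(crowd, pro))
    return pearson(crowd, pro), gap
```

A high correlation with a small gap suggests the community ranks and scores segments much as the professionals do. A high correlation with a large gap suggests the community is consistently stricter or more lenient, which can be corrected for.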

The reason evaluators choose to join the community can vary. It can be because of an emotional bond with the content, pride in one’s native language or just for the sake of practicing language skills. In big international companies, communities can be formed from very interested users. Evaluation can also be passed on as a game. Knowing your community will enable you to choose the trigger that works best.


Why another standard?

Collecting evaluation data on as much content, from as many evaluators and sources as possible, will give us an idea of the thresholds belonging to different levels of quality. But how do you prove that your translation reaches a given level? Is a quality mark for translations a viable option in today’s translation industry? Of course, ISO standards and certificates to ensure quality are available. Why should we certify the product itself, the translation, if the language service provider (LSP) or its quality management process already matches certain standards? My answer is: certifying the LSP or standardizing the process is just not enough. Technology changes, processes change and translators come and go. It might work if you do the certification every quarter or every month but even then, on the product level it’s unreliable. We need to know the exact quality of the final product.

Moreover, the customer needs to be able to specify the quality level he or she desires and can afford. Most customers have no clue what to say when you ask them about the quality level they want. As I mentioned earlier, there are vendors today offering translation services “tailored to your needs”: from budget translations through first draft translations to professional translations and transcreation. Based on the content and the purpose, you pay for the quality level you choose. Some LSPs ask you to kindly specify the quality level when requesting a quote. The levels may range from free to expert translation, and each level comes with a definition. Yet another company introduced the translation strip ticket: “a quick and economical translation service for all of your short texts.” The same company is now hiring translation students to do the translating. I like the idea of choice here. You can choose to pay less and translate more at a lower quality level. You can save on some content and invest more in other content.

My only problem again is: how do I know that the delivered quality is what I pay for? Can you prove that? Budget translation is too broad a category. The term light post-editing is too vague. Raw MT output varies from domain to domain. The price might be specific but not the exact quality I obtain for that price. It should be quantifiable. It should be comparable. Note: I’m talking about individual translations here and different levels (types if you like) of quality. The process (use of MT, light versus full post-editing and so on) as well as the pricing won’t prove anything about the quality level of the end product.

Independent evaluation

An independent evaluation of translations is the only way to obtain a neutral quality mark for translations. There are two ways to do that. One option is to use automatic evaluation that is based on a widely accepted standard and correlates well with human evaluation. Automated metrics such as BLEU, TER, GTM and so on compute how similar a translated sentence is to a reference translation or previously translated data. It is assumed that the smaller the difference, the better the quality will be. Automated metrics emerged to address the need for objective, consistent, quick and cheap evaluations. The problem with these metrics is that they require a reference translation, which is unrealistic in an industry setting. They might be useful when training engines, but when measuring and comparing the quality of the end product, they are of very limited use. And what is quality in computer-speak, anyway? What are these algorithms calculating, exactly?
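
To give a feel for what such algorithms calculate, here is a deliberately simplified sketch of a reference-based score: the number of word insertions, deletions and substitutions needed to turn the translation into the reference, divided by the length of the reference. It is in the spirit of TER but is not TER itself (real TER also handles shifts of word sequences, among other things), and it is not BLEU or GTM either. The function names are hypothetical.

```python
def word_edit_distance(hyp_words, ref_words):
    """Minimum number of word insertions, deletions and substitutions."""
    # prev[j] holds the distance between the hypothesis words processed
    # so far and the first j reference words.
    prev = list(range(len(ref_words) + 1))
    for i, h in enumerate(hyp_words, 1):
        curr = [i]
        for j, r in enumerate(ref_words, 1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_rate(hypothesis, reference):
    """Edits per reference word: 0.0 is identical, higher is worse."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not ref:
        raise ValueError("the reference translation must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

Note what the score does and does not say: a perfectly good translation that happens to be worded differently from the reference is penalized just as heavily as a wrong one.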

A better option is to send the translation to a third-party evaluation service: not an in-house reviewer, translator or another LSP, but a party that is vendor-neutral and uses standard QE tools. I know, your first reaction is probably: this would cost too much. And I agree, right now, it is still impossible to have each and every translation certified by a third party. Large and critical projects, however, would definitely deserve such treatment. And there is something called automatic sampling, which helps reduce the cost by reducing the volume in a systematic way. Automatic sampling is to be added in the near future to the TAUS DQF platform.
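
How sampling will work on the DQF platform is not described here, so the following is only a generic sketch of one common approach, systematic sampling: pick a random starting point and then take every k-th segment, so that the sample is spread evenly across the whole project. The function name and parameters are assumptions.

```python
import random


def systematic_sample(segments, sample_size, seed=None):
    """Pick sample_size segments spread evenly across the project."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if sample_size >= len(segments):
        return list(segments)  # nothing to reduce
    step = len(segments) // sample_size
    # A random offset within the first interval avoids always picking
    # the same positions (for example, only the opening segments).
    start = random.Random(seed).randrange(step)
    return [segments[start + i * step] for i in range(sample_size)]
```

Evaluating 10 segments out of 100 this way cuts the review volume by 90 percent, at the price of some uncertainty about the segments that were not looked at.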

There is nothing new under the sun. This approach to quality has actually been with us a long time. Just look at cars and houses and of course puddings. Prices have been set for each type of product based on the quality levels. Based on the quality (or the extra features) the price differs. In real estate, you can ask an independent agent to help you assess the quality or value of a house. Why not do the same for translations?


Translators as musicians

In the June 2014 HuffingtonPost.com article “Why So Many Translators Hate Translation Technology,” Nataly Kelly wrote that “Translators prioritize quality, whether the customer does or not. Ask a translator what kind of quality is acceptable, and a professional will tell you, ‘Only the best.’ Would a musician be happy with a subpar performance? No, and neither are translators.” Sure, translators might not like providing imperfect translations. But let’s face it, most translations are imperfect because time is short, resources are limited, we are all human and the customer doesn’t want to pay for perfection. Would a musician be happy? Maybe not. But it’s the producer who decides when the recording is satisfactory and when the band is ready to leave the studio.

Evaluating puddings and evaluating translations have one thing in common: you can’t trust only one person’s taste. What you need are many evaluators in order to avoid subjectivity and to obtain valuable insight into basic and average levels of quality, and into minimum requirements, compliance and acceptance. To be able to prove that you fulfill the minimum requirements of quality products, you need to create benchmarks based on different attributes of your product. This is to be done by combining evaluation types and collecting huge amounts of evaluation data. Community evaluation might be one way for the industry to harvest evaluation data on a large scale and to create benchmarks. TAUS took up the challenge to provide these benchmarks through its DQF initiative. Finally, to show that you provide the right quality, a translation quality mark acknowledged by the industry and provided by an independent third-party service would be the most credible way to go.

Who knows, maybe one day we will be able to assign a quality mark to translations automatically. But until then, the proof of the pudding remains in the eating.