Rethinking LQE Metrics | MultiLingual September 2023

ANALYSIS

Rethinking LQE Metrics
The impact of sample size on the perception of quality

BY Marina Pantcheva

Abstract

The MQM 2.0 industry-standard metric for measuring translation quality disregards the size of the reviewed sample: one error in a 100-word sample is just as bad as 10 errors in a 1,000-word sample. This tenet is challenged by linguistic theory, according to which the sensitivity to language rule violations is non-linear.

We set out to validate linguistic theory against data and explore the dependency between error counts, sample sizes, and the overall quality experience ratings in real-life linguistic quality evaluation (LQE) reviews. The results from the data analysis indicate that the smaller the evaluated text sample is, the less sensitive reviewers are to the errors it contains.

These results are in line with linguistic theory but challenge the industry-standard MQM 2.0 quality metrics, raising the question of whether the MQM metrics should be revised to better reflect speakers’ perception of quality.

The linearity of MQM 2.0 quality scores

Linguistic quality evaluation relies heavily on the existence of standardized metrics for measuring translation quality. Such metrics aim to produce a numerical score derived from errors annotated against a reviewed text. The numerical score reflects how close the translation quality is to the “perfect” quality, that is, a translation without any errors.

The most widely used quality evaluation framework is Multidimensional Quality Metrics (MQM 2.0), by now established as the industry standard for defining quality evaluation metrics. The MQM 2.0 metric contains a few key components to consider:

Absolute penalty total (APT): the grand total of all penalty points produced by the annotated errors.
Evaluation word count (EWC): the count of source words in the sample translation that was subjected to quality evaluation during review.
Maximum score value (MSV): a perfect (maximum) score on a scale that is familiar to the quality manager and users of the evaluation. The default MSV is 100.
Overall Quality Score (OQS): a numerical value, typically on a scale of 0-100, which reflects the quality of the evaluated sample. It is calculated from the APT and EWC values and multiplied by the maximum score value.

The overall quality score is calculated as follows:

Overall Quality Score = (1 – (Absolute Penalty Total/ Evaluation Word Count)) * Maximum Score Value

Leaving the technical details aside, the MQM 2.0 quality score formula represents a linear function. As an illustration, a 100-word translation containing one minor accuracy error produces the same overall quality score as a 1,000-word translation containing 10 minor accuracy errors or a 10,000-word translation containing 100 minor accuracy errors. If a quality manager sets the LQE passing threshold at 10 penalty points per 1,000 reviewed words, all of the reviewed samples in Table 1 pass the quality bar.
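The linearity can be checked with a few lines of Python. The sketch below only illustrates the formula above: the function name `overall_quality_score` is ours, and it assumes that each minor error carries one penalty point, which is consistent with the passing threshold mentioned above but is not stated explicitly.

```python
def overall_quality_score(apt: float, ewc: int, msv: float = 100.0) -> float:
    """OQS = (1 - APT / EWC) * MSV, as in the formula above."""
    if ewc <= 0:
        raise ValueError("evaluation word count must be positive")
    return (1 - apt / ewc) * msv


# Assumption: one minor error = one penalty point.
samples = [(1, 100), (10, 1_000), (100, 10_000)]  # (APT, EWC)
scores = [overall_quality_score(apt, ewc) for apt, ewc in samples]

for (apt, ewc), score in zip(samples, scores):
    print(f"{apt:>3} penalty points in {ewc:>6} words -> OQS {score:.1f}")
```

All three samples receive the same score, because only the ratio of penalty points to evaluated words enters the formula.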

Table 1: Correlation between absolute penalty total and evaluated word count in MQM 2.0

Human perception of translation quality might not be linear

The linearity of the MQM overall quality score metrics is a convenient way to extrapolate quality scores and flexibly manage LQE of various sample sizes. The question is, however, whether this formula accurately reflects the way human speakers perceive linguistic quality. There are reasons to believe that it does not, and the evidence comes from theoretical linguistics, specifically, from the way children process deviations from linguistic rules and exceptions to regular language patterns.

In his work on child language acquisition, Charles Yang investigates children’s sensitivity to linguistic irregularity. Yang notes that children can correctly postulate a productive grammatical rule despite the existence of exceptions to this rule. For instance, children derive the rule that the English past tense is formed by adding -ed to the verb despite the existence of irregular verbs, such as “went” and “saw,” that do not follow the established past tense formation rule. Importantly, in order for children to postulate a rule, the number of deviations from this rule must fall below a critical threshold. This threshold is defined as follows:

θ_N := N / ln N

If a language has a grammatical rule that applies to N cases, the maximum count of exceptions that a child can tolerate without discarding the rule is θ_N.

As an example, consider the following hypothetical case (Table 2). A language has a past tense formation rule that applies to 100 verbs. Children who acquire this language will tolerate a maximum of 22 exceptions (22%) and still consider the rule valid. If the exception count is higher, children will not postulate the rule. If the rule applies to 1,000 verbs, children will tolerate 145 exceptions, which is about 14.5%, i.e., more than seven percentage points lower than in the case of 100 verbs.

Yang labels this principle the Tolerance Principle. Its most interesting property is that the threshold for tolerating deviations from language rules is not a linear function of the sample size: it grows with N divided by the natural logarithm of N, so the tolerated share of exceptions shrinks as the sample grows.
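For readers who want to reproduce the numbers, here is a minimal Python sketch of the threshold. The function name `tolerance_threshold` is our own; the formula is the one given above.

```python
import math


def tolerance_threshold(n: int) -> float:
    """Maximum tolerated exceptions for a rule covering n cases: n / ln(n)."""
    if n < 2:
        raise ValueError("n must be at least 2 so that ln(n) is positive")
    return n / math.log(n)


for n in (100, 1_000):
    theta = tolerance_threshold(n)
    print(f"N = {n:>5}: threshold = {theta:6.1f} ({theta / n:.1%} of N)")
```

The thresholds come out at roughly 21.7 and 144.8, which round to the 22 and 145 exceptions quoted in the example, and the tolerated share falls as N grows.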

Table 2: Tolerated exception rate for different sample sizes according to the Tolerance Principle

The Tolerance Principle

Let’s contemplate extending the Tolerance Principle to the perception of language errors in a sample text, assuming that errors are a sort of “exception” from linguistic rules to which human speakers are sensitive. Under such a hypothesis, we’d expect that speakers’ sensitivity to errors, which are deviations from linguistic rules, does not follow a linear dependency, but a logarithmic one: the larger the sample, the fewer errors are accepted when normalized against a reference sample size (for instance, 1K).

If we directly apply the Tolerance Principle to LQE score calculation, the MQM numbers shown in Table 1 change as visualized in the middle column of Table 3.

Under this hypothesis, one error in a 100-word reviewed sample indicates superior language quality compared to 10 errors in a 1,000-word sample because of the shifting error acceptance threshold. Compared to MQM 2.0, an LQE metric based on the Tolerance Principle is:

More permissive for smaller samples: tolerating about one third more errors than MQM 2.0 in a 100-word sample.
Stricter for larger samples: allowing for roughly half of the penalty points allowed by MQM 2.0 in a 100K-word sample text.

This gives rise to the following questions:

Is a linear LQE metric the right metric to use?
Is there an objective way to test whether the Tolerance Principle extends beyond child language acquisition to the area of language quality evaluation?

Table 3: Correlation between Absolute Penalty Total and Evaluated Word Count under the Tolerance Principle

Detecting the effect of the Tolerance Principle in LQE

To answer the first question, we must begin by exploring the second one: Can we measure the effect of the Tolerance Principle on language quality evaluation?

In order to conduct these measurements, it’s necessary to analyze LQE reviews that have the following characteristics:

The reviews are run on samples with different word counts.
The reviews produce an absolute penalty total – a grand total of all penalty points generated by the annotated errors.
The reviewers provide a subjective evaluation of the overall quality of the reviewed sample – such subjective evaluation is typically elicited through an overall experience rating on a predefined scale, most often from 1 to 100.

While all three components are key, the overall quality experience is by far the most interesting. It directly reflects the tolerance of the human reviewers to the errors they find in the reviewed sample. It is a subjective assessment reflecting the generic perception of the language quality. The remaining two data types are objective measures, quantified or calculated based on hard data.

Methodology and data analysis

In the previous section, we laid out the three input data types necessary to test the hypothesis that reviewers’ perception of language quality follows a non-linear dependency on the size of the reviewed sample. The input data are:

Sample size = evaluated word count
Total sum of penalty points = the total sum of all penalty points produced by the annotated errors
Experience rating score = the subjective evaluation of the linguistic quality provided on a scale of 1-100.

To proceed with the data analysis, we selected a little over 8,000 LQE reviews that met all three requirements: varied sample sizes, recorded penalty points and availability of experience rating. The selected LQEs covered more than 30 languages. They all came from the same project and were subject to the same evaluation criteria. With this, we aimed to reduce the randomness of the data.

For the analysis itself, we used the following methodology:

Step 1: Clean up the data: remove erroneously filled-in LQE forms as well as the outliers in the data set.
Step 2: Categorize the LQE reviews in the database into “sample size bins” according to their evaluated word count.
Step 3: For each LQE in the database, derive a normed penalty total, loosely referred to as “penalty points per one thousand reviewed words.” This measure is calculated by taking the total sum of penalty points, dividing them by the evaluated word count and then multiplying by a reference word count whose standard default value is 1,000 words.
Step 4: Establish a relationship between the normed penalty total (penalty points per 1K words) and the experience rating provided by the reviewer.
Step 5: Map out the dependency between the relationships established in Step 4 and each sample size bin.
Step 6: Compare each of the dependencies across every sample size bin.
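
Steps 2 to 4 can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the bin edges in `BIN_EDGES`, the helper names, and the four toy reviews are all made up for the example, and ordinary least squares is assumed as the way to establish the relationship in Step 4.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Optional

REFERENCE_WORD_COUNT = 1_000
# Hypothetical bin edges in words; the real ones are listed in Table 4.
BIN_EDGES = [100, 500, 1_000, 1_500, 2_000, 3_000, 4_000]


def normed_penalty_total(apt: float, ewc: int) -> float:
    """Step 3: penalty points per REFERENCE_WORD_COUNT reviewed words."""
    return apt / ewc * REFERENCE_WORD_COUNT


def size_bin(ewc: int) -> Optional[int]:
    """Step 2: 1-based bin index, or None for samples outside the kept range."""
    if not BIN_EDGES[0] <= ewc < BIN_EDGES[-1]:
        return None
    return bisect_right(BIN_EDGES, ewc)


def fit_line(xs, ys):
    """Step 4: least squares fit of rating = intercept + slope * penalty."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy reviews (word count, penalty points, experience rating); not real data.
reviews = [(300, 3, 85), (400, 8, 75), (1_000, 5, 90), (1_000, 20, 70)]

by_bin = defaultdict(list)
for ewc, apt, rating in reviews:
    b = size_bin(ewc)
    if b is not None:
        by_bin[b].append((normed_penalty_total(apt, ewc), rating))

# One (intercept, slope) pair per sample size bin.
lines = {b: fit_line(*zip(*points)) for b, points in by_bin.items()}
for b, (intercept, slope) in sorted(lines.items()):
    print(f"bin {b}: rating = {intercept:.1f} + ({slope:.2f}) * penalty per 1K")
```

Steps 5 and 6 then amount to comparing the fitted intercepts and slopes across the bins.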

Sample size bins

The categorization of samples into bins deserves a separate note. For the purpose of the analysis, we divided the samples into the categories listed in Table 4.

We discarded two bins, one with samples larger than 4K words and one with samples smaller than 100 words, because they contained too few samples.

A look at Table 4 reveals that there is an uneven distribution of LQE reviews across the bins. This is due to the strong preference for a single standardized sample size of 1,000 words, i.e., the standard reference word count. For this reason, bin 3 contains 84% of all the samples.

Figure 1: Correlation between experience score and penalty points per 1K for each sample size bin

Results: the larger the sample size, the worse the experience rating

The findings from the analysis are shown in Figure 1. The horizontal axis denotes the count of error (penalty) points per normalized sample of 1K words. The vertical axis represents the reviewers’ experience rating score on a scale from 1 to 100. Each bin has a regression line in a dedicated color. We can distinguish three groups:

Bin 1 (small samples): yellow line
Bins 2 and 3 (medium samples): green and red lines.
Bins 4, 5, and 6 (large samples): purple, brown, and pink lines.

Focusing on the horizontal line representing an experience score with the value 80, let’s zoom in on specific points. Our interest lies in determining what error-points-per-1K score corresponds to an experience score of 80 for each bin. This can be easily observed by identifying where the thick black horizontal line intersects the regression line of each bin.

Bins 5 and 6 (largest samples with 2-4K words) – the corresponding error-points-per-1K score is eight.
Bin 1 (smallest samples of less than 500 words) – the corresponding error-points-per-1K score is 15.

This implies that in large samples, an experience score of 80 correlates with eight normed penalty points (error points per 1K reviewed words), while in small samples, the same experience score correlates with 15 normed penalty points – nearly twice as many. Consequently, identical experience ratings correspond to fewer logged errors in large samples and more logged errors in small samples.

Let’s approach the data from another perspective and consider 15 normed penalty points as the pivot. What experience rating does a score of 15 penalty points correspond to? For large samples (bin 6) it is approximately 68, whereas for small samples (bin 1), it is 80. Therefore, the same normed penalty score corresponds to a lower experience rating in large samples and to a higher experience rating in small samples. In simpler terms, the smaller a reviewed sample is, the better the experience rating reviewers tend to give for the same overall error score.
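
As a back-of-the-envelope check, the two readings quoted for the large samples, roughly (8, 80) and (15, 68), are enough to recover an approximate regression line for that group. The Python sketch below does this. It assumes that the line is straight and that both readings lie on the same line (the text gives the first for bins 5 and 6 and the second for bin 6), and it uses values read off the figure, so the result is indicative only. The helper names are ours.

```python
def line_through(p1, p2):
    """Return (intercept, slope) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return y1 - slope * x1, slope


def penalty_at_rating(intercept, slope, rating):
    """Invert rating = intercept + slope * penalty to get the penalty."""
    return (rating - intercept) / slope


# Approximate readings quoted in the text for large samples:
# (penalty points per 1K words, experience rating).
intercept, slope = line_through((8, 80), (15, 68))
print(f"slope: {slope:.2f} rating points per penalty point")
print(f"penalty per 1K at rating 80: {penalty_at_rating(intercept, slope, 80):.1f}")
```

Under these assumptions, each additional penalty point per 1K words costs a little under two rating points in large samples.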

Table 4: Categorization of LQE samples into bins according to their word count

Conclusion: Sample size matters

In this article, we investigated the idea that the linguistic Tolerance Principle is manifested through a non-linear perception of language quality. We tested the hypothesis against real-life LQE data by analyzing the dependency between error count, sample size, and the overall quality experience rating. The findings were aligned with linguistic theory: The data shows a clear trend for reduced sensitivity to errors as the sample size decreases. This is evidenced by the fact that reviewers report a better quality experience for identical error-points-per-1K scores as the reviewed sample becomes smaller.

Practical applications

Regarding practical applications, one of the primary considerations is to conduct LQE reviews on samples that have the typical text length that users encounter. Such data can be extracted from page view telemetry, web analytics tools, and similar sources. By focusing LQE reviews on sample sizes that users commonly engage with, we can gain a more accurate understanding of their perception of language quality.

Another important consequence of a non-linear LQE metric is its potential to address the longstanding challenge of LQE reviews failing when conducted on small samples. Many quality managers cringe at the prospect of running LQE reviews on limited samples, knowing that the risk of failure increases as the sample size decreases. However, a non-linear LQE metric that incorporates the Tolerance Principle is up to 30% more lenient when applied to small samples. Consequently, this decrease in stringency lowers the likelihood that a single logged error will lead to a disastrously failed LQE review.

Recommendations for further research

This paper presents initial discoveries and aims to inspire further exploration and validation of the idea that the sensitivity to language quality follows a non-linear dependency. Further research should seek to eliminate possible data biases that surfaced in the current analysis:

Balance the count of samples in each data bin
In the present research, the distribution of samples is uneven due to a strong preference for a single standardized sample size of approximately 1,000 words. Future studies should strive to balance the sample distribution to obtain more robust and representative results.
Focus on mono-cultural LQE reviews to account for cultural specificities
During the data analysis stage, we observed a clear tendency in some cultures to provide unexpectedly positive experience ratings despite a very high normalized penalty total. This phenomenon is attributed to the existing cultural norms according to which overt criticism is considered impolite and inappropriate in those cultures. We therefore recommend conducting research that concentrates on mono-cultural reviews to eliminate the effect of cultural nuances on language quality assessment.

By addressing these areas of concern, future research can advance our understanding of the non-linear relationship between language quality perception and reviewed word count, leading to more accurate and practical applications of this knowledge and eventually to a new, non-linear, formula for calculating the overall quality score.

Acknowledgements

A multidisciplinary team worked on this research project in the period 2019-2021. I would like to express my gratitude to all the colleagues involved, who spent many Friday afternoons pondering over data and stats:

Michal Pustejovsky (mathematical engineering), Stanislav Kudelas (data and information management), Kamila Chramostova (BI data analysis), David Odler (data engineering), and especially to our dearly departed colleague Ivan Drienik whose unwavering belief in the project empowered the team to initiate and successfully complete the work.

Marina Pantcheva is a polyglot with a passion for languages, data, and structure. She combined all three in her PhD research on nanosyntax. She currently works at RWS leading a multidisciplinary team that focuses on crowd localization solutions.
