A text is fundamentally, intrinsically more than the sum of its parts, and therefore only holistic human quality assessment can serve as the ultimate metric of translation quality.
Over and over again these days, we hear that machine translation will overtake human translators. Statistical machine translation engines were touted as a great breakthrough, but now they are all but forgotten as we hear about neural machine translation (NMT) allegedly giving much better translation quality. Alas, NMT output reads better, but is already infamous for hiding accuracy errors. Clearly, today more than ever before, we need a reliable, uniform way to measure translation quality.
So far, all attempts to measure translation quality have led to an explosion of parameters, incompatible approaches and a bewildering variety of metrics.
Why is text so difficult to handle? Why does the world still need copywriters, editors, human translators, terminologists and other language specialists — and why will we need them for the foreseeable future?
Typically, text is considered a simple carrier of language — merely a series of marks on a page. That simplicity is deceiving.
Linguists, of course, and especially professional translators, frequently point out how complex language is. Idioms are often mentioned as a translation hurdle, and they can indeed make just about any translator “go bananas,” as well as completely foil NMT engines.
But understanding the complexity of language requires a much deeper investigation than looking at idioms and the failures of NMT. We propose a parallel between text and the area of physics known as complex systems, which we maintain is a new and promising angle for understanding the nature of language and for determining a reasonable approach to quality assessment.
Let us begin by pointing out that text, as a medium, is very different from other types of communication, such as visual impressions.
Take a look at a picture. In just a few seconds, you can form a fairly complete idea of what the picture represents, even though there are many different elements to consider. For example, the setting might be a grassy riverbank with large, shady trees in the background. There are many people along the bank, some reclining on the ground, others standing and looking idly toward the water. They are dressed in casual, old-fashioned clothing. The brain takes in all these stimuli at once, draws connections between them, fills in the blanks and forms an idea of what the image represents.
Now take a look at a page of text. It is fundamentally impossible to glance over it and grasp it as completely as an image. This is because we experience text and understand its contents sequentially. Every word in the text — whether written and read, or spoken and heard — builds on the words that precede it, adding new meaning and gradually completing a thought or message.
All linguistic processes are sequential, and the only way to understand them is by experiencing them from start to finish.
Let us now consider the following list of words: horse, breakfast, his, feeding, the, farmer, before, son, eats, with.
We can quickly read each word and grasp its meaning, but in this order, the words do not form a meaningful message. When they are arranged in a logical order, each word building on the ones before it, a clear, comprehensible message emerges: “The farmer eats breakfast with his son before feeding his horse.”
Language instantiated in communication is a code. The words form a sequence which we can decode. Arranged in a different sequence, the words form a different message: “The son eats with his horse before feeding his farmer breakfast.”
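The point that the very same set of words encodes entirely different messages depending on their sequence can be sketched in a few lines of Python (the two sentences are taken from the examples above):

```python
# The two example sentences, split into their word tokens.
original = "the farmer eats breakfast with his son before feeding his horse".split()
rearranged = "the son eats with his horse before feeding his farmer breakfast".split()

# As unordered collections of parts, the two texts are identical...
same_parts = sorted(original) == sorted(rearranged)

# ...but as sequences — as decoded by a reader — they are not.
same_message = " ".join(original) == " ".join(rearranged)

print(same_parts)    # the parts match
print(same_message)  # the messages do not
```

Any part-by-part comparison finds the two texts indistinguishable; only the sequential reading distinguishes a farmer feeding his horse from a son feeding his farmer.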
Human beings read the code sequentially, experiencing its meaning and emotional impact over time. That impact is effectively the same as the influence of changing signals coming from something quite complex, potentially infinitely complex, and the reconstruction of what the code bears is just as complex.
This leads to a critical observation. We maintain that the experience of reading is effectively the same as processing the output or activity of a complex physical system, and the reconstruction of the message is the result of the totality of the reader’s experience.
Text can exhibit all the properties that physicists have identified in studying complex systems.
Consider a training manual. It contains a set of instructions that often refer to information that came before, such as definitions of particular terms that are essential to know in order to understand the instructions. Sentences often build on the ones before. That property is called feedback in complex systems.
Text must also be consistent: words need to cohere with the other parts of the text. Complex systems have an analogous requirement, called compatibility — the parts of a system cannot be completely disparate if the system is to work.
The words must also be complementary to each other, with every successive word enhancing the words that came before.
Reasonable variety emphasizes the relatedness between individual words while avoiding focus on any specific ones. The other side of this variety is thought concentration, in which the primary words used focus on certain common elements while the other supporting words have a bit more variety. These key parts and supporting information allow for a balance between commonality and variety.
The concentration on a certain topic can also be termed monocentrism. However, this single-minded concentration is only as strong as the weakest part of the system, which can undermine the whole. Frivolous elements of the text, which can still be perfectly correct as individual words, can impede the meaning of the entire text.
External augmentation is another key element of the text system. This refers to how the process of reading constantly builds on what came before: the reader registers new information as part of reading, with each new piece of information steering the holistic message in a new direction.
External augmentation leads to a trend toward diversity: similar materials tend to be uninteresting, so writers and readers alike seek different and more interesting content. Keep in mind that any system is also, in some measure, incomplete.
Text also tends to adhere to certain standards — in grammar and style, but also content. For cohesion, the internal textual references need to be consistent, adhering to and reinforcing the experience of the reader.
The experience of the reader (of a text) or of the user (interacting with a complex system) gradually renders the system more stable, as it follows and repeats the rules that make up the context. With this progressive mechanization, the more basic text elements become more fixed, show less variety and stick to certain rules, while the more interesting textual elements carry the new information. This process allows the novel subject matter to stand out and reduces surrounding distractions: the main message is not bogged down in flowery language that adds nothing important.
This leads to coherence, whereby all the individual parts tie together, building on one another with a singular purpose. When a text is incoherent, parts of the textual whole don’t fit. Those parts can be understood singularly, but have no relation to the wider context, and contribute nothing to the overall meaning.
These abstract concepts, adopted from the domain of physics, shed light on the nature of text as a medium. They demonstrate that text is a code that the reader deciphers sequentially, progressively forming an impression consciously intended (or not) to be made by the output of a potentially infinitely complex system.
This impression is a totality of the reader’s experience, formed by all the properties mentioned before, together as in a mosaic made of a limited number of colored beads. The beads themselves may be rough and uneven, but they form a picture that can only be seen as a whole.
This is precisely the paradox: even though text is not a picture and is comprehended sequentially, the reader reconstructs the entire message as a whole, which results in a holistic appreciation that is even more complex and multidimensional than a picture. The message that the text bears may be infinitely complex. That’s why the simplicity of text is elusive: literally anything can be expressed by text, a simple sequence of symbols.
The experience that textual material can produce is much deeper than the sum of the letters of the Latin (or any other) alphabet, or the number of words in any language, or even all the existing sentences and paragraphs ever created, and infinitely larger than the entire corpus of Google-indexed content of the entire world on which “neural” models are being taught.
At this juncture it is appropriate to ask a fundamental question. If we understand this infinite complexity, if we agree that the whole is much more than the sum of its parts, why do we believe that counting errors in sentences is an accurate measure of the quality of a translation?
Traditional “analytic” translation quality assessment focuses on the translation of these individual pieces with little regard for the context they occur in, the purpose of the entire work, the audience, the culture, the effect of materials previously read, and so on. Approaching the text piecemeal discards the context that determines whether it is serious or joking, sarcastic or sincere. It is like judging a mosaic by inspecting individual beads: the beads may well be less than perfect, but what matters is their sequence, their relative position, their place in the entire composition, and many other things that lie entirely outside the narrow scope of detailed examination. Even identical sentences can mean different things in different contexts. The relationship between the writer and the reader (or the speaker and the listener), the circumstances of the communicative act, the cultural backdrop, the historical context: all these factors and more can affect meaning and impression.
The meaning and the impression — these are the two key factors of the reader’s experience. At the end of the reading exercise we are concerned with only two things: does the message come through accurately, and what emotional impact does it make? Accuracy and fluency are key properties of the user experience, and both are experienced holistically, a fact noted as early as 1966. It is therefore clear that the holistic experience, the only true and correct impression of a reader, must be recognized as a major factor in any attempt to measure language quality. That is precisely why a holistic measure can help to finally define “well-formed” analytic metrics, which, we can logically deduce, must validate or confirm the holistic impression, since the latter is incontestable.

Another fact supports our thesis that holistic impression is a fundamental measure of language quality: when a reader notices individual errors in a text, the perceived severity of those errors fluctuates in his mind as he reads. He “recalibrates” their impact on his overall use or enjoyment of the text as he continues. What he perceives as a major error when he starts reading might later seem minor as he absorbs more of the context. Thus, the impact of individual errors is not fixed.
As we read, we might notice that something does not make sense or does not fit well into the “system” that the text constructs, whether grammar, tone or idiom. Taken one at a time, these individual errors might be noticeable without necessarily distorting the meaning of the total written composition. The tense for a verb might be incorrect, a comma might be missing where it would add clarity, a proper noun might be incorrectly written in lower case; all of these are errors that a reader or listener might notice, but are not jarring enough to make the entire work unintelligible.
On the other hand, the meaning is more likely to be distorted by errors that “disturb” the context shaped by the overall text. The meaning of each successive word in a text depends at least partially on the words that have come before it. Complete understanding is wholly dependent on context.
If we disregard the holistic nature of text, then translation technology will become a Procrustean bed, where quality assessment is unreliable, translations degrade in quality, and the real meaning and intent of texts will become of secondary importance if not completely ignored.
The idea of the whole being more than the sum of its parts has been the focus of many prominent philosophers over the centuries, such as Aristotle. Text as a medium for dynamic communication exemplifies this idea very well. Indeed, the whole — the entire impression that text makes in certain circumstances on certain people with certain intended purposes — has a quality and function far deeper than the words or sentences on their own.
Hegel also championed the notion that systems should be viewed holistically, believing that truth lay in studying the whole. The holistic view had dominated the natural sciences until the seventeenth century, when a mechanistic approach to studying systems took prominence: systems’ basic components were given the most in-depth focus, with little regard for the sequential relationships these components had with one another.
In the 20th century, systems theory became popular and the importance of considering the relationships between different parts became increasingly recognized. As shown earlier, elements of systems theory can be applied to text. Text produces a holistic impression based on the whole, which rewards the reader who progressively apprehends the meaning and is driven to take in more.
Thus, the holistic impression of a reader is the ultimate quality measure.
We decide on a text’s value to us as readers based on our ultimate, overall impression of accuracy and fluency, which develops iteratively as we read. Subconsciously, we rate the text as good, mediocre or bad, and this rating could be mapped to a numeric grading scale, such as one to ten, through a methodology of holistic evaluation. Such a grading scale is the ultimate reference for the creation of well-formed analytic metrics for measuring quality. That is one reason, among others, why we need a holistic methodology.