In between my two graduate school years at the Middlebury Institute of International Studies (MIIS), I had the pleasure of working as an intern on the machine translation (MT) team at eBay. During my internship, an opportunity arose to explore the challenges that statistical MT (SMT) faces when translating from English into Simplified Chinese.
As you may know, SMT can make many types of errors when generating output. My research focused on the issue of word order (WO). This article details four types of words that can cause WO errors in SMT from English into Simplified Chinese and offers a solution to tackle these challenges. The sample for this research was selected from the training data for one of eBay’s Simplified Chinese MT engine prototypes (eMT), with 726 sentences and 16,068 words in total. The research examined 140 sentences and identified 16 words and parts of speech that present WO challenges.
One of the main differences between English and Chinese is the position of modifiers, which makes MT between the two languages very challenging. In English, the position of modifiers is relatively flexible. A modifier can be placed before, after or even several words away from the key phrase that it modifies. For example, one can say, “Regularly, I swim,” “I regularly swim,” or “I swim regularly.” However, in Simplified Chinese, in most cases the modifier has to come before the key phrase it modifies. This leaves us with only one idiomatic way to express the same meaning: “I regularly swim.” If “regularly” is put anywhere else in this sentence, it sounds awkward or confusing, and in other cases a misplaced modifier can even create meaning errors.
So what kinds of modifiers do translation engines need to look out for when translating from English to Chinese? What are the main culprits of WO errors in MT? In order to detect specific modifiers that cause WO errors, we examined a sample of the training data.
Adverbs and adverbial clauses
As shown in the example above, in Simplified Chinese, adverbs do not have as much freedom as they do in English. An English sentence such as “I read newspapers daily” would be best translated as “I daily read newspapers” in Simplified Chinese. In most cases, adverbs have to stay religiously in front of the verb they are modifying in order for the sentence to make sense. Being able to identify and move adverbs in a sentence is crucial to delivering clear idiomatic MT output. There are of course exceptions. In certain circumstances, adverbs are placed elsewhere. However, such circumstances did not occur in the sample that was examined.
There are several kinds of adverbial clauses, such as those of time (“since”), of condition (“if”), of purpose (“in order to”) and of result (“so… that”). The only two markers that occurred in the training data were “if” for adverbial clauses of condition (such as “you will get a candy if you behave well”) and “once” for adverbial clauses of time (such as “I will visit you once I’m back”). Since adverbial clauses are dependent clauses that modify the main clause, in Simplified Chinese they have to be put in front of it. Take “if,” for example. An English sentence that reads “You will see him if you live in the neighborhood” should be translated into Simplified Chinese as “If you live in the neighborhood, you will see him.”
In these cases, translation engines have to identify the adverbial clauses and move them in front of the main clause in order for the output to make sense in Simplified Chinese.
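As a rough illustration of the "identify and move" step, here is a minimal Python sketch that works on English glosses rather than on real Chinese output. The function name `front_adverbial_clause` and the two-word subordinator list are assumptions made for this example; it is not how eMT handles the problem, and it would misfire on sentences where “once” is a plain adverb.

```python
import re

# Assumption: only the two subordinators seen in the sample are handled.
SUBORDINATORS = ("if", "once")

_CLAUSE = re.compile(
    r"^(?P<main>.+?),?\s+(?P<clause>(?:%s)\b.+)$" % "|".join(SUBORDINATORS),
    re.IGNORECASE,
)


def _lower_first(text: str) -> str:
    """Lowercase the first letter unless the first word is the pronoun 'I'."""
    first_word = text.split(" ", 1)[0]
    if first_word == "I" or first_word.startswith(("I'", "I’")):
        return text
    return text[0].lower() + text[1:]


def front_adverbial_clause(sentence: str) -> str:
    """Move a trailing 'if ...' or 'once ...' clause in front of the main clause."""
    body = sentence.rstrip(".")
    match = _CLAUSE.match(body)
    if match is None:
        return sentence  # no trailing clause found; leave the sentence untouched
    clause = match.group("clause")
    main = _lower_first(match.group("main"))
    return f"{clause[0].upper()}{clause[1:]}, {main}."


print(front_adverbial_clause("You will see him if you live in the neighborhood."))
# prints: If you live in the neighborhood, you will see him.
```

A pattern like this only works because the clause marker gives a clear boundary; a real engine would need a parse of the sentence rather than a regular expression.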
Attributive clauses are also dependent clauses. They start with relative words such as “who,” “where,” “which,” “how” and “that.” These culprits are tough to deal with because Simplified Chinese does not have attributive clauses. So how do we express ideas that in English require attributive clauses? Here is an example. The Chinese version might sound a bit convoluted when glossed in English (because English natives would not say it like this), but it sounds very natural and succinct in Simplified Chinese.
English: “You will see the boy who delivers newspaper.”
Simplified Chinese: “You will see the delivers-newspaper boy.”
We would either break the attributive clause out into a separate sentence or convert it into an adjective or adverb and place it in front of the key phrase. To maintain the flow of meaning and the cohesion of logic, we would usually convert it into an adjective or an adverb. And by now we know the rule for such modifiers: in most cases, they have to stay in front of the words or phrases they are modifying. So for attributive clauses, the engine has more, and tougher, steps to perform. It not only needs to identify the clause and translate it into an adjective or an adverb, but also needs to move the translation in front of the key phrase, which can be quite difficult given that the key phrase can be anywhere in the sentence. For example, in the sentence “I saw the girl the other day who had the tickets I needed,” the word “girl” is what the attributive clause is modifying, but it is separated from the clause by another phrase. Therefore, the engine has an additional challenge with attributive clauses, which is to correctly identify where to move the translation.
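To make the two steps concrete, the following Python sketch turns an attributive clause into a hyphenated pre-modifier, again on English glosses only. The name `prepose_relative_clause` and the pattern are hypothetical, and the sketch deliberately handles only the easy case where the key phrase sits right before the relative word.

```python
import re

# Assumption: the head noun is a single word directly followed by the relative word.
_RELATIVE = re.compile(
    r"\b(?P<det>the|a|an)\s+(?P<head>\w+)\s+(?:who|which|that)\s+(?P<clause>[^,.]+)",
    re.IGNORECASE,
)


def prepose_relative_clause(sentence: str) -> str:
    """Rewrite 'the boy who delivers newspaper' as 'the delivers-newspaper boy'."""
    def swap(match: re.Match) -> str:
        modifier = "-".join(match.group("clause").split())
        return f"{match.group('det')} {modifier} {match.group('head')}"

    return _RELATIVE.sub(swap, sentence, count=1)


print(prepose_relative_clause("You will see the boy who delivers newspaper."))
# prints: You will see the delivers-newspaper boy.

# The separated case is left unchanged, because the pattern cannot tell
# that "who had the tickets I needed" belongs to "girl".
print(prepose_relative_clause("I saw the girl the other day who had the tickets I needed."))
```

The second call shows why this is hard: once another phrase comes between the key phrase and its clause, a surface pattern no longer knows where the translation should go.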
So far, these examples are sentences simple enough that WO errors create awkwardness, but not too much confusion. But what if they gang up on us? The following sentence is a combination of the previous three shorter sentences. It is a perfect example to demonstrate just how messy it can get when all these modifiers come together in one sentence and they all need to be moved to different places. I marked the different components by color to show how different the translated sentence is from the original.
English: “You will see the boy who delivers newspaper daily, if you live in the neighborhood.”
Simplified Chinese: “If you live in the neighborhood, you will see the daily delivers-newspaper boy.”
Even in this far-from-worst-case scenario, translating in the wrong word order can create great confusion, if not meaning errors. If modifiers are placed in the wrong spots, they will modify something totally different from what was intended. The reader simply would not be able to understand the message being conveyed.
Now, you might not believe me when I say that prepositions are even bigger troublemakers for SMT. Prepositions include words such as after, as, at, by, during, for, from, in, of, on, with, within and without. The reason they are even bigger culprits is twofold. First, the troubles they cause are hard to fix. Not only does the engine have to recognize these prepositions, but it also has to correctly identify the meaning units they carry and then move these units to their appropriate spots. With attributive and adverbial clauses, there is often a clear meaning unit marked by conjunctions and punctuation marks. For example:
“You will see him if you live in the neighborhood.”
“This is the best product that our store carries.”
However, the boundaries of preposition meaning units are less defined.
“I found him at the bookshop near Starbucks.”
“I followed him for ten minutes as he strolled around on the street.”
Second, preposition-related WO errors can cause significant meaning errors, which are more severe than confusion and awkwardness. For example, in Simplified Chinese, there is no translation for the word “of.” What we do have is an equivalent of apostrophe-s (’s), so an English phrase like “the wings of the butterfly” can only be translated as “the butterfly’s wings.”
The engine has to translate the of-structured phrase into a phrase using ’s. If the engine does not know to switch the word order and only knows to substitute “of” in English with the equivalent of ’s in Simplified Chinese, then it will translate the phrase as “the wings’ butterfly,” which does not make sense at all. What’s worse, an English sentence that reads “The solution of problems requires reporting to management,” which should be translated as “The problems’ solution requires reporting to management,” will unfortunately be translated by the engine as “The solution’s problems requires reporting to management.” What a meaning error!
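The difference between substituting the word and switching the order can be sketched in a few lines of Python. Both functions below are hypothetical illustrations on English glosses, and they assume a single-word head and a single-word owner, which is exactly the simplification that real preposition meaning units do not allow.

```python
import re

# Assumption: "X of Y" where X and Y are single words, optionally preceded by "the".
_OF_PHRASE = re.compile(
    r"(?:\b(?P<det>[Tt]he)\s+)?\b(?P<head>\w+)\s+of\s+(?:the\s+)?(?P<owner>\w+)"
)


def _possessive(noun: str) -> str:
    """Add 's, or a bare apostrophe after a plural ending in s."""
    return noun + ("'" if noun.endswith("s") else "'s")


def _with_det(match: re.Match, first: str, second: str) -> str:
    det = match.group("det")
    return f"{det + ' ' if det else ''}{_possessive(first)} {second}"


def naive_of(text: str) -> str:
    """Substitute 'of' with a possessive marker but keep the English order."""
    return _OF_PHRASE.sub(lambda m: _with_det(m, m.group("head"), m.group("owner")), text)


def reorder_of(text: str) -> str:
    """Substitute 'of' AND swap head and owner, as Simplified Chinese requires."""
    return _OF_PHRASE.sub(lambda m: _with_det(m, m.group("owner"), m.group("head")), text)


print(naive_of("the wings of the butterfly"))    # prints: the wings' butterfly
print(reorder_of("the wings of the butterfly"))  # prints: the butterfly's wings
```

Run on “The solution of problems requires reporting to management.”, `reorder_of` gives the intended “The problems' solution ...”, while `naive_of` reproduces the “The solution's problems ...” meaning error.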
Here is another example of meaning errors caused by WO errors from prepositions. In Simplified Chinese, there is no equivalent of the word after, but only of the word then, as in “do something, then do something else.” Thus, a sentence that says “Go to bed after you brush your teeth” should be translated as “Brush your teeth then go to bed.”
If the machine did not learn to switch the word order and only knows to translate English “after” into Simplified Chinese “then,” the sentence will read “Go to bed then brush your teeth.”
Similarly, a sentence that we see a lot on ecommerce websites is “You will receive the item after you pay.” This should really be translated as “You pay, then you will receive the item.” However, it may be erroneously translated by the engine as “You will receive the item then you pay.”
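The same contrast can be shown for “after.” This is a minimal sketch on English glosses, with hypothetical function names, and it assumes the sentence contains a single “after” that joins two clauses.

```python
def naive_after(sentence: str) -> str:
    """Swap the word only: 'A after B' becomes 'A then B' (wrong order)."""
    return sentence.replace(" after ", " then ", 1)


def reorder_after(sentence: str) -> str:
    """Swap the clauses as well: 'A after B' becomes 'B, then A'."""
    body = sentence.rstrip(".")
    first, sep, second = body.partition(" after ")
    if not sep or not first or not second:
        return sentence  # nothing to reorder
    # Keep the pronoun "I" capitalized; lowercase any other sentence-initial word.
    if first.split(" ", 1)[0] != "I":
        first = first[0].lower() + first[1:]
    return f"{second[0].upper()}{second[1:]}, then {first}."


print(naive_after("You will receive the item after you pay."))
# prints: You will receive the item then you pay.
print(reorder_after("You will receive the item after you pay."))
# prints: You pay, then you will receive the item.
```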
Millions of dollars could be lost due to this seemingly small problem. It is clear that we must do something to train the engine to fix the errors caused by prepositions. However, it is easier said than done. Identifying prepositions is surely easier than identifying adverbs since there are a limited number of prepositions. However, what is challenging at this stage is how to teach the engine to identify the meaning units carried by the preposition. That is a question that is yet to be answered.
So how do we proceed after identifying these miscreants? Logically, we would want to document all their misdemeanors. Using regular expressions in quality assurance tools such as Okapi CheckMate for the words or types of words identified in the sample data, I was able to find all their occurrences in the training data. Then I generated a spreadsheet in Microsoft Excel listing all the sentences containing these words. Next, I reviewed a sample of the sentences for each word to see how many times eMT incorrectly translates that particular phrase. After that I was able to compile the priority list shown here. Let’s call this the mugshot of the culprits, as seen in Figure 1.
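For readers without a QA tool at hand, the search step can be approximated with a short Python script. This is only a sketch of the idea, not the CheckMate setup used in the research; the word list and the function name `find_occurrences` are illustrative assumptions.

```python
import re
from collections import defaultdict

# Assumption: an illustrative subset of the words discussed in this article.
TARGET_WORDS = ["of", "in", "after", "if", "once"]


def find_occurrences(sentences, words=TARGET_WORDS):
    """Map each target word to the sentences that contain it as a whole word."""
    patterns = {
        word: re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
        for word in words
    }
    hits = defaultdict(list)
    for sentence in sentences:
        for word, pattern in patterns.items():
            if pattern.search(sentence):
                hits[word].append(sentence)
    return dict(hits)


sample = [
    "You will see him if you live in the neighborhood.",
    "The wings of the butterfly are blue.",
    "Go to bed after you brush your teeth.",
]
for word, matched in find_occurrences(sample).items():
    print(word, len(matched))
```

The word boundaries (`\b`) matter here: without them, “in” would also match inside “wings.” The resulting lists can then be exported to a spreadsheet for manual review.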
Priority is ranked according to how often a word shows up and how likely it is to cause problems. The word with the highest percentage, “of,” is the word that occurs the most times in the file. Most of the time it creates WO issues, and most of the time eMT gets the WO wrong.
The priority of each word visualized on a pie chart is shown in Figure 2.
“Of” is first on the list, accounting for 25.7% of WO issues, which means it should be the top priority for anyone looking to improve the quality of the translation output. If we can fix the WO issue with “of,” then we fix 25.7% of our WO issues. After that we can move on to the second priority, “in,” and so on.
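One plausible way to turn the review into a ranked list is to estimate the number of WO errors per word (occurrences times the error rate seen in the reviewed sample) and express each estimate as a share of the total. The formula and the counts below are assumptions made up for illustration; they are not the figures behind the priority list in this article.

```python
def priority_shares(stats):
    """Rank words by their estimated share of all WO errors.

    stats maps word -> (occurrences, reviewed, wrong), where `reviewed` is the
    number of sampled sentences checked and `wrong` how many had a WO error.
    """
    estimated = {
        word: occurrences * wrong / reviewed
        for word, (occurrences, reviewed, wrong) in stats.items()
        if reviewed > 0
    }
    total = sum(estimated.values())
    if total == 0:
        return []
    ranked = sorted(estimated.items(), key=lambda item: item[1], reverse=True)
    return [(word, errors / total) for word, errors in ranked]


# Hypothetical counts, for illustration only.
hypothetical = {
    "of": (200, 20, 18),
    "in": (150, 20, 10),
    "after": (30, 10, 9),
}
for word, share in priority_shares(hypothetical):
    print(f"{word}: {share:.1%}")
```

Ranking by share rather than by raw frequency keeps a very common but rarely mistranslated word from crowding out a less frequent word that the engine almost always gets wrong.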
So how do we go about fixing WO issues? The current proposed solution that we designed with our human language technology (HLT) research scientist is to annotate sentences with numbered strings. Figure 3 shows an example:
The English sentence is the source. The first Simplified Chinese translation is the eMT output. The second Simplified Chinese sentence is the post-edited translation. Each string that needs to be moved is numbered in the eMT output. (These strings are underlined here with colored lines to highlight.) As shown, strings 1, 2 and 3 are moved to the beginning of the sentence and are switched to 3, 2 and 1. By numbering the strings that need to be moved in a sentence, linguists can help HLT research scientists to write algorithms that train our engine to better calculate the position of these modifiers when translating from English into Chinese.
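The effect of such an annotation can be sketched in Python. The data layout here (a list of labelled segments plus a target order) is a hypothetical representation chosen for this example, not the actual annotation format, and placeholder strings stand in for the real Chinese segments.

```python
from typing import List, Optional, Sequence, Tuple

# A segment is (label, text); label is None for text that stays in place.
Segment = Tuple[Optional[int], str]


def apply_numbered_moves(
    segments: Sequence[Segment],
    new_order: Sequence[int],
    insert_at: int = 0,
) -> List[str]:
    """Pull out the numbered strings and re-insert them in `new_order`.

    `insert_at` is the position among the unnumbered segments where the
    moved strings should land (0 means the beginning of the sentence).
    """
    numbered = {label: text for label, text in segments if label is not None}
    remaining = [text for label, text in segments if label is None]
    moved = [numbered[label] for label in new_order]
    return remaining[:insert_at] + moved + remaining[insert_at:]


# MT output: the main clause followed by numbered strings 1, 2 and 3.
mt_output = [(None, "MAIN"), (1, "S1"), (2, "S2"), (3, "S3")]

# Post-edit: strings moved to the beginning and switched to 3, 2, 1.
print(apply_numbered_moves(mt_output, new_order=[3, 2, 1]))
# prints: ['S3', 'S2', 'S1', 'MAIN']
```

Pairs of annotated output and post-edited order like this one are the kind of signal from which a reordering model could, in principle, learn where each modifier belongs.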
Simplified Chinese can be a difficult language to learn. There is not really much grammar to it; yet if things are not expressed in exactly the right order, then — as a native would say — it just doesn’t sound right. But one may argue that this is the case with all languages. Lucky for our translation engine, there are still some patterns to follow. What it needs to do is to identify and recognize these modifiers, the meaning units they carry, as well as the key phrases that they are modifying. Then our engine will be one step closer to putting the puzzle pieces together and bringing these culprits to justice.