If you search the web, you will easily find a great deal of information on machine translation (MT) technology: how it can be used, what it's good for, and when it should and should not be used. What is less readily available, however, is the why: why does MT quality vary across projects? Why does it make strange mistakes? And, particularly, why is MT better at translating between some languages than others?
From the linguist and language enthusiast’s perspective, the latter question is often one of the more curious aspects of language technology. When we answer this question and dig a little deeper into the challenges that language technologists and researchers are tackling in this area, that’s when things get really interesting.
Broadly speaking, the more similar two languages are to one another, the easier it is to develop an MT solution for them. What do we mean by similar? Ultimately, it boils down to similarities in grammar. More specifically, the closer two languages are in terms of the order of the words in their sentences, their morphology and verb conjugation, the less linguistic and technical effort is required to develop translation technology. For this reason, similar languages, such as those in the Romance family of languages, are typically “easier” for MT.
One language that does not share too many similarities with others across the globe is Chinese. There are more difficult languages on the spectrum, but Chinese is certainly up there. And as one of the most widely spoken languages in the world with more than one billion native speakers, the demand to come up with solutions to address these difficulties is ever increasing. So what is it about Chinese that makes it such a challenge for MT? Let’s take a closer look.
Characteristics of written Chinese
A member of the Sino-Tibetan language family, Chinese has influenced many other major languages such as Japanese and Korean. As a spoken language, there are many distinct regional variations, Mandarin being the most common. Spoken Chinese makes extensive use of tones, and the same syllable can have radically different meanings depending on the tone in which it is spoken. Fortunately, because MT operates on written text, where distinct characters disambiguate the tones, this is not an issue for MT; but for other language technologies, such as speech recognition and generation, it adds a further layer of complexity.
Chinese is written using a set of characters known as hanzi. There are more than 100,000 individual characters, though only around 3,000-4,000 are required for basic literacy. Roughly speaking, each character corresponds to a syllable; some characters can stand alone as words, while most combine with others to form multi-character words. Unlike most other languages, Chinese has no spaces between words. This means that unless the context is fully understood, the same sequence of characters in a sentence can often be read in more than one way.
Since the mid-20th century, written Chinese has existed in two forms: Simplified and Traditional. Simplified Chinese uses a reduced set of less complex characters. It is used in Mainland China, whereas Traditional Chinese persists elsewhere — for instance, in Hong Kong, Taiwan and Macau. An official romanization system called Pinyin, which uses the Latin alphabet, has also been developed to facilitate pronunciation and language learning.
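As a quick illustration, Pinyin can be produced programmatically. The sketch below assumes the open-source pypinyin package, which is not a tool named in this article:

```python
# A minimal sketch of Pinyin romanization, assuming the open-source
# pypinyin package (pip install pypinyin) is available.
from pypinyin import pinyin

# Each character is mapped to its tone-marked Pinyin syllable.
print(pinyin("中文"))  # e.g. [['zhōng'], ['wén']]
```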
Relative to English, Chinese actually has quite a simple grammar. Chinese lacks inflection and most words have only a single form. Morphological variants such as number (singular and plural) and verb tense are not expressed grammatically. Rather, they are inferred from the wider context of the sentence. Furthermore, articles (the, a, an) simply do not occur, and other function words, such as prepositions, are used far more sparingly than in English.
For instance, consider the verb to eat. The Chinese verb 吃 (chī) can mean to eat, eat, eats, eating and so on. In order to express tense, additional words such as temporal adverbs (yesterday, tomorrow, now, later) are used explicitly, with no change to the verb form, as shown below:
我吃 (wǒ chī): I eat.
昨天我吃了 (zuótiān wǒ chī le): yesterday I ate (literally, yesterday I eat, with 了 marking the completed action).
明天我吃 (míngtiān wǒ chī): tomorrow I will eat (Literally, tomorrow I eat).
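To see how an MT system might exploit this, here is a toy sketch (purely illustrative, not an actual MT component) in which a temporal adverb, rather than the verb form, determines the English tense:

```python
# Toy sketch: the Chinese verb 吃 never changes form, so an English
# rendering must pick its tense from contextual cues such as adverbs.
ADVERB_TENSE = {"昨天": "past", "明天": "future"}  # yesterday, tomorrow
EAT = {"past": "ate", "future": "will eat", "present": "eat"}

def render_eat(chinese_words):
    tense = "present"  # default when no temporal adverb is present
    for word in chinese_words:
        tense = ADVERB_TENSE.get(word, tense)
    return "I " + EAT[tense]

print(render_eat(["昨天", "我", "吃", "了"]))  # -> I ate
print(render_eat(["明天", "我", "吃"]))        # -> I will eat
```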
In other cases, morphological information is implied by the context. The Chinese noun 图 (tú) means figure, but it does not change if we are talking about more than one figure. The number is simply inferred from the accompanying numerals, as shown below:
如图2所示 (rú tú 2 suǒ shì): as shown in Figure 2.
如图4-6所示 (rú tú 4-6 suǒ shì): as shown in Figures 4-6.
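The same kind of contextual inference can be sketched in code. The example below, again purely illustrative, chooses between Figure and Figures based on whether the reference is a range:

```python
# Toy sketch: 图 has no plural form, so English number must be
# inferred from the numeral(s) that follow it.
import re

def render_figure(reference):
    plural = re.search(r"\d\s*-\s*\d", reference) is not None  # e.g. "4-6"
    return ("Figures " if plural else "Figure ") + reference

print(render_figure("2"))    # -> Figure 2
print(render_figure("4-6"))  # -> Figures 4-6
```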
In terms of basic word order, Chinese is generally classified as a subject-verb-object (SVO) language, like English. However, there are considerable exceptions once we move beyond the most basic sentences. For instance, in noun phrases the head noun always occurs at the end of the phrase, and relative clauses always come before the noun they modify. These modifying structures are marked by a special particle, 的 (de) — the most frequently occurring character in Chinese — which can cause real difficulty for MT.
How this affects MT
When developing MT engines, particularly those built using statistical approaches (which still underpin most practical applications), a certain level of performance can be achieved using purely data-driven techniques, whereby models are trained on large collections of aligned parallel data taken from translation memories, for example.
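As a toy illustration of this data-driven idea, the sketch below counts word co-occurrences across a miniature "parallel corpus"; real statistical systems refine such counts into translation probabilities with expectation-maximization over millions of segment pairs:

```python
# Toy sketch of learning lexical correspondences from parallel data.
from collections import Counter
from itertools import product

corpus = [("我 吃", "I eat"), ("我 看", "I see")]  # tiny aligned corpus

counts = Counter()
for zh, en in corpus:
    # Count every Chinese-English word pairing within each segment pair.
    counts.update(product(zh.split(), en.split()))

# 我 co-occurs with "I" in both segments; 吃 with "eat" only once.
print(counts[("我", "I")], counts[("吃", "eat")])  # -> 2 1
```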
While this approach can work reasonably well and produce usable engines for similar languages, it simply does not cut the mustard for languages like Chinese. The purely statistical approaches lack the ability to learn complex sentence structure differences such as those caused by the de particle. In order to account for this and other characteristics of Chinese, language-specific solutions related to each of the particular challenges need to be developed.
Some of the challenges have been well researched and mature methodologies have been developed for handling them in an effective manner. Others are less straightforward and the best way to resolve them remains an open question. These challenges tend to form the core of active research topics at leading universities working on MT.
We will come back to that later, but first, let's take a look at what the biggest hurdles are for Chinese MT and how developers are addressing them.
Segmentation
Before we can even get started processing Chinese for MT, we face a challenge that we do not typically have to worry about with other languages, Japanese being a notable exception: the fact that there are no spaces between the words. In order to translate one word into another, we first need to actually determine which sequences of characters in a given sentence go together to form words.
Once identified, the words are delimited with spaces, and this process, as illustrated in Figure 1, is known as text segmentation.
However, it is not quite as straightforward as it sounds. Complications arise because segmentation is ambiguous: there are frequently multiple ways in which a given sequence of characters can be segmented, with each segmentation yielding a different meaning. This can be seen in Figure 2, where the four characters together mean claim (as in a patent claim), but if they are segmented into two words of two characters each, we get a completely different meaning for the same sequence.
As text segmentation is not only the starting point for MT but also for many other natural language processing (NLP) applications, it has been well researched and is, as a consequence, a relatively well-understood problem. Current state-of-the-art techniques achieve a high level of accuracy using approaches based on conditional random fields (CRFs), a probabilistic sequence-labeling technique that predicts where word boundaries fall. This approach can be supplemented with resources such as dictionaries to introduce an additional layer of domain knowledge, which is often required for certain types of subject matter, like highly technical content.
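In practice, CRF-based segmenters cast the task as character tagging: each character is labeled as the beginning (B), middle (M) or end (E) of a word, or as a single-character word (S). The sketch below shows how predicted labels are turned back into spaced words; the labels themselves are hypothetical stand-ins for real CRF output:

```python
# Convert BMES character labels into segmented words.
def tags_to_words(chars, tags):
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # the current word ends on this character
            words.append(current)
            current = ""
    return words

# Hypothetical CRF output for the two-word reading in Figure 2:
print(tags_to_words("权利要求", ["B", "E", "B", "E"]))  # -> ['权利', '要求']
# And the one-word (patent "claim") reading:
print(tags_to_words("权利要求", ["B", "M", "M", "E"]))  # -> ['权利要求']
```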
Solutions for Chinese text segmentation have been implemented in a number of software tools, such as the Stanford Word Segmenter (http://nlp.stanford.edu/software/segmenter.shtml) and the ansj segmenter (https://github.com/NLPchina/ansj_seg), which are extensively used in both academic research and commercial applications.
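For readers who want to experiment, the snippet below uses the open-source jieba segmenter for Python (an assumption on my part; it is not one of the tools named above) to segment the ambiguous sequence from Figure 2 and to show how a dictionary entry can steer the result:

```python
# A minimal sketch using the jieba segmenter (pip install jieba).
import jieba

sentence = "权利要求"  # the four-character patent term "claim" (Figure 2)

# Precise mode: jieba chooses the most probable segmentation given its
# built-in dictionary; the output depends on the dictionary version.
print(" / ".join(jieba.cut(sentence)))

# Supplementing the model with domain knowledge, as described above:
# adding a dictionary entry keeps the technical term as a single word.
jieba.add_word("权利要求")
print(" / ".join(jieba.cut(sentence)))
```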
More complex challenges
The biggest source of difficulty with Chinese-English MT, as with many other language pairs, is the difference in word order caused by fundamental differences in grammar and syntax. One specific instance of this challenge that occurs frequently in Chinese involves noun phrases, whose order relative to English is reversed around the 的 (de) particle, as shown in Figure 3.
Generalizing over this example, we can say that adjacent noun phrases (NPs) in Chinese change order in English: Chinese [NP1, NP2] → English [NP2, NP1]. Things get more complicated as sentences become longer and more complex, containing multiple noun phrases and other phrase types. Further examples of general phrase order changes between Chinese and English include:
Three noun phrases: Chinese [NP1, NP2, NP3] → English [NP3, NP2, NP1].
Prepositional phrases: Chinese [PP, NP1, NP2] → English [NP2, NP1, PP].
This is known as a reordering problem for MT and solving it, particularly for Chinese, is still a work in progress.
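To make the idea concrete, here is a toy sketch of rule-based reordering over parser output. The phrase labels and glosses are hypothetical; a real system would work with full parse trees and would also have to insert function words such as of and the:

```python
# Toy sketch: reorder labeled phrases using the generalizations above.
RULES = {
    ("NP1", "NP2"): ("NP2", "NP1"),
    ("NP1", "NP2", "NP3"): ("NP3", "NP2", "NP1"),
    ("PP", "NP1", "NP2"): ("NP2", "NP1", "PP"),
}

def reorder(phrases):
    labels = tuple(label for label, _ in phrases)
    target = RULES.get(labels, labels)  # fall back to source order
    glosses = dict(phrases)
    return [glosses[label] for label in target]

# Hypothetical parser output in Chinese phrase order: [PP, NP1, NP2]
parsed = [("PP", "in China"), ("NP1", "market"), ("NP2", "growth")]
print(" ".join(reorder(parsed)))  # -> growth market in China
```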
In theory, we can effectively detect sentences with such syntax by recognizing the 的 (de) particles. The problem arises when the boundaries of the (various) noun phrases need to be identified so that they can be placed in the correct order in the translation. This does not happen naturally in standard statistical MT, so a distinct process needs to be used to work out which parts of the sentence are which — noun phrases, verb phrases and so on. This process is known as syntactic parsing.
Syntactic parsing, or simply parsing, is a long-studied topic in NLP, though principally for English. Over the last decade, however, big strides have been made in developing and adapting parsing technology for Chinese, as seen in tools such as the Berkeley Parser (http://nlp.cs.berkeley.edu/software.shtml).
The first step in the process is to identify the individual parts of speech for each Chinese word. These are then expanded to find the boundaries of longer phrases. This process can be supported using rules (based on the generalizations above, for example) to try to identify the relationships and dependencies between the phrases. At this point, the constituent parts of the sentence can be machine translated as normal and statistical models and rules for reordering can be applied.
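The first step, part-of-speech tagging, can be tried directly with jieba's posseg module (again an assumption, since no Python tool is named here):

```python
# A minimal sketch of part-of-speech tagging for segmented Chinese.
import jieba.posseg as pseg

for token in pseg.cut("明天我吃"):  # tomorrow I eat
    print(token.word, token.flag)
# Expected output along the lines of: 明天 t (time word), 我 r (pronoun),
# 吃 v (verb); exact tags depend on the dictionary version.
```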
As it stands, this type of approach to handling complex word order changes for Chinese works to an extent. However, it is far from perfect, and the more complex the challenge becomes — with longer sentences, complex subject matter and multiple nested instances of noun phrases and de particles — the less effective it is. As a consequence, research in this area is still very active, with various approaches being explored, including hierarchical models and classifiers that take semantic and discourse context into account.
Not only are some languages harder than others for MT, but even within a given language pair, one translation direction can be harder than the other.
Because of the lack of inflection in Chinese, it is significantly more difficult to develop MT engines for Chinese into English than the other way around. This is because we don't always know which tense the verb should take (grow versus grows in Figure 3), or where we need to insert words that don't exist in the Chinese sentence (articles such as the) and so on. Again, to address this effectively, linguistic techniques such as part-of-speech tagging and morphological analysis need to be used.
Another aspect of the Chinese language that MT engines need to be aware of is which version of written Chinese we are dealing with. As mentioned above, Traditional Chinese has a much richer character set (and more complex characters, as shown in Figure 4), so if we develop MT engines with only Simplified Chinese data (which is far more prevalent), there will be many Traditional Chinese words that we cannot translate.
Solving this problem involves developing a process to convert between the two variations, which, while nontrivial, is possible. In the simplest case, a list of character mappings can be used to perform the conversion. This works effectively when going from Traditional into Simplified Chinese. However, converting in the other direction is less straightforward, as there is significant ambiguity: a single Simplified Chinese character can map to several different Traditional Chinese characters. In such cases, wider context is required in order to disambiguate.
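In practice, open-source converters such as OpenCC implement exactly this kind of context-aware, word-level mapping. The sketch below assumes the OpenCC Python package; the library itself is not mentioned in this article:

```python
# A minimal sketch of Simplified-to-Traditional conversion with OpenCC
# (pip install opencc-python-reimplemented).
from opencc import OpenCC

cc = OpenCC("s2t")  # Simplified-to-Traditional conversion profile

# The simplified character 发 is ambiguous in this direction: it can map
# to 發 (to emit) or 髮 (hair), so word-level context decides.
print(cc.convert("头发"))  # -> 頭髮 (hair)
print(cc.convert("发现"))  # -> 發現 (to discover)
```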
Research trends and activity
The complexity of Chinese for language technology, coupled with its increasing strategic importance on a global scale, means that it is a very attractive topic for universities and research groups working on MT. This has been fueled by readily available funding opportunities, not only within China itself — which has seen a significant increase in research funding in the past couple of years — but also in the West through various Defense Advanced Research Projects Agency (DARPA) programs such as GALE and BOLT.
Within China, some of the leading research groups working on MT include the Chinese Academy of Sciences (CAS), Harbin Institute of Technology (HIT) and Northeastern University (NEU), among others. A comprehensive survey, "Machine Translation in China," was published in the journal of the Asia-Pacific Association for Machine Translation (AAMT) in 2011.
China has a strong tradition of exporting postgraduate students, which has ultimately increased the capacity of research groups around the world to strengthen their work in the area of Chinese MT. Notable groups in this regard include the MT team in the ADAPT Centre at Dublin City University (my own alma mater) in Ireland, co-led by the well-known Chinese MT research leader, professor Qun Liu, as well as NICT-ACT and NTT in Japan and Carnegie Mellon University in the United States. There is also an increasing number of commercial companies investing in research and development (R&D) specifically for Chinese MT, including the likes of Fujitsu R&D Center, Baidu, Microsoft Research Asia and my own company, Iconic Translation Machines.
Supporting these MT research efforts, organizations such as the Linguistic Data Consortium, European Language Resources Association and TAUS have worked long and hard on the collection of large parallel corpora needed to train MT engines.
The space is very active, and that activity is only going to grow. While there are still significant challenges to overcome, and they will not all be solved in the short term, the right people are on the job. The future of Chinese MT is in good hands.