Introduction to localizing for China and Japan

By Frank Lin & Angelika Zerfaß March 7, 2011

China and Japan are the second and third largest economies in the world and possess enormous market potential for foreign products. China has the biggest user base of computers and the internet. Despite its economic, export and technological prowess, Japan imports considerable software technology.

The promise of glittering market opportunities, however, is offset by the daunting challenges posed by globalizing software for Japan and the Chinese-speaking world. Software globalization for Chinese and Japanese is perceived to be difficult compared to European languages, primarily because of the technical consequences of requiring multibyte encoding of the languages.

Besides encoding, there are less obvious yet equally pervasive and demanding challenges because of differences in linguistic and cultural conventions. Localization projects for these two languages typically are more expensive, require longer project duration, and involve more engineering and testing work.

Linguistic challenges

Probably the most intimidating facet of Japanese and Chinese software globalization is their writing systems. The two writing systems share a common set of characters, called hanzi in Chinese and kanji in Japanese. In addition to these characters, Japanese uses two phonetic scripts (strictly speaking, syllabaries rather than alphabets), hiragana and katakana, collectively called kana. A major reason for encoding standards to push for multibyte support is the very large number of Chinese characters. Currently, more than 40,000 Chinese characters are encoded in Unicode. Most of these characters are not in everyday use, but they are nevertheless important for software to support because these rarer characters typically appear in names of people, objects and places.

Though the two languages have similar writing systems, they are very different linguistically. Chinese grammar is relatively simple and generally does not exhibit morphological variations. Verbs do not have tenses and do not conjugate based on the subject. Nouns have no plural forms and are in most cases gender-neutral. Adjectives do not inflect. Japanese grammar is, on the other hand, more complicated. Its verbs have tenses and conjugate, as in Western languages, and though nouns have no plural forms, as in Chinese, adjectives do inflect. In terms of sentence structure, Chinese follows the subject-verb-object model like English, whereas Japanese uses the subject-object-verb model.

Chinese and Japanese are traditionally written both vertically and horizontally. In vertical writing, characters are written from top to bottom, and the next line will go to the left of the current line (right to left). Horizontal orientation became dominant after World War II. In horizontal writing orientation, characters can go from either left to right or from right to left, but today it is mostly written left-to-right as in English, thus making Chinese and Japanese adaptation to modern computer systems easier.

Chinese is a difficult language to learn to write because of the large number of characters and complicated structures of many characters. To help facilitate learning, in the 1950s and 1960s China initiated a series of orthographical reforms by simplifying the writing of many commonly used characters. The result is Simplified Chinese. Traditional Chinese is used in Taiwan, Hong Kong and Macau. Singapore has now adopted the simplified script, though to a certain extent, it still uses Traditional Chinese for people’s names. Simplified and Traditional Chinese are merely different scripts, not different languages; however, there are situations where regional differences in vocabulary exist, especially in computer technologies. Take the word computer as an example. In China it is 计算机. The same word written in Traditional Chinese is 計算機, which in Taiwan actually means calculator. The Taiwanese word for computer is 電腦 (electric brain).

With the globalization of cultures and commerce, and with the proliferation of internet use, the Chinese-speaking world has seen an increasing need to support both kinds of writing in the same software application or website. A good example can be seen in Chinese pages of Wikipedia (Figure 1), where a pull-down menu gives options of simplified and traditional scripts for the page, with the traditional coming in two flavors for Taiwan and Hong Kong/Macau, and a simplified version for China and Singapore.

European languages are alphabet-based, and word sorting is based on a well-defined order of letters in the alphabet, with capitalization and sometimes diacritics playing an additional role. The definition of sort order is usually the dictionary, though in some cases there are small variations, such as international sorting for Spanish and phonebook order for German.

Sorting in Japanese and Chinese is not straightforward because these languages do not have a purely alphabetical writing system. Chinese does not have an alphabet. For many years before the advent of computers, experts were trying to find and standardize an effective way of sorting. Some of the more popular methods include sorting by radical, stroke count and pronunciation. Pronunciation is written with pinyin in mainland China and with bopomofo in Taiwan. Sorting by pronunciation is also known as phonetic sorting. Pinyin is the predominant sorting method in China, though determining the sort order is a multitiered process. For example, pinyin may be the primary tier. When there is a tie (as many Chinese characters have identical pronunciation), secondary criteria such as the stroke count of the characters may be used, and so forth. Pinyin is written using the Latin alphabet, with diacritics, and its order is well defined.
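The multitiered comparison described above can be sketched in a few lines of Python. This is only an illustration: the lookup table is hand-written for five characters, and the choice of tiers (syllable, then tone, then stroke count) is an assumption made for the example. A real application would rely on a complete collation table, such as the one supplied by its database or a collation library.

```python
# Hypothetical per-character data: (pinyin syllable, tone number, stroke count).
# A real system would take this from a complete collation table.
PINYIN_TABLE = {
    "李": ("li", 3, 7),
    "力": ("li", 4, 2),
    "立": ("li", 4, 5),
    "王": ("wang", 2, 4),
    "张": ("zhang", 1, 7),
}

def pinyin_sort_key(text):
    """Build a key with one (syllable, tone, strokes) tuple per character."""
    return [PINYIN_TABLE[ch] for ch in text]

names = ["张", "立", "王", "力", "李"]
sorted_names = sorted(names, key=pinyin_sort_key)
print(sorted_names)  # ['李', '力', '立', '王', '张']
```

Here 力 and 立 share the same syllable and tone, so the stroke count decides their order.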

Japanese sorting is even more complicated than Chinese. Japanese words and sentences are written with a combination of the three writing systems, plus Latin characters, called romaji. While hiragana and katakana have well-defined sort orders within themselves, kanji’s sort orders are not as well defined. Figure 2 shows a Japanese sentence using all writing systems.

As in Chinese, a common way to sort kanji is by pronunciation. Kanji can be annotated with kana letters that give their pronunciation. This annotation is called yomi or sometimes furigana. For example, the kanji 鈴木 can be converted into the kana annotation すずき (suzuki). Well-defined phonetic sorting can be achieved if every kanji in a word or sentence is converted to kana (equivalent letters in hiragana and katakana can be treated with the same rank), which is essentially a character normalization process.
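Treating hiragana and katakana with the same rank can be approximated by folding one script into the other before comparison. The Python sketch below relies on the fact that the basic katakana block in Unicode sits at a fixed offset from the hiragana block. The function name is made up for this example, and the sketch ignores half-width katakana and the long-vowel mark.

```python
def katakana_to_hiragana(text):
    """Map katakana letters to hiragana so both scripts share one sort rank."""
    folded = []
    for ch in text:
        code = ord(ch)
        # Katakana U+30A1..U+30F6 sit exactly 0x60 above hiragana U+3041..U+3096.
        if 0x30A1 <= code <= 0x30F6:
            folded.append(chr(code - 0x60))
        else:
            folded.append(ch)
    return "".join(folded)

print(katakana_to_hiragana("スズキ"))  # すずき
print(katakana_to_hiragana("すずき") == katakana_to_hiragana("スズキ"))  # True
```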

The problem is that kanji characters do not have fixed pronunciations in Japanese. 鈴木 has the pronunciation すずき (suzuki), but 錫気 has the identical pronunciation. Conversely, and worse for the sake of sorting, a character can have different pronunciations depending on the context and its use. In 鈴木, the second kanji 木 is pronounced き (ki). In 木本, 木 is pronounced もく (moku). This one-to-many relation between a kanji and its pronunciations has its root in the borrowing of Chinese writing into Japanese vocabulary while retaining the Japanese pronunciation.

As will be seen in the technical discussion later, the “kanji pronunciation problem” has a profound impact on any engineering work needed to sort Japanese correctly. Japanese phonetic sorting is not always implemented in software applications because of the difficulty in capturing the correct kana representation of the kanji and the need to change character comparison algorithms.

As noted above, furigana can be used for sorting, but are also sometimes used to display the pronunciation of a character above the character itself. These are then called ruby characters. The following are examples of ruby characters in both Chinese and Japanese:

běi jīng	とうきょう
北京 Chinese for Beijing	東京 Japanese for Tokyo

Cultural challenges

The cultural aspect of software globalization deals with two concerns: cultural sensitivity in the presentation and use of the software, and local conventions that necessitate flexibility and adaptability in order for the software to appear “local.”

An example of cultural localization is the use of symbols. A red cross symbol is often used to denote an ambulance or hospital, but in Islamic countries, a red crescent is used. The use of symbols can be cultural as well as political and geographical. A national flag is commonly used to denote a national language on an icon or button. While there are few controversies over the use of the Italian flag to denote Italian, it is rare to see an English flag denoting English in the United States. Likewise, having a Chinese flag denoting the Chinese language in a Taiwanese website is unusual, unless the flag is used to distinguish Simplified and Traditional Chinese.

Also, the representation of more complex information such as weather data can be very different depending on the expectations of the users. The examples in Figure 3 are taken from the Yahoo! weather reports. In Germany, a simple cloud/raindrop/sun picture per day gives the most basic information on the local weather. The United States shows a map of the country with static cloud and sun symbols. Japan, on the other hand, also shows a map of the country but with animated pictures.

On the software side, the sensitivity to pictures could have an impact on which icons are used. One example is the help system. In the United States or in Europe, the user might see a figure resembling Einstein, a magician or even an animal such as a cat or a dolphin guiding the user through questions and answers. For Asian countries, where an animal ranks lower than a human being, an animal teaching a human would not be a good choice for the help icon. The same goes for a western-style magician character, who would have to be changed to an Asian character such as a monk (though care is needed with religious symbols) or a wise old man.

Symbols can also carry different connotations. In the western hemisphere, a ☑ symbol may indicate that something has been done or is OK, but the symbol is used to mark the mistakes in Japanese homework and therefore has a more negative connotation. Moreover, in the United States or Europe, when the user makes a mistake, the software might alert him or her by playing a sound. In Japan, on the other hand, where people often work in tight spaces, sounds that announce a mistake are not well received.

Color is another prominent example of divergent cultural perception. Figure 4 illustrates the different use of the color red in China and in the United States. In the figure are Yahoo! finance stock market index reports for Asia. One can see that the Chinese site uses red to denote stock prices going up (red = celebratory), whereas the US site uses red for just the opposite (red = warning).

In addition, Asian numerology may have different symbolism, and therefore it is necessary to take this into account in the software. Some numbers are auspicious (such as eight in China) and others may be taboo (such as four in many East Asian countries because in Chinese and Japanese, for example, the pronunciation of the number four sounds like the word for death). This is not unlike the deliberate avoidance of the number 13, such as in floor numbering, in some parts of the west.

In most localization projects for European locales, the most obvious adaptations of the software for local conventions are numerical and date formats. In these projects, date formats deal with the ordering of parts of a date, and numeric formats deal with the decimal point. In Chinese and Japanese software, these two differences are slightly amplified, while other local conventions are also worthy of special attention.

The biggest difference between the numbering system in the west and in China and Japan is digit grouping. In the United States, digits are grouped in sets of three, while in China and Japan they are traditionally grouped in sets of four, following number words that are based on units of ten thousand. China and Japan can use either Arabic numerals or their own number characters to represent numbers.
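The difference in grouping is easy to see with a small helper that takes the group size as a parameter. This is a minimal Python sketch; the function name and the use of a comma as the separator for four-digit groups are assumptions made for illustration, not a statement of how any particular locale library formats numbers.

```python
def group_digits(number, group_size, separator=","):
    """Insert a separator every `group_size` digits, counting from the right."""
    digits = str(abs(number))
    groups = []
    while digits:
        groups.append(digits[-group_size:])
        digits = digits[:-group_size]
    sign = "-" if number < 0 else ""
    return sign + separator.join(reversed(groups))

print(group_digits(123456789, 3))  # 123,456,789
print(group_digits(123456789, 4))  # 1,2345,6789
```

Grouped by four, the same value reads naturally as 1 hundred-million, 2345 ten-thousands and 6789.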

Date format is an interesting difference in local conventions, and the root of the difference is geopolitical. Japan uses an old East Asian practice of numbering calendar years by “era.” The era changes whenever a new emperor is enthroned; the year count is reset, and that year becomes the first year of the new era. For example, the year 2011 in Japan would be written as 平成23年 (Heisei year 23, Heisei being the reign name of the emperor and the era name), and 1988 is written as 昭和63年 (Showa year 63). One implicit consequence of the use of a different calendar denotation is that when the era changes, the operating system needs to be updated as well.
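A rough Python sketch of era conversion shows why an era change forces a software update: the era table is data that has to be extended. The table below works at year granularity only, which is a simplification (real eras begin on a specific day, so the year of a changeover belongs to two eras). The 2019 entry for Reiwa postdates this article and is included only to show what such an update looks like.

```python
# Era start years at year granularity only. Real eras begin on a specific
# day (Heisei began on January 8, 1989), so a production table needs full dates.
ERAS = [
    (2019, "令和"),  # Reiwa: added after this article was written
    (1989, "平成"),  # Heisei
    (1926, "昭和"),  # Showa
]

def to_era_year(gregorian_year):
    """Return the year in era notation, for example 2011 -> 平成23年."""
    for start_year, era_name in ERAS:
        if gregorian_year >= start_year:
            return f"{era_name}{gregorian_year - start_year + 1}年"
    raise ValueError("year predates the eras in this table")

print(to_era_year(2011))  # 平成23年
print(to_era_year(1964))  # 昭和39年
```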

In the Chinese-speaking world, Taiwan is the only region still using the era name in its calendar. For example, the year 2011 in Taiwan is also known as “Year 100 of the Republic” (民國100年). Unlike Japan, the era name doesn’t change in Taiwan. The People’s Republic of China, on the other hand, has used the “Common Era” system since its founding in 1949.

Both Japan and Taiwan are seeing wider adoption of the Common Era calendar system in daily life and in software applications, especially on the internet. In such cases, a date would look somewhat like 二〇一一年一月十一日 (2011 + character for year + 1 + character for month + 11 + character for day = January 11, 2011). However, the traditional calendar systems are still indispensable; people typically express important dates, such as the year they were born, in the localized date format.
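The date form shown above mixes two conventions: the year is spelled digit by digit, while the month and day use the positional form with 十 (ten). A minimal Python sketch of that formatting follows; the function names are invented for the example, and other styles (such as Arabic numerals with 年, 月 and 日) are equally common.

```python
DIGITS = "〇一二三四五六七八九"

def year_to_kanji(year):
    """Years are written digit by digit: 2011 -> 二〇一一."""
    return "".join(DIGITS[int(d)] for d in str(year))

def small_number_to_kanji(n):
    """Months and days (1 to 99) use the positional form with 十 (ten)."""
    if not 1 <= n <= 99:
        raise ValueError("expected a value from 1 to 99")
    tens, ones = divmod(n, 10)
    result = ""
    if tens:
        # 10 to 19 are written 十, 十一, ...; 20 and above prefix the tens digit.
        result += (DIGITS[tens] if tens > 1 else "") + "十"
    if ones:
        result += DIGITS[ones]
    return result

def format_date(year, month, day):
    """Combine the parts with the characters for year, month and day."""
    return (
        f"{year_to_kanji(year)}年"
        f"{small_number_to_kanji(month)}月"
        f"{small_number_to_kanji(day)}日"
    )

print(format_date(2011, 1, 11))  # 二〇一一年一月十一日
```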

A person’s name is written differently in China and Japan in that the family name comes before the given name — the famous baseball player is really Suzuki Ichiro, not Ichiro Suzuki. From a technical point of view this can be a challenge for software applications that assume a Western name order. In the West, Hungary is the only country that follows the eastern name order. More and more, however, Chinese names today could include Western names as well.

US-centric software applications have US-centric information formats: social security number, phone number and address format. It is worth mentioning that in address format, the West usually goes from small to large, as in: 1600 Pennsylvania Avenue, Washington, District of Columbia, USA. In China and Japan it is the opposite. A typical address would be: China, Zhejiang Province, Hangzhou City, Changan Street, Number 100. A typical address in Japan might not even include a street name, but the city, a specific part of the city, a specific sub-part of the city and a house number: Japan, 107 Tokyo-to, Chiyoda-ku, Kasumigaseki 1-3-2, Mr. Ikuta Masaharu.

Currency issues in localization aren’t unique to Chinese and Japanese software. It happens that some of the most important currencies in the world use a one-character currency symbol ($, €, £ and ¥, the last of which is used by both China and Japan). However, many currencies use multiple-character symbols, including the New Taiwan dollar (NT$). Another related currency issue is the existence of fractional portions of a monetary quantity. Japan and Taiwan have relatively “small” currencies and do not use sub-units below their basic currency units. Software applications that deal with currency must be flexible enough to handle at least these two currency issues.
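Both issues, symbol length and fraction digits, come down to not hard-coding assumptions. The Python sketch below keeps them in a small table. The table is illustrative and follows the description in this article; it is not an authoritative currency database, and the function name is made up for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Illustrative table: display symbol and number of fraction digits shown to
# users, following the article's description. Not an authoritative source.
CURRENCIES = {
    "USD": ("$", 2),
    "JPY": ("¥", 0),
    "TWD": ("NT$", 0),
}

def format_money(amount, currency_code):
    """Round to the currency's fraction digits and prefix its symbol."""
    symbol, fraction_digits = CURRENCIES[currency_code]
    quantum = Decimal(10) ** -fraction_digits  # 1, 0.1, 0.01, ...
    rounded = Decimal(str(amount)).quantize(quantum, rounding=ROUND_HALF_UP)
    return f"{symbol}{rounded}"

print(format_money(19.5, "USD"))  # $19.50
print(format_money(1980, "JPY"))  # ¥1980
print(format_money(1250, "TWD"))  # NT$1250
```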

Chinese and Japanese use punctuation similar to that of English. Some of the notable differences include the East Asian square quotation marks (「 and 」), the period (。) and the foreign name separator. Barack Obama is written バラク・オバマ in Japanese and 贝拉克·奥巴马 in Chinese, where the dot in the middle separates first and last name in a non-East Asian name. There is no capitalization in either language. Chinese has no alphabet, and Japanese alphabet systems have no distinction of uppercase or lowercase.

In general, word break by blank space does not exist for Chinese and Japanese. In Simplified Chinese, “Today I plan to write three novels” is written as 今天我计划写三本小说。In Japanese, it is 今日私は3つの小説を書くことを計画する。

Technical challenges

Because East Asian languages require multibyte support for the writing systems, it is natural to choose Unicode as the encoding method. Some of the most prevalent software development tools and programming languages today are Unicode-compatible. Still, a large number of programs, usually written some years ago, do not support Unicode. Thus, for these programs, Unicode support may be the first order of business in East Asian software globalization.

There are situations, though, where implementing Unicode support is less than ideal. The reasons are both technical and economic. Non-Unicode programs that handle large amounts of data processing at byte and character level are susceptible to issues when ported to Unicode, often because of the interpretation of data. An example is a C++ variable declaration: char str[100]. Is a character string in the program a string of characters or a string of bytes? A string of bytes should not become a double-byte string in a Unicode program. This seemingly simple problem is exacerbated as the program’s size and age increase.
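The character-versus-byte ambiguity can be made concrete with a short Python illustration, since Python keeps the two notions in separate types. The sample string is arbitrary; the point is that the same three characters occupy a different number of bytes in each encoding, so a buffer sized in bytes holds an encoding-dependent number of characters.

```python
text = "日本語"  # three characters

char_count = len(text)
utf8_bytes = len(text.encode("utf-8"))
utf16_bytes = len(text.encode("utf-16-le"))
cp932_bytes = len(text.encode("cp932"))

print(char_count)   # 3 characters
print(utf8_bytes)   # 9 bytes
print(utf16_bytes)  # 6 bytes
print(cp932_bytes)  # 6 bytes

# A 100-byte buffer therefore holds at most 33 of these characters in UTF-8.
buffer_size = 100
max_chars_utf8 = buffer_size // len("日".encode("utf-8"))
print(max_chars_utf8)  # 33
```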

In the “legacy software” situation, there is an alternative of not converting the program to Unicode, though for mainland China, this could be a problem because of the requirements of GB18030 (the encoding standard mandated by the Chinese government). It is feasible to localize a modern program to Chinese and Japanese without converting to Unicode. Microsoft’s parlance for non-Unicode encoding is MBCS (multibyte character set). When using the Chinese (CP936 in China or CP950 in Taiwan) or Japanese (CP932) code page, a character can be either one or two bytes. Characters in the 7-bit ASCII code chart (English letters and punctuation) are encoded using one byte, and almost everything else is two bytes (Chinese characters, katakana, hiragana and Latin letters with diacritic marks such as á). In an MBCS environment, the programming challenge is that the string of characters is now a combination of one-byte and two-byte characters, so functions need to be developed to determine whether the next character in the string is single-byte or double-byte while parsing the string.
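The kind of function described above can be sketched for CP932 as follows. This Python example assumes the commonly documented lead-byte ranges for CP932 (0x81 to 0x9F and 0xE0 to 0xFC); these should be checked against the code page documentation, and the other code pages use different ranges. It also assumes well-formed input and does not handle a string that ends in the middle of a character.

```python
def is_cp932_lead_byte(byte):
    """True if the byte starts a two-byte character in CP932 (Shift-JIS)."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC

def split_cp932(data):
    """Split a CP932 byte string into characters of one or two bytes each."""
    chars = []
    i = 0
    while i < len(data):
        # Look at the current byte to decide how many bytes to consume.
        width = 2 if is_cp932_lead_byte(data[i]) else 1
        chars.append(data[i:i + width].decode("cp932"))
        i += width
    return chars

encoded = "Aア日本".encode("cp932")
print(len(encoded))          # 7 bytes for 4 characters
print(split_cp932(encoded))  # ['A', 'ア', '日', '本']
```

Scanning byte by byte without this check is unsafe, because the second byte of a two-byte character can fall in the ASCII range and be mistaken for a letter or a path separator.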

As these languages have no space separating words or characters, programs that assume a space as word break will not work correctly. Other common incorrect assumptions concern capitalization, Western name order and punctuation.
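The word-break assumption fails in a very visible way, as the Python sketch below shows with the example sentence used earlier. The second half is a toy segmenter based on greedy longest-match lookup; its four-entry dictionary is hypothetical, and real segmenters depend on large lexicons or statistical models.

```python
english = "Today I plan to write three novels"
chinese = "今天我计划写三本小说。"

print(len(english.split()))  # 7
print(len(chinese.split()))  # 1: no spaces, so the sentence is one "word"

# Hypothetical mini-dictionary; real segmenters use large lexicons or models.
DICTIONARY = {"今天", "计划", "三本", "小说"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)

def segment(text):
    """Greedy forward maximum matching against DICTIONARY."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += length
                break
    return words

print(segment(chinese))  # ['今天', '我', '计划', '写', '三本', '小说', '。']
```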

In the Microsoft Windows environment, an input method editor (IME) is widely used for East Asian languages. With an IME and a standard English keyboard, a user can readily input Chinese and Japanese characters and kana. Computer users in mainland China are used to the English keyboard as an input device because pinyin is written with Latin letters. Users in Taiwan are likely to see Traditional Chinese keyboard overlays, with phonetic signs and radicals overlaying English keys. For Japanese, keyboard input can be in either kana mode or romaji mode. Kana mode uses a Japanese kana overlay of the keyboard. With the romaji mode, Japanese words are spelled with Latin letters before they are converted to kana.

Because of the need to keep backward compatibility with legacy computer systems, Japanese software typically needs to support six types of encoding. Besides the three scripts in the writing system, the following three encoding schemes are used: half-width hiragana, half-width katakana and half-width alphanumeric. Software needs to recognize the equivalency across different encoding systems. For example, in Unicode, the full-width katakana letter ア has the code point 0x30A2. The half-width ｱ has the code point 0xFF71. Both may need to be normalized.
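In a Unicode program, one way to treat the two forms as equivalent is compatibility normalization. The Python sketch below uses the standard library's unicodedata module with the NFKC form, which folds half-width katakana to full-width and full-width alphanumerics to ASCII. Whether NFKC is appropriate depends on the application, since it also folds other compatibility characters.

```python
import unicodedata

full_width = "\u30A2"  # ア, full-width katakana A
half_width = "\uFF71"  # ｱ, half-width katakana A

# The raw code points differ, so a plain comparison fails.
print(full_width == half_width)  # False

# After NFKC normalization the half-width form matches the full-width form.
normalized = unicodedata.normalize("NFKC", half_width)
print(normalized == full_width)  # True

# NFKC also folds full-width alphanumerics to their ASCII equivalents.
print(unicodedata.normalize("NFKC", "ＡＢＣ１２３"))  # ABC123
```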

The non-deterministic nature of kanji pronunciation in Japanese makes computer support of sorting more difficult. Regular sorting methods such as a collation table do not work. Instead, it is usually necessary to store the pronunciation of each piece of sortable kanji data separately. Yomi becomes the sort key for the corresponding kanji data. Capturing yomi for a kanji can be done both explicitly and implicitly. Explicit capturing simply asks the user to input the yomi for the corresponding kanji. Implicit capturing would require the software to interact with the IME to acquire keystroke information while the user enters the kanji through IME.
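Storing yomi alongside the kanji and using it as the sort key can be sketched as follows. The record type and its field names are hypothetical. The sketch compares hiragana by code point, which approximates the usual kana order for these simple names; a complete implementation would use a proper Japanese collation for voiced marks, small kana and long vowels.

```python
from dataclasses import dataclass

@dataclass
class Contact:
    name: str  # display form, usually kanji
    yomi: str  # reading in hiragana, captured from the user or the IME

contacts = [
    Contact("鈴木", "すずき"),
    Contact("田中", "たなか"),
    Contact("山田", "やまだ"),
    Contact("佐藤", "さとう"),
]

# Sorting on the kanji compares code points, which has no phonetic meaning.
by_code_point = [c.name for c in sorted(contacts, key=lambda c: c.name)]

# Sorting on the stored yomi gives the phonetic order.
by_yomi = [c.name for c in sorted(contacts, key=lambda c: c.yomi)]

print(by_code_point)
print(by_yomi)  # ['佐藤', '鈴木', '田中', '山田']
```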

Because of the difficulty and scale of effort involved in reengineering software written for English to support phonetic sorting and yomi data, sorting is not always correctly implemented. An example of excellent Japanese sorting implementation is Microsoft Excel.

In contrast, while Chinese uses pronunciation for sorting as well, the pronunciation of most Chinese characters is deterministic. In other words, each character has a fixed rank in the pinyin sorting system. Thus, sorting can be done using a collation table. Modern database systems usually provide support for Chinese collation sorting. There are a small number of Chinese characters that can have multiple pronunciations. In these cases some amount of customization may need to be done. In the worst case, yomi style implementation can be considered, though for most applications, this would be overkill for Chinese.

Text search can be done using either exact string matching or pronunciation. In computing, because pronunciation information is not embedded in the text, it needs to be stored separately. The implementation is like Japanese sorting — requiring extra columns in the database to store pinyin/yomi data. Both Chinese and Japanese may require this. Additional linguistic and orthographic conventions in Japanese may cause additional technical complications.
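The extra-column approach can be sketched with an in-memory SQLite database from Python's standard library. The table and column names are invented for this example, and the readings are entered by hand. The same pattern would apply to a pinyin column for Chinese.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The "reading" column stores the yomi separately from the displayed name.
conn.execute("CREATE TABLE contacts (name TEXT, reading TEXT)")
conn.executemany(
    "INSERT INTO contacts (name, reading) VALUES (?, ?)",
    [
        ("鈴木", "すずき"),
        ("佐藤", "さとう"),
        ("田中", "たなか"),
        ("杉山", "すぎやま"),
    ],
)

def search_by_reading(prefix):
    """Return names whose reading starts with the prefix, in reading order."""
    rows = conn.execute(
        "SELECT name FROM contacts WHERE reading LIKE ? ORDER BY reading",
        (prefix + "%",),
    )
    return [name for (name,) in rows]

print(search_by_reading("す"))  # ['杉山', '鈴木']
print(search_by_reading("た"))  # ['田中']
```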

An application or website may need to be able to support both Simplified and Traditional Chinese scripts. If both scripts need to be displayed concurrently on the same screen, Unicode encoding would provide much greater technical advantage, if not the sole solution for encoding.

Taiwan has the distinction of being one of only two locales with the so-called Year-100 problem, the other being North Korea. Similar to Y2K, some software applications use only two digits to store the year. Recall that Taiwan uses the “Republic Era” calendar, and 2011 is the year 100, which a two-digit field stores as 00 and which the software could therefore interpret as Year 0, causing problems. Many firms, especially banks, have rushed to certify that their software systems are Year-100 safe.
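The arithmetic behind the problem is simple enough to show directly. In this Python sketch the two-digit field is simulated with a modulo operation; both function names are made up for the example.

```python
def roc_year(gregorian_year):
    """Republic Era year: 1912 is year 1, so 2011 is year 100."""
    return gregorian_year - 1911

def store_two_digits(year):
    """Simulate a legacy field that keeps only the last two digits."""
    return year % 100

print(roc_year(2010), store_two_digits(roc_year(2010)))  # 99 99
print(roc_year(2011), store_two_digits(roc_year(2011)))  # 100 0
```

A record dated in 2011 would therefore sort before one dated in 2010 in any system that compares the stored two-digit values.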

Font and point size are often issues in East Asian user interfaces. For a given font, Chinese and Japanese may require a larger point size than English, and often simply going up one point size is not enough. Chinese and Japanese characters don’t “look good” at just any point size. Some sizes are used more often than others (for example, 12 point rather than 10 or 11 point) because they make the characters aesthetically pleasing to look at.

If a dialog with several tabs does not leave any space for text expansion into Japanese, it might be necessary to add new tabs and switch the functionalities to these new tabs in order to accommodate the longer text in a Japanese translation.

Project and translation

Chinese and Japanese localization projects typically take longer than European language localization projects. Engineering work to account for different cultural and linguistic conventions is a big contributor. In addition, translation and testing can take longer.

Because translation is regional and cultural within the same language, there could be some variance in translation due to vocabulary differences and regional preferences. The authors have experienced Japanese projects where the Japanese side sees a need for more detailed information and explanation than is provided in the original documents. Thus, a translator could be moved to add more text than was originally there in the source. This is not a problem for the end product in itself, but translation memories, or the result of aligning the documentation that accompanies the software, might contain sentence pairs that do not have an exact one-to-one correspondence, or additional Japanese text that has no counterpart in the source.

Chinese and Japanese software globalization is a complicated process, but with careful planning, analysis, design and execution, these challenges are surmountable. In the end, successful software globalization for these two languages can open the door to great commercial opportunities.