Creating a language lexicon for Asian languages

By Jordi Torras September 16, 2016

Approximately 60% of the world’s population today natively speaks an Asian language. Linguists might find the respective complexities of these languages compelling, but the nuances become a serious challenge for both small businesses and enterprises wanting to compete in a global market.

As global online sales edge toward an expected $2.3 trillion by 2017, and with Asia-Pacific spending outpacing purchases in North America, cross-border shopping has become a significant opportunity for US retailers. This means that having an international customer experience strategy is essential.

For global companies that deal with multiple languages and variations, one of the key entry points to the customer journey is through a website’s search engine. From product inquiries to customer support questions, search engines facilitate most customer interactions. But with so many languages in a given region (Asia alone has over 2,000), how do website search engines manage them all?

Semantics and Meaning-Text Theory

Natural language and semantic search have been used by the big search engines for a while now, but they are just starting to become commonplace for business websites. Natural language processing (NLP) is the part of artificial intelligence concerned with programming computers to understand natural language. As opposed to a keyword search approach, NLP allows customers to search the way they speak by computing the overall meaning of the search query. It helps them find what they’re searching for even when they use incomplete, ambiguous or unstructured questions, much as a human agent would when receiving a query and returning relevant results.

Natural languages tend to have ambiguities that formal languages do not because they have been constructed in fundamentally different ways. For example, in natural language interactions, the same word, phrase or even entire sentence can have multiple meanings, and one concept may be expressed in multiple different ways. This flexibility gives natural language its expressive nature, but also creates opportunities for confusion and varied interpretations.

However, by moving from lexicon to semantics, detailed and specific descriptions of the lexical units can be created for several different languages. The Meaning-Text Theory (MTT) is an advanced linguistic framework that helps to identify and bridge the gaps between natural and formal languages. Meaning-Text linguistics recognizes that the elements in the lexicon (lexical units) of a language can be related to one another in an abstract semantic sense. These relations are represented in MTT as lexical functions, so the description of the lexicon is a crucial part of any NLP system that aims at deep understanding.

Lexical functions are tools specially designed to represent the relations between lexical units. They allow us to formalize and describe in a relatively simple manner the complex lexical relationship network that languages present and assign a corresponding semantic weight to each element in a sentence. Most importantly, they allow us to relate analogous meanings no matter in which form they are presented.

Natural languages are more restrictive than they may seem at first glance. In most cases, we run into fixed expressions sooner or later. Although these have varying degrees of rigidity, ultimately they are fixed, and must be described according to some characteristic. Consider these four examples:

Obtain a result

Do a favor

Ask a question

Raise a building

All of these show us that it is the lexicon that imposes selectional restrictions, since we would hardly find “do a question” or “raise a favor” in a text. The most important factor when analyzing these phrases is that, in terms of meaning, the elements do not have the same semantic value. As shown in the examples above, the first element (the verb, in these cases) hardly provides any information, and almost all of the meaning or semantic weight is provided by the second element.

The crucial matter here is that the semantic relationship between the first and second element is exactly the same in every example. Roughly, what we are saying is “make/perform X” (where perform can take the form do, obtain, ask or raise, and X takes the form a result, a favor, a question or a building). This type of relation can be represented by the Oper lexical function.
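As a minimal sketch of this idea, the Oper relation can be modeled as a lookup from a noun to the support verb it selects. The table below holds only the four examples from the text; the names OPER, oper, is_valid_collocation and semantic_core are hypothetical and are not taken from any MTT library or product.

```python
from typing import Optional

# Illustrative Oper table: noun -> the support verb that noun selects.
# Entries are the four examples from the text; everything else is assumed.
OPER = {
    "result": "obtain",
    "favor": "do",
    "question": "ask",
    "building": "raise",
}


def oper(noun: str) -> Optional[str]:
    """Return the support verb recorded for the noun, or None if unknown."""
    return OPER.get(noun)


def is_valid_collocation(verb: str, noun: str) -> bool:
    """True when the verb is the one licensed by the noun's lexical entry."""
    return oper(noun) == verb


def semantic_core(verb: str, noun: str) -> str:
    """Collapse a licensed support-verb phrase to 'perform(X)'.

    The verb carries almost no meaning of its own, so only the noun is kept.
    Unlicensed pairs are left as they are.
    """
    if is_valid_collocation(verb, noun):
        return f"perform({noun})"
    return f"{verb}({noun})"
```

With this table, is_valid_collocation("ask", "question") holds while is_valid_collocation("do", "question") does not, and all four phrases reduce to the same perform(X) shape.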

MTT collects around 60 different types of lexical functions, which allow, among other things, the description of relations such as synonymy (buying and purchasing are identical actions), hypernymy/hyponymy (a dog is a type of animal) and other relations among lexical units at the sentence level. This extends to examples such as the Oper that we mentioned before, or ones expressing the concept “a lot”: if you smoke a lot you are a heavy smoker, but if you sleep a lot, you are not a “heavy sleeper,” or at least not necessarily.
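The other relations mentioned here can be sketched in the same way, as small per-word tables. This is only an illustration under stated assumptions: the dictionaries SYNONYM_OF, HYPERNYM_OF and INTENSIFIER, and the functions built on them, are hypothetical names, and the entries are limited to the examples given in the text.

```python
from typing import Optional

# Toy lexical tables built from the examples in the text (all names assumed).
SYNONYM_OF = {"purchase": "buy"}    # synonymy: map to one canonical form
HYPERNYM_OF = {"dog": "animal"}     # hypernymy: word -> more general word
INTENSIFIER = {"smoker": "heavy"}   # "a lot": word -> the intensifier it selects


def canonical(word: str) -> str:
    """Map a word to its canonical synonym; unknown words map to themselves."""
    return SYNONYM_OF.get(word, word)


def is_a(word: str, category: str) -> bool:
    """Walk up the hypernym chain and report whether category is reached."""
    current = word
    while current in HYPERNYM_OF:
        current = HYPERNYM_OF[current]
        if current == category:
            return True
    return False


def intensify(word: str) -> Optional[str]:
    """Return the fixed 'a lot' expression for a word, or None if none is recorded."""
    adjective = INTENSIFIER.get(word)
    return f"{adjective} {word}" if adjective else None
```

Here intensify("smoker") gives "heavy smoker", while intensify("sleeper") gives None because no intensifier is recorded for that word. The point of the design is that the choice of intensifier lives in the lexical entry of each word rather than in a general rule.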

Asian language lexicons

Creating lexicons for Asian languages (Chinese, Japanese and Korean) was no small feat, as they required complex systems including different written styles of characters and extensive grammatical structures to express politeness and formality.

The Japanese language lexicon was particularly tough to pair with NLP applications because the language has four different writing systems that can be used together and interchangeably. The Chinese lexicon was designed to simultaneously support traditional and simplified Chinese writing systems, which allows the same semantic technology to be used in mainland China, Hong Kong, Macau, Taiwan and overseas Chinese communities. The Korean lexicon was written almost entirely in Hangul, whose letters are grouped into syllable blocks rather than written one after another in a simple linear sequence.
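To make the writing-system issue concrete, here is a rough sketch that uses Python's standard unicodedata module to report which scripts occur in a string and to split a Hangul syllable block into its component letters (jamo). The prefix table and the function names are assumptions made for illustration, and the "four systems" are taken here to be kanji, hiragana, katakana and Latin letters; a production system would need a far more careful treatment.

```python
import unicodedata

# Coarse script labels keyed by the start of the Unicode character name.
# This prefix table is an illustrative assumption, not a complete classifier.
SCRIPT_PREFIXES = {
    "HIRAGANA": "hiragana",
    "KATAKANA": "katakana",
    "CJK UNIFIED IDEOGRAPH": "han",   # kanji in Japanese, hanzi in Chinese
    "HANGUL": "hangul",
    "LATIN": "latin",
}


def script_of(ch: str) -> str:
    """Classify one character by its Unicode name; 'other' if unrecognized."""
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES.items():
        if name.startswith(prefix):
            return label
    return "other"


def scripts_in(text: str) -> set:
    """Return the set of recognized scripts used in the text."""
    return {script_of(ch) for ch in text} - {"other"}


def hangul_jamo(text: str) -> list:
    """Decompose precomposed Hangul syllable blocks into jamo using NFD."""
    return list(unicodedata.normalize("NFD", text))
```

For instance, scripts_in("東京でコーヒーを飲む Tokyo") finds han, hiragana, katakana and latin characters in a single short phrase, and hangul_jamo("한") yields three separate jamo for what is displayed as one block.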

Structure of the sentence

Japanese and Korean have many characteristics that differentiate them from Romance languages or English. For example, the normal sentence structure in both languages is Subject > Object > Verb, which places the verb at the end of the sentence, unlike in English or Romance languages. In Japanese, Kochira wa Tanaka-san desu means “This person is Mister Tanaka,” but its literal word order is “This Mister Tanaka is.”

In fact, the elements before the verb can appear in several positions. The further left an element is placed, the more it is emphasized, so elements can be reordered much more freely than in Romance languages or English. Let’s take a look at the following example:

Translation: John has given a book to Veronica.

Korean: John ga Veronica ege ch’aek ul chu-o-t-ta.

Japanese: John ga Veronica ni hon o age-ta.

If we split that sentence in pieces of meaning, we would have:

A = John ga

B = Veronica ege / ni

C = ch’aek ul / hon o

D = chu-o-t-ta / age-ta

With the verb fixed at the end, the other three elements can be arranged in six possible orders. ABCD is the normal order, as shown above. However, you might also see ACBD, BACD, BCAD, CABD and CBAD. As these combinations show, any element can be moved to the left depending on the emphasis. The only element that remains in the same position is the verb (this is also the case with adjectives).
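The count of six follows from permuting the three non-verb elements (3! = 6) while keeping the verb last. A short sketch using the Japanese pieces from the example above; the ELEMENTS table and the function names are hypothetical and exist only for this illustration.

```python
from itertools import permutations

# The four pieces of meaning from the Japanese example sentence.
ELEMENTS = {
    "A": "John ga",
    "B": "Veronica ni",
    "C": "hon o",
    "D": "age-ta",   # the verb, which always stays in final position
}


def verb_final_orders(arguments: str = "ABC", verb: str = "D") -> list:
    """List every ordering of the arguments with the verb kept at the end."""
    return ["".join(p) + verb for p in permutations(arguments)]


def render(order: str) -> str:
    """Turn an order such as 'BACD' into the corresponding romanized sentence."""
    return " ".join(ELEMENTS[key] for key in order) + "."


if __name__ == "__main__":
    for order in verb_final_orders():
        print(order, "->", render(order))
```

Running it lists ABCD, ACBD, BACD, BCAD, CABD and CBAD, the same six orders given in the text.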

NLP systems based on MTT focus on the meaning, so despite all of these different structures, we are able to grasp the meaning without needing to describe the full grammar. In building a lexicon, however, grammar matters. Each language is unique and has its own rules and restrictions; building lexicons for Asian languages is dependent on including demonstratives and determinatives, for example. These present unique challenges for building a Japanese or Korean lexicon, and hence a grammatical system that can be understood through NLP.

In Japanese and in Korean, words have neither gender nor number. This type of information is optional and is added using affixes. In both languages, demonstratives and determinatives have several degrees, as is shown in Table 1.

Japanese verbs have three main tenses: past, non-past (which includes present and future) and continuous present (-ing verbs in English). Korean has a more complex and rich verbal system that makes explicit differences between future, conditional, near future, far future and so on.

Unlike artificially created languages like computer programming languages, natural language gives us the ability to understand, process and utilize the everyday semantics that we communicate with. Through the creation of these complex lexicons, businesses can now understand the meaning behind the questions asked by their Japanese, Chinese and Korean speaking customers.

Asian languages have already made a significant impact on business and culture worldwide and will continue to exert increasing influence for the foreseeable future. For businesses large and small interested in competing in a global, multilingual economy, it’s imperative that they not only understand the differences among natural languages and between natural and formal languages, but that they can also leverage language nuances and subtleties to refine the online user experience in a meaningful and profitable way.