Cross-lingual text analytics: a new frontier in linguistics

By Meta S. Brown December 20, 2011

In Atlanta, a brand manager needs to know what consumers are saying about soft drinks in Thailand. In Washington, D.C., an intelligence agency analyst wonders if an international terrorist group is using Twitter to organize a bombing. In Tokyo, electronics engineers want to better understand quality problems that are leading to product returns in North America. These diverse professionals share common concerns. Each of them needs information that is locked in some form of text, such as posts to a social network, e-mails or web form submissions. The volume of text involved is large, and the person who needs the information doesn’t understand the required language.

These professionals have different problems, roles, languages and cultures to explore, but one shared solution: cross-lingual text analytics. Market pressure and global competition drive demand for cross-lingual text analytics, and the technology is advancing to meet the challenge. Cross-lingual text analytics is where the action is and will be for years to come. Information about analytic methods and their use in business has been widely available throughout the past century, yet there has been a rapid rise in interest over the past few years. A confluence of changes in technology and economic climate has created both opportunity and pressure for businesses to initiate or expand their use of analytics.

Once requiring tedious hand calculation, statistical analysis has become cheaper and easier with the widespread availability of computers and improvements in software. The rising use of computers has also brought a rise in the volume of the raw material for analytics — data. At the same time, global competition and a weak economy put growing pressure on businesses to improve their practices; they must adapt or they will die.

Text is a form of data and, on the one hand, a rich source of information. On the other, it is bulky, difficult to analyze and often ambiguous in meaning. A number’s a number, but a piece of text might be in any one of a myriad of languages, further complicated by the education, personal style and even the mood of the individual writer. Text analytics centers on converting text into some more easily usable form of data. Cross-lingual text analytics makes this process possible when the end user of the data doesn’t speak the language of the original text. Cross-lingual text analytics enables business people to find order and meaning amid massive quantities of otherwise incomprehensible text.

Data analysis yields value only when those with influence and the power to make decisions choose to put that analysis to use. The cleanest data, most powerful computers and shrewdest analysts do no good if the analysis is ignored or regarded as a mere conversation piece. Predictive analytics methods, such as statistical hypothesis testing and modeling, data mining and operations research, examine relationships among variables and reveal the ways in which elements of a system influence one another. Cycles of data gathering, analysis and field testing provide meaningful information appropriate to guide decision-making. What’s more, a worthwhile analytics program is planned with an eye to providing actionable information, not just interesting insights. Text analysis methods, which center on converting unstructured text into categories based on subject matter and sentiment, convert the rich but largely unmanageable resource of text data into categorical data. Organized into categories, text data can be exploited in analytics processes using traditional techniques.

It’s important to note that the term analytics, in this context, goes far beyond the simple summaries commonly used in reporting, such as totals, averages, bar and pie charts. Those methods are descriptive; that is, they provide a description of the data, but no insight into mechanisms, or how one element of a system influences another. Most techniques and tools promoted as business intelligence or online analytic processing are extensions of these descriptive methods — separating data into myriad segments, but providing no more depth of analysis than ordinary reports. Reports summarize what has happened, whether in the distant past, recently or just milliseconds earlier. Although these summaries may provide a sense of the current state of the business, they do not provide meaningful guidance for action. Reports leave decision makers without information about the influence of one variable on others and without information about likely effects of alternative decisions. The decision maker is left to rely only on experience and intuition.

Historically, the only resource for extracting information from text data was skilled human beings. Your ability to get useful information from text was only as great as the size and skill of your team for translating, scanning and summarizing the text. No access to skilled translators meant no ability to use foreign-language sources. A small team meant that large sources of text would go unused.

Cross-lingual text analysis takes on the issues of language and volume in dealing with text (Figure 1). New processes enable an investigator to enter a search topic in his or her native language, enhance the definition of the search with tools for identifying synonyms and related terms and for disambiguation (selecting the specific meaning desired for terms that have several meanings, as in Figure 2), then automatically translate the search terms into another language, returning only relevant material. Documents returned in the cross-lingual search can be organized by applying linguistic technology directly to the text in its original form. Finally, categorized data is returned to the investigator for use in analysis, along with any required translations.
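The flow described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not a description of any real product: the synonym list, the English-to-Spanish term dictionary, the sentiment cue words and every function name are hypothetical stand-ins, and the disambiguation step is left out entirely. What the sketch does show is the order of operations: only the search terms are translated, and categorization runs on the text in its original language.

```python
# Toy, hand-made resources. Every entry here is a hypothetical stand-in
# for the much larger linguistic resources a real system would need.
SYNONYMS = {"soda": ["soft drink", "pop"]}
TRANSLATIONS = {  # English search term -> Spanish search term
    "soda": "refresco",
    "soft drink": "bebida gaseosa",
    "pop": "gaseosa",
}
# Sentiment cues are kept in the TARGET language, so that documents
# are categorized in their original form, never after translation.
POSITIVE = {"rico", "encanta"}
NEGATIVE = {"horrible", "malo"}


def expand_query(term):
    """Return the search term plus any listed synonyms."""
    return [term] + SYNONYMS.get(term, [])


def translate_terms(terms):
    """Translate only the search terms, not the documents."""
    return [TRANSLATIONS[t] for t in terms if t in TRANSLATIONS]


def search(documents, target_terms):
    """Keep documents that mention any translated search term."""
    return [d for d in documents if any(t in d.lower() for t in target_terms)]


def categorize(document):
    """Assign a crude sentiment category using target-language cue words."""
    words = set(document.lower().replace(".", " ").replace(",", " ").split())
    if words & NEGATIVE:
        return "negative"
    if words & POSITIVE:
        return "positive"
    return "neutral"


def cross_lingual_search(term, documents):
    """Expand, translate, search, then categorize in the original language."""
    target_terms = translate_terms(expand_query(term))
    return [(doc, categorize(doc)) for doc in search(documents, target_terms)]


docs = [
    "Me encanta este refresco.",
    "La gaseosa nueva es horrible.",
    "Hoy llueve mucho.",
]
results = cross_lingual_search("soda", docs)
for doc, category in results:
    print(category, "|", doc)
```

The investigator types one English word, and the unrelated third post never reaches the categorization step. A real system would replace each dictionary with proper linguistic technology for the language in question.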

An important element of cross-lingual text analytics is that analysis steps, such as categorization based on subject matter and sentiment, can and should be performed within the original language of the text. Bulk translation prior to text analysis is often attempted, usually because of a lack of proper tools for the original language of a particular body of text. However, text analysis on translated text consistently yields unacceptable results.

Text analytics in coming years

The popularity of social media and similar applications creates a large and growing body of text data. The volume of potentially useful text is so great, in fact, that neither the business community nor governments have sufficient resources to address it with human interpretation alone, but they are highly motivated to find viable alternatives. Knowledge that text holds clues to such valuable discoveries as sales opportunities, terrorist plots or safer medical practices is driving exploration and investment in technological alternatives and aids to human reading and analysis.

The expanding mass of computerized text is something like a mountain, or rather a range of mountains, rich with precious minerals. The value is there, its presence no secret, but a mere mortal alone is no match for a mountain. Extracting minerals from the rock requires a combination of brute force, because there is a huge volume of material, and sophistication, because the desirable material is a tiny fraction of the whole and not simple to separate from the bulk. The same can be said for text analytics. Organizations that succeed in assembling the right combination of mass scale and accuracy will enjoy a significant competitive advantage.

Cross-lingual text analytics offers unique opportunities to support business growth. Modern communication and transportation enhance the opportunity for doing business all over the world and also bring competition from all over the world. Language limitations translate to limitations of business opportunity and weakness in the face of competitors with superior language skills. What’s more, lack of ability to interpret foreign languages inhibits competitive intelligence, which is the ability to monitor and understand the activities of competitors and the market’s response.

Cross-lingual text analytics enables us to extract meaning from communications in languages we may not personally understand. The brand manager in the United States needs to know what consumers are saying about soft drinks in Thailand and Malaysia, and the engineer in Japan needs to understand the comments on warranty claims from the United States and Mexico, but both are held back, not only by language, but also by the sheer volume of irrelevant material. This is an even more pronounced problem in government intelligence applications where the stakes are high and the messages are often intentionally disguised.

Translation alone is not sufficient to bridge the language gap for business information. Translating irrelevant content wastes resources, drives costs up and still leaves the user with an untenably large mass of text, plus the effects of translation errors. Because of the errors and ambiguity introduced in translation, analytics such as content and sentiment analysis are seriously inaccurate when performed on translated text. For accurate results, text analytics must be performed within the original language of the text. Cross-lingual technology makes it possible to define the relevant information in the user’s native language, automatically translate search terms into the target language(s) and perform a more accurate search with lighter resource demands.

So is it possible to obtain clear value from imperfect technology? We’re often surprised to discover that others interpret our written words in ways different from what we intended. The human brain is the best tool for generating and interpreting language, but it is not perfect. A given segment of text is not interpreted consistently by a single human on multiple occasions, let alone by multiple humans. Since there is no empirical method for determining intended meaning, the baseline for benchmarking software performance is always comparison to a modest sample of human interpretations of text, a rather fuzzy basis for evaluation and improvement. Linguistics technology is consistent, but otherwise not nearly as effective as humans at interpreting language. How are we to put language technology to use when the results are so often incorrect?

Many applications simply don’t require perfection or anything close to it. Consider the everyday practice of direct marketing, appealing directly to individuals to make a purchase or contribution. Most direct marketing solicitations end up in the trash, real or virtual. The direct marketer can accept this because the value of a few positive responses more than offsets total costs. The most successful practitioners are constantly testing the effects of small changes to the process, perhaps a new envelope for a traditional mailing or a new subject line in e-mail, in the quest for optimum returns.
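The arithmetic behind that tolerance for failure is simple enough to write down. The short Python sketch below is illustrative only: the function names and all of the figures (10,000 pieces, 50 cents per piece, a 2 percent response rate, 40 dollars per response) are invented for the example and do not come from any real campaign.

```python
def campaign_profit(pieces_sent, cost_per_piece, response_rate, value_per_response):
    """Total value of the responses minus the total cost of the mailing."""
    total_cost = pieces_sent * cost_per_piece
    total_value = pieces_sent * response_rate * value_per_response
    return total_value - total_cost


def break_even_response_rate(cost_per_piece, value_per_response):
    """Response rate at which the mailing exactly pays for itself."""
    return cost_per_piece / value_per_response


# Hypothetical campaign: 98 percent of the pieces are thrown away.
profit = campaign_profit(
    pieces_sent=10_000,
    cost_per_piece=0.50,
    response_rate=0.02,
    value_per_response=40.0,
)
threshold = break_even_response_rate(cost_per_piece=0.50, value_per_response=40.0)

print(f"profit: {profit:.2f}")
print(f"break-even response rate: {threshold:.4f}")
```

With these made-up figures the campaign earns 3,000 even though 98 of every 100 pieces are ignored, because any response rate above 1.25 percent covers the cost. The same reasoning applies to an imperfect text classifier: it needs to be right often enough to pay for itself, not right every time.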

Internet communications are awash with similar opportunities, and online business powerhouses such as Amazon and Google are continually testing elements of offers and presentations. The offers presented to an individual consumer take into account whatever information the advertiser has available. This may include past behavior, such as purchasing and browsing history; demographics, such as gender and age; and information not specific to the customer, such as weather, date or time of day. Advances in linguistics, coupled with the latest in dynamic advertising tools, open the door to adding new and valuable information to that mix.

Consider this example of an existing online advertising model. Businesses pay to place their ads in front of users of a social networking site. These advertisers can tailor the selection of users who see the ad by a number of elements that are known from the user’s profile, such as age, gender and city of residence. Further, they can specify that ads be presented in response to certain keywords so that a shoe retailer might specify that an ad be displayed in response to a user mentioning a specific brand of shoes, while a dentist might select terms such as toothache and gums bleed.

But terms often have several meanings, and the context in which terms are mentioned speaks to the intent of the user. The person who posts “I’m saving up for a pair of Manolo Blahniks” is in a shoe-shopping frame of mind. The one who posts “What kind of fool would save for Manolo Blahniks?” is not, or not for that particular brand. To the extent that linguistics can differentiate these meanings, it adds value for everyone involved in the process. Publishers adopting linguistics for ad targeting will have happier and more loyal customers and may command higher advertising prices; advertisers will see better returns through improved targeting; and users will view advertising more appropriate for their interests.
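The difference between the two approaches can be shown with a deliberately crude Python sketch. The cue phrases and function names below are hypothetical, and a short list of negative phrases is only a stand-in for genuine linguistic analysis, which would have to handle far more than three fixed strings. The point is simply that plain keyword matching treats both posts alike, while even a little attention to context separates them.

```python
# Hypothetical cue phrases suggesting the writer is NOT in a buying mood.
NEGATIVE_CUES = ("what kind of fool", "never", "waste of money")


def keyword_match(post, keyword):
    """Plain keyword targeting: fires whenever the brand is mentioned."""
    return keyword.lower() in post.lower()


def context_aware_match(post, keyword):
    """Fires only if the brand is mentioned without a negative cue phrase."""
    text = post.lower()
    if keyword.lower() not in text:
        return False
    return not any(cue in text for cue in NEGATIVE_CUES)


posts = [
    "I’m saving up for a pair of Manolo Blahniks",
    "What kind of fool would save for Manolo Blahniks?",
]

for post in posts:
    print(
        keyword_match(post, "Manolo Blahniks"),
        context_aware_match(post, "Manolo Blahniks"),
        "|",
        post,
    )
```

Keyword targeting shows the shoe ad to both writers. The context-aware version shows it only to the first.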

Organizations ignoring foreign-language text are leaving an important market and competitive intelligence resource untapped, giving the advantage to language-savvy competitors. Bulk translation of textual data is unnecessarily slow and expensive, and yields poor-quality results. Cross-lingual text analytics yields fast, accurate text analysis and cost control as well.