Integrating language technology

Language technology, human language technology and natural language processing (NLP) are near synonyms used interchangeably. I personally like the term NLP, as it stresses the affinity and common pedigree with the logical analysis of natural languages, the research strand that started roughly with Aristotle’s analytics, in fact one of the first documented formal methods for parsing natural language to facilitate mechanical reasoning. Attempts to formalize natural languages led to Gottfried Wilhelm Leibniz’s ideas of an absolutely unambiguous philosophical language and to the invention of binary encoding, which in turn, a few hundred years later, gave birth to computers and computer science. The first generalized computing engines were developed to facilitate codebreaking during World War II.

A decade later, machine translation (MT) was born from the idea of treating translation as decoding an encrypted message. Another three to four decades on, localization technology developed from MT-related corpus methods in the form of translation memory. So, in fact, localization technology is a little niece of language technology.

NLP now has a handful of prominent areas that are nevertheless at different levels of adoption and expectation in the industry. One prestigious NLP research group, the Language Technologies Institute at Carnegie Mellon University, summarizes them as its “Bill of Rights”: get the right information to the right people at the right time, in the right language, the right media and at the right level of detail.

However, all the items that need to be right pose a multilingualism challenge, as the vast majority of the online population regularly interacts in only one language. The challenge is to make information shareable and comparable across language silos. This means the NLP areas of MT, cross-lingual retrieval and (automated) language learning amplify the usability of the other areas: search engines, question answering and text mining; adaptive filtering and personalization; task modeling and behavioral predictions based on anticipatory analysis; speech recognition and synthesis; summarization and drill-down techniques.

 

Localization vs. language technology

According to a prominent NLP researcher, the major barrier to language technology adoption is that “industries are still sitting in their XML caves.” This is only anecdotal evidence, but it is nevertheless illustrative of the situation. The reality is that most corporate content is now owned or managed with support cost as the driving concern. Cutting down support cost is an imperative, but how do you do it without damaging the user experience? Have you ever called your tech support and ended up banging your handset in desperation to get a live person on the other end of the line? Then you have been a victim of the apparent success that NLP has achieved in your language. Integrating automated technologies is great, but there obviously are some gating factors. One is fitness for purpose, another is the business case and the third is getting your integration right. Getting it right means doing your homework with key users and using just the right amount of automation in the least intrusive way possible.

People do not mind using NLP-facilitated services if they are transparent and invisible to the user. An example here is a mature search engine. You type your search phrase into a single input field in your browser (in most browsers this is even the functionally overloaded address field, which anticipates your typing and whether you intend to enter an exact location or a search term) and do not care what clever thing the machine did, as long as you get four out of five relevant entries among the top results within milliseconds. You are ever so slightly annoyed if you need to use search syntax or call up advanced search and filtering options to make what you want appear.

But even search is not paradise; let us think of an example put forth by Alan Melby. Consider a person searching for pigs as the term is used in metallurgy, as opposed to the animals prized for their bacon. Wouldn’t it be useful to have the content marked up by domain, so that you could simply tell the engine you are only interested in the metallurgical pig?

While it is clear that metadata is needed for the intelligent consumption of content, people keep thinking in terms of manually marking up the old corpora. However, looking at the exponential explosion of knowledge and content, surely the future corpora are more important than the existing ones, which will be dwarfed by the new ones in no time. Therefore, the chief method should be to introduce metadata at content creation time, with no or minimal extra effort for the originator. Another smart way to enrich content with metadata that will facilitate further processing is to use text analytics and mark recognized entities as term candidates. Right, this can be done, but how could it be integrated into your translation or terminology management process? The answer is the Internationalization Tag Set (ITS) 2.0, the successor W3C standard to ITS 1.0, currently being developed by the MultilingualWeb-LT Working Group at the W3C. This effort adds several new data categories to the seven original ones encoded in ITS 1.0. Another important development in ITS 2.0 is that it normatively specifies ways to express its internationalization and localization metadata categories in HTML5 content, not only in XML-based content formats as in 1.0. With ITS 2.0 you will be able to mark up your content with domain information, identify and pass on term candidates, mark and communicate quality issues, report the confidence of an MT or text analytics service and more.
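To make this concrete, here is a minimal sketch in Python of what such an integration could look like. It assumes a hypothetical text analytics step has already returned term candidates with confidence scores, and it uses the ITS 2.0 local HTML5 attributes of the Terminology data category (its-term and its-term-confidence); it is an illustration, not a production annotator.

```python
# Minimal sketch: wrap term candidates from a (hypothetical) text analytics
# step in ITS 2.0 local HTML5 attributes so that downstream translation and
# terminology tools can pick them up.
import re

def annotate_term_candidates(html_fragment, candidates):
    """Wrap each candidate in a span carrying the ITS 2.0 Terminology
    attributes its-term and its-term-confidence."""
    for term, confidence in candidates:
        pattern = re.compile(r"\b%s\b" % re.escape(term))
        replacement = ('<span its-term="yes" its-term-confidence="%.2f">%s</span>'
                       % (confidence, term))
        html_fragment = pattern.sub(replacement, html_fragment, count=1)
    return html_fragment

source = "<p>Charge the furnace before tapping the pig iron.</p>"
candidates = [("pig iron", 0.87)]   # invented output of a term extractor
print(annotate_term_candidates(source, candidates))
# <p>Charge the furnace before tapping the
# <span its-term="yes" its-term-confidence="0.87">pig iron</span>.</p>
```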

 

Plain text vs. markup environment

As per our motivational anecdote about industry sitting in XML caves, there certainly is a tension between practitioners and researchers using plain text and the ones using markup languages. This tension is being managed through several bilateral and working liaison relationships among Unicode, W3C, OASIS and so on. The most prominent document that could be described as the status quo statement is “Unicode in XML and other Markup Languages.” Although this document has informative status at both Unicode and W3C, normative references can be and are made to it in Unicode standards, W3C recommendations and other bodies’ standards. The chief principle is that wherever plain text uses some sort of control or stateful characters (due to plain text’s one-dimensionality), these should be replaced by markup while crossing the boundary, such as when content is imported from a plain text into a markup environment. Whether the markup mapping needs to be one-to-one depends on whether the higher-level process is supposed to be a round trip, in which case a lossless return to the plain text environment must be possible after the markup processing is over. Bidirectional plain text, for instance, will be transformed into HTML by replacing control characters with directionality markup, which can again be replaced by control characters if the content needs to be transformed back into plain text. On the other hand, if you want to use a plain text based NLP service from a markup environment, you will need to strip or replace the markup and find a way to reintroduce it after processing.
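As a simple illustration of this principle, the following Python sketch handles just one pair of bidirectional control characters, right-to-left embedding (RLE, U+202B) and pop directional formatting (PDF, U+202C), and assumes they are not nested; it is nowhere near a full implementation of the Unicode/W3C guidance.

```python
# Sketch of crossing the plain text/markup boundary for bidirectional text:
# swap one pair of stateful control characters for directionality markup on
# the way in, and restore them on the way out for a lossless round trip.
RLE, PDF = "\u202B", "\u202C"   # Right-to-Left Embedding, Pop Directional Formatting

def to_markup(plain_text):
    """Plain text -> HTML: replace the control characters with markup."""
    return plain_text.replace(RLE, '<span dir="rtl">').replace(PDF, "</span>")

def to_plain_text(html_fragment):
    """HTML -> plain text: restore the control characters."""
    return html_fragment.replace('<span dir="rtl">', RLE).replace("</span>", PDF)

sample = "Price: " + RLE + "١٢٣ ريال" + PDF
assert to_plain_text(to_markup(sample)) == sample   # lossless round trip
```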

Therefore, while integrating language technologies, which are as a rule of thumb plain text based, you will need to make use of middleware solutions that facilitate crossing the plain text/markup boundary. A good example here is the open-source M4Loc project, which facilitates usage of the plain text based Moses framework in localization, where heavily marked-up content is routinely handled. In this case the markup is stripped, and the original locations of the markup are traced throughout the reorderings and replacements of n-grams, so that the places corresponding to the original locations of the markup in the source language can be identified for reintroduction of the markup in the target language translation candidates.
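A deliberately simplified sketch of that strip-and-reinsert idea follows; it is not M4Loc’s actual algorithm. It assumes tokenized input, treats every tag as a standalone token and relies on a word alignment that is assumed to come from the decoder; real systems also have to deal with tag pairs broken apart by reordering, unaligned tokens and much more.

```python
# Simplified sketch: remember which source token each inline tag precedes,
# translate the plain tokens, then use a word alignment to place the tags
# before the corresponding target tokens.
import re

TAG = re.compile(r"<[^>]+>")

def strip_tags(tokens):
    """Return the plain tokens plus (tag, index of the plain token it precedes)."""
    plain, tag_positions = [], []
    for token in tokens:
        if TAG.fullmatch(token):
            tag_positions.append((token, len(plain)))
        else:
            plain.append(token)
    return plain, tag_positions

def reinsert_tags(target_tokens, tag_positions, alignment):
    """alignment maps source token index -> target token index (assumed given,
    e.g. traced through the decoder); insert each tag before its aligned token."""
    result = list(target_tokens)
    # Insert from right to left so earlier insertions do not shift later ones.
    for tag, src_idx in sorted(tag_positions,
                               key=lambda t: alignment.get(t[1], len(result)),
                               reverse=True):
        result.insert(alignment.get(src_idx, len(result)), tag)
    return result

source = ["Click", "<b>", "Save", "</b>", "to", "continue"]
plain, tags = strip_tags(source)                                 # Click Save to continue
target = ["Cliquez", "sur", "Enregistrer", "pour", "continuer"]  # toy MT output
alignment = {0: 0, 1: 2, 2: 3, 3: 4}                             # source index -> target index
print(" ".join(reinsert_tags(target, tags, alignment)))
# Cliquez sur <b> Enregistrer </b> pour continuer
```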

 

Corporate usage of NLP

The whole notion of corporate content has been transformed under the influence of several megatrends that together resulted in an amazing phenomenon termed Loss of Control or even Corporate Spring. As a result, a smart corporate content “owner,” or, better, manager or steward, will give up the illusion of control and will operate on all planes of the corporate content iceberg. The traditional corporate content, such as user interface strings, printed manuals (increasingly printed in the sense of saving as a PDF, rather than actually using printer paper and cartridges), marketing communications, user assistance topics and so on, is indeed only the tip of the iceberg representing all the relevant content that no one can dream of controlling. Instead, the corporate content steward must ensure that the tip of the iceberg that she or he controls and the higher content planes (shallow waters) that she or he can at least influence are in line. She or he can also adequately address what happens within the uncontrollable mass of the information iceberg. The big mass of information cannot be controlled, but it can be charted, probed and monitored for excessively bad or good information.

All that is by now unthinkable without NLP methods. A sentiment mining engine is now a must in every corporate information system setup. Display of content is personalized based on users’ search and purchase histories, and, most importantly, raw MT is increasingly used as initial support content that can later be customized or post-edited based on success rate analysis and sentiment mining results. The issue is that the currently available methods rarely scale to cover the increasingly needed multitude of languages, and in fact solutions must be handpicked or improvised language by language, because the mainstream mass of researchers is interested in deep and supervised learning methods that are applicable to a specific language but usually struggle to cover even similar languages.
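As a toy illustration of that triage loop, the sketch below scores user feedback on machine-translated support articles with a naive lexicon-based stand-in for a real sentiment mining engine and flags the worst performers for human post-editing; the lexicon, threshold and article identifiers are all invented.

```python
# Toy sketch: flag machine-translated support articles for post-editing based
# on a naive lexicon-based sentiment score of user feedback. A real setup
# would use a proper sentiment mining engine and success rate metrics.
import re

POSITIVE = {"helpful", "clear", "solved", "great"}
NEGATIVE = {"useless", "confusing", "wrong", "unreadable"}

def sentiment(comment):
    words = re.findall(r"\w+", comment.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def flag_for_post_editing(feedback_by_article, threshold=-1):
    """feedback_by_article: {article_id: [comments on the raw MT version]}"""
    scored = [(article_id, sum(sentiment(c) for c in comments))
              for article_id, comments in feedback_by_article.items()]
    return sorted([s for s in scored if s[1] <= threshold], key=lambda s: s[1])

feedback = {
    "KB-1001": ["very helpful, solved my issue", "clear steps"],
    "KB-2002": ["the translation is confusing and wrong", "useless"],
}
print(flag_for_post_editing(feedback))   # [('KB-2002', -3)]
```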

The interest in shallow NLP methods that could easily scale to cover dozens or hundreds of languages is relatively new and driven by corporations rather than by academic researchers. I envision that neither deep nor shallow is the one holy grail, just as in the case of rule-based and statistical MT (SMT). Instead, the shallow methodologists and practitioners will within a decade or so come up with a shallow, lowest-common-denominator type of framework that will be widely adopted and adapted as an effective base for hybrid approaches, analogous to a baseline SMT framework such as Moses, which now facilitates a multitude of hybrid though fundamentally SMT-based methods.