Localization for the long tail: Part 1

By David Filip September 26, 2012

This is Part 1 of a series focusing on long tail localization.

Traditionally, volunteerism and professionalism have been perceived as opposites. In present-day collaborative environments, however, this distinction is increasingly blurred. By no means can we define a volunteer as someone who delivers lower quality. Volunteers working for good causes are often motivated to perform at their very best because they are emotionally engaged with those causes and because their work is often highly visible, so they would not dare to perform suboptimally.

The cluster of notions related to the term prosumer plays a prominent role here. Prosumer is a portmanteau word coined by futurologist Alvin Toffler in 1980, originally blending producer and consumer; it is also read as a contraction of proactive consumer, and most often refers to hobbyists who perform basically on par with professionals. In the context of covering the long tail of minority languages, prosumer localizers are proactive, as they do not rely on official institutions or corporations to create specialized terminology. They also more often than not display a high degree of professionalism.

It can be said that the survival of languages under a certain threshold number of speakers depends on the formation of prosumer localization communities. The market forces generated, say, by fewer than a million speakers are not strong enough to fund commercial localization and translation of key technology stacks and bodies of knowledge. The activism of prosumers is driven by highly emotional factors, and activist prosumers are usually very well aware that their effort is the only force that can put their mother tongue into the context of full-fledged cultural languages; in other words, the only force that can prevent its slow and unmanaged death. Localizations of Wikipedia and of core open source or corporate technology stacks (such as the Linux operating system, Microsoft Windows, Gmail, LibreOffice and Firefox) are just a few examples of formational localization projects that have been known to lead to conceptions of modern technology-ready versions of minority languages in emerging regions.

In our industry the volunteerism and prosumer debates have often been grotesquely shrunk into the so-called issue of translation crowdsourcing, and professional translators feel threatened by the price pressure of the “free alternatives” to their professional services. This is, of course, an oversimplification, because building a working machine translation (MT) or crowdsourcing solution is known to require steep investment and, contrary to popular belief, there is an ongoing cost associated with keeping such solutions running. This is also the reason why prosumerism in localization cannot be confined to the area of translation. Volunteerism needs to permeate the whole organizational structure or network to keep minority localizations running, not only in terms of infrastructure coordination but also in the baseline standardization of orthography, terminology, scripts (Unicode), locale specifics (Unicode Common Locale Data Repository, CLDR) and so on.

The topic of prosumerism is prominent in China (both PRC and ROC), both in the context of the translation industry and in general. Comparative studies by Taiwanese scholars of crowdsourced content (in PRC) and professionally translated and edited content (in ROC) offer good lessons here, similar to Miguel A. Jimenez-Crespo’s study of the Spanish Facebook translator community or another study by CNGL’s Magdalena Dombek about the Polish one. Jimenez-Crespo shows that Spanish Facebook has achieved the feeling of an indigenous Spanish page, and we know from public Facebook presentations that this did not come for free. Therefore, it is important for the survival of minority languages that they be represented in localization prosumer communities rather than just translation communities. Shared technology and infrastructure play an important role, and in turn the cost of these can be driven down by standardization: not only language, alphabet and script standardization but also project and resource exchange formats. There are many players in localization crowdsourcing in emerging markets, including Facebook, Twitter, Microsoft, The African Network for Localization (ANLoc), Translators without Borders and The Rosetta Foundation. Nevertheless, it is obvious that the methods that are a must for economically nonviable minority languages can be generalized and replicated with mature language communities. The important ingredient is passion, and as Spanish speakers do not need to be anxious about the future of their language, they will prosumerize based on other passions, such as other sorts of good causes or simply a passion for a product.

The previously mentioned Chinese studies show that the professional and prosumer scenes differ in methods and values, but foremost in their goals. Because a prosumer translator of high tech news content strives to produce a document mirroring the original English source in all its difference, he or she tends to use literal translation as a conscious method for achieving this, and may resort to long and insightful translator notes to convey meaning that is outside of the Chinese context and thus not within the intellectual grasp of the average interested reader. At the other end, the professional publisher strives to create an indigenous experience, as if the news were originally written in Chinese for Chinese people. Compensation is a chief method; culturally unintelligible examples are replaced with local ones, and regionally irrelevant paragraphs are deleted completely. It is important to note that in many minority languages, unlike in Mandarin and other established languages, the compensation option simply may not exist, and a literal translation that blatantly conveys “otherness” is the only option. Large translation projects such as Bible translations, Shakespeare translations and so on have likewise played an extremely important formative role in the development and enrichment of languages. Before those formative projects, the then emerging languages did not have indigenous methods of conveying new imagery and conceptual worlds. The literal translations that resulted enriched the language, much as now-emerging languages are being enriched by Wikipedia translations and major user interface translation projects.

Apart from volunteer translators, we need to look at volunteerism in all of the job descriptions involved. In the not-for-profit setting, any single actor can be an unpaid volunteer who can only dedicate a fraction of a full-time equivalent to the cause in question. This is, however, not solely a nonprofit phenomenon. In the corporate world, many vital functions such as industry practice standardization, professional forums and associations are driven by part-time volunteers. Big corporations do include such work in (rather rare) job descriptions or sponsor full-time chairs, editors and so on. This is, however, not at all the rule or even statistically significant. Much critical standardization work is in fact produced by dedicated volunteers who, despite representing their companies in the forums, work after hours because they want to. Subject matter experts and in-country reviewers in many corporate settings work on review tasks on top of their regular duties, and they need to be presented with the tasks in the most efficient way possible. For all these reasons, the scaling up of localization processes that might at first be considered only for covering the long tail is fully relevant for the mainstream, core processes.

Let us look at the actors involved, based on the requirements that The Rosetta Foundation gathered with several global and regional not-for-profit and corporate players such as Kiva (a US microfinancing NGO), Adobe, ANLoc and so on.

The special roles defined as collaborating actors in the basic use case for social localization in Figure 1 can be performed by overlapping sets of users. Figure 2 shows that the roles are not mutually exclusive, which means that any combination of the roles can be played by a single individual actor. In the context of covering the long tail of languages, any one of them can be a volunteer.

Standardization challenges with emerging languages

Some emerging regions, prominently Africa, now suffer from hyperinflation of the notion of a language. Having its own language is a great weapon in a people's struggle for self-determination. Many small groups of several hundred to several hundred thousand speakers come to assert their rights in the national politics of their countries, coming for the first time out of their economic and political isolation. The local chiefs are rarely able or willing to see the similarities between their dialects and those of the neighboring chief, or the potential benefits of treating them as equivalent, and thus they force the standardization of their own language at the national level. They pick a script and alphabet and leave a PhD candidate with the task of figuring out the written form of their language as part of a linguistic program of a regional university or academy, and they return to their village victorious with their own language established. Thus, countries like Senegal end up with 37 languages for fewer than 13 million total speakers. I am aware that I may sound politically incorrect, but still, let us hope for some sort of consolidation for the sake of the African people themselves. The challenge of covering the long tail will remain formidable even after such consolidations. Still, the small language communities would stand a much better chance of defending their indigenous cultural heritage if they joined standardization and localization forces among close dialects.

It is impossible to draw a clear and uncontroversial boundary between the extensions of the concepts of a language and a dialect. The reasons to consider dialects languages, and languages dialects, are rarely rational or otherwise based on scientific methods or facts. Homogeneous languages spoken by moderate to small numbers of native speakers have been split under political pressure, and artificially imposed script variants have been introduced to support various geopolitical or religious claims. At the opposite end of the spectrum, China and the Islamic world insist that Chinese and Arabic are homogeneous languages, and thus politicians do not grant language status to large dialects spoken by many millions that would be considered languages according to Western understanding. It is, however, well known that the spoken dialects of many Chinese provinces, although officially the same Chinese, are mutually unintelligible and that the identity of mainland Chinese is protected solely by the general intelligibility of the Simplified Chinese characters invariably used to encode phonetically different morphemes of the same sense. The power of standardization driven by written language must not be underestimated.

Leaders of formative localization projects should not underestimate the role of their project in shaping the emerging technology-ready language; they should be actively aware of it and as a result should strive to strike a balance with the official academic or government institution in charge of language matters. This all of course means extra cost. If this cannot be budgeted or otherwise covered or accounted for in collaboration with other stakeholders (such as self-appointed language stewards), there will be explosive issues during and after quality assurance cycles, as the omitted stakeholders will gradually come to see and wonder what has been produced.

Deep vs. shallow methods

The contrast between deep and shallow methods is similar to, and in some sense an extension of, the contrast between rule-based (RBMT) and statistical (SMT) methods in the area of MT. There is no single canonical solution to either of these controversies. Instead, we should think of them as functions of required quality and available resources, their kinds and price tags. From the 1950s through the 1990s, computing resources were scarce and expensive compared to even highly qualified human labor, so the natural MT paradigm was rule-based: the rules were simply invented and encoded by highly qualified and specialized computational linguists. Some of the RBMT engines that had been developed for decades achieved remarkable quality on their pairs and domains. But in the meantime, in clear correlation with the falling price of computing power and growing volumes of parallel corpora produced by translation memory (TM), the competing statistical paradigm gained prevalence. However, the limits of data are now being hit. On several commercially available general engine language pairs, there is not enough data on Earth to buy them a single additional point of BLEU or METEOR by the sheer power of statistics. Thus, hybrid approaches are gaining prevalence, and it can be said that all commercially viable solutions in 2012 are in some sense hybrid; the rules find their way into the largely statistically built engines at various stages of engine development, be it data selection; automatic or semi-supervised inference of syntactic trees or treelets; supervised identification of significant n-grams; introduction of runtime vocabularies; and so on.

MT is just one of many applications of machine learning in the vast research area of natural language processing (NLP). One of the most prominent areas of NLP-based text analytics is sentiment mining and monitoring. Sentiment analytics is probably the most powerful response of corporate players to the loss of control over “their” content. The first step to improving your image is actually to figure out whether it needs improvement and, if there are negative sentiments around, what caused them and whether they can be remedied with anything that still is (more or less) under corporate control, such as feature development. Because of the extremely important role of sentiment mining in corporate decision making, multinationals are often not willing to wait for deep methods to cover the long tail and are increasingly investing in shallow methods for detecting exceptional (negative or positive) sentiments across a vast number of languages. As in many areas, the long-term viable solutions will be hybrid and most probably crowdsourced by passionate prosumers. As discussed previously, long tail language communities that won’t prosumerize will most probably perish, and therefore merely covering them with shallow methods will be of only ephemeral significance.

Business intelligence

It is vital for long-term data hygiene that multilingual content is designed with multidirectional (as in ordered language pairs, not directionality of text) multilingual information flow in mind. Looking at any piece of content inside an organizational context, you should be able to tell whether it was originally developed in the language in which it now appears, whether it is translatable, whether it needs culture-specific or locale-specific treatment, whether it was machine or human translated and possibly post-edited, and whether it passed any quality assurance steps and with what results.
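The questions above can be sketched as a small metadata record attached to each piece of content. This is only a minimal illustration in Python; the class and field names (ContentRecord, TranslationOrigin, QualityCheck and so on) are hypothetical and are not taken from XLIFF, ITS or any other specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class TranslationOrigin(Enum):
    """How the text in its current language came to exist (hypothetical values)."""
    SOURCE = "source"    # authored in this language, not translated
    HUMAN = "human"      # translated by a person
    MACHINE = "machine"  # translated by an MT engine


@dataclass
class QualityCheck:
    """One quality assurance step and its result."""
    step: str            # free-form step name, e.g. "in-country review"
    passed: bool
    notes: str = ""


@dataclass
class ContentRecord:
    """Hypothetical per-item metadata answering the questions in the text."""
    text: str
    language: str                      # language tag of the text as it is now
    source_language: str               # language tag it was authored in
    translatable: bool = True
    needs_locale_treatment: bool = False
    origin: TranslationOrigin = TranslationOrigin.SOURCE
    post_edited: bool = False
    quality_checks: List[QualityCheck] = field(default_factory=list)

    def is_original(self) -> bool:
        """True if the content was developed in its current language."""
        return self.origin is TranslationOrigin.SOURCE

    def language_pair(self) -> Optional[Tuple[str, str]]:
        """Ordered (source, target) pair, or None for original content."""
        if self.is_original():
            return None
        return (self.source_language, self.language)

    def passed_all_checks(self) -> bool:
        """True only if at least one QA step was recorded and none failed."""
        return bool(self.quality_checks) and all(
            check.passed for check in self.quality_checks
        )


# Example: a machine-translated, post-edited string that passed one review.
record = ContentRecord(
    text="Karibu",
    language="sw",
    source_language="en",
    origin=TranslationOrigin.MACHINE,
    post_edited=True,
    quality_checks=[QualityCheck("in-country review", passed=True)],
)
print(record.language_pair(), record.passed_all_checks())
```

Keeping the language pair ordered, as a (source, target) tuple, reflects the point that information flow is multidirectional: content translated from English into Swahili is a different case from content translated from Swahili into English.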

This might sound like a utopian vision, but actually it is not. The key to global business intelligence without a centralized and privileged monolithic hub clearly is the adoption of standardized message formats. There are a number of core standards (such as XML and HTML), specialized vocabularies (such as XLIFF), various web services standards, and other core standards and repositories (such as Unicode and CLDR) that allow for the building of such a permeating, locationless intelligence. This picture is currently being perfected by the development of two major relevant next-generation specifications, OASIS XLIFF 2.0 and W3C ITS 2.0, which are set to produce semantically matching results thanks to the strong and explicit liaison between the two efforts.

Business intelligence must be collected to drive decisions on content creation strategies and workflows. Any piece of information created anywhere within a multinational distributed structure must become the subject of an overall multilingual content intelligence platform. We must qualify that this platform cannot and must not be thought of in terms of a monolithic tool stack as is often found in the translation management system category. It is a natural development of the concept of an open web platform; it is a platform of common interests and matching process-oriented semantics. This full content life cycle must be considered when making strategies for massively multilingual content in any setting.