This is Part 2 of a series focusing on long tail localization.
Traditionally, translation management systems (TMS) have been able to address their customers’ needs effectively by radically simplifying the range of admissible workflow patterns.
This was viable in the past, when each of the relatively small tool producers had to cover the whole localization cycle because it could not effectively rely on standardized interchange with its competitors’ tools. It worked as long as there was at least one major tools provider independent of any single language service provider. However, the comprehensive TMS of the past cannot address the generalized workflow needs of next-generation localization.
In next-generation localization, the traditional bulk localization model is one of many, and one of steadily dropping importance at that. The bulk scenario has kept losing significance even in the corporate buyer sphere where it originated, most importantly due to the sheer efficiency of advanced leveraging methods. Translation memory has essentially been driven to the limits of its usefulness through features such as structural and ID-based guaranteed leveraging, sub-segment lookups and candidate recomposition. Finally, even the bulk that is and will remain necessary to translate from time to time can be addressed better by the generalized methods developed for the new scenarios. After all, the traditional methods used to address bulk needs, based on a unique combination of price/time/quality constraints, are not necessarily up to date.
Generalized chunking
Chunking and reassembling are the localization counterparts of token splitting and merging in general workflow theory. Present-day chunking capabilities in computer-aided translation toolsets and TMS stacks are driven by the bulk localization scenario and are hence outdated. Standardization on structured and modular message formats (such as DITA or XLIFF) is a must for generalized chunking that can address the needs of the long tail and social localization, the chief traits of next-generation localization.
However big the chunk of translation work you are sourcing, you (or your outsourcers) will end up sourcing it in pieces that can be completed by one translator in a reasonable timeframe. The issue is obviously the word reasonable: across industries, projects and cultures, a reasonable timeframe varies wildly. In industrial localization, stakeholders are, as a rule of thumb, happy to wait five business days for completion of a bulky translation job. So it happens that translation jobs requiring five days of work (or less) from a full-time professional translator are usually not subdivided. In professional services, effort is often expressed in man-days, man-weeks, man-months and so on. Let us call a piece of translation work that can be completed by a dedicated, qualified resource in five business days the man-week chunk.
Figure 1 shows how the man-week chunk of work, so common in the full-time industry setting, cannot really be used with volunteers, who can typically dedicate about a fifth of a full-time equivalent. In our example, two simple projects run on the same chunk of work, but the work obviously takes five times longer with a volunteer. Not because the volunteer is not qualified, but simply because she is a volunteer, and however dedicated she is, she has her day job to attend to.
Figure 2 shows how flexible chunking could be used to source the same job from the previous example (subdivided into one man-day chunks) from five volunteers instead of one, bringing the project duration back to scale. However, common industry tools usually do not allow chunking below file level, and if they do, the chunking can rarely happen on the fly, without prior knowledge of the number of chunks and translators, and without error-prone, effort-intensive engineering preparation and manual coordination.
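As a back-of-the-envelope illustration of this arithmetic, here is a minimal sketch that computes how a job must be chunked to fit a target turnaround. The 2,000-word translator day and the function itself are assumptions for illustration, not figures from any real tool.

```python
# Illustrative sketch only: how capacity-based chunking brings a job back to scale.
WORDS_PER_FULL_DAY = 2000          # assumed daily output of a full-time professional translator

def chunking_plan(job_words, target_days, availability=1.0):
    """Return (chunk_size_in_words, translators_needed) to finish within target_days.

    availability is the fraction of a full-time equivalent each resource can
    dedicate: 1.0 for a full-time professional, roughly 0.2 for a volunteer.
    """
    words_per_resource = int(WORDS_PER_FULL_DAY * availability * target_days)
    translators_needed = -(-job_words // words_per_resource)   # ceiling division
    chunk_size = -(-job_words // translators_needed)
    return chunk_size, translators_needed

# The man-week job of Figure 1: 10,000 words, five days, one full-time professional.
print(chunking_plan(10_000, target_days=5))                    # (10000, 1)

# The same job with volunteers at ~20% availability (Figure 2): five man-day-sized chunks.
print(chunking_plan(10_000, target_days=5, availability=0.2))  # (2000, 5)
```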
The generalized chunking we are looking for must be able to present a job of arbitrary size as an implicitly structured and chunkable whole. The chunking must be possible at all logical and hierarchical levels, and it must be doable at any time by business users based on business constraints only. In other words, it must be user transparent. To that end, metadata must flow without obstacles from content authors to content owners, coordinators, managers and translators. People who understand the content should be able to indicate effortlessly at what levels the content can be subdivided and reassembled, if needed, throughout its life cycle. All included or linked resources that are relevant for more than one automatically created chunk must be duplicated or referenced.

These chunking requirements cannot be fulfilled without standardizing on a canonical bitext message format such as XLIFF, and the next incarnation of XLIFF must explicitly address the chunking and reassembling requirements specified above. In contrast, splitting a now common TMS “translation kit” is generally a nontrivial engineering task that is actively avoided by project managers, translators and reviewers unless it is absolutely necessary due to sheer volume and killer time constraints. The current bitext standard (XLIFF 1.2) provides enough structural elements (file, group, trans-unit, seg-source) to support generalized chunking, even though the processing requirements have not been specifically addressed; the real reason generalized chunking is not widely supported is the absence or exceptionality of the need in traditional scenarios.
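To make the XLIFF angle concrete, here is a minimal sketch that splits an XLIFF 1.2 file into chunks of at most N trans-units using only the standard structural elements. The chunk size, the output file naming and the assumption that trans-units sit directly under the body element are illustrative simplifications, not a prescription for how a TMS should implement chunking.

```python
# Sketch: split an XLIFF 1.2 file into chunks that can be routed to different
# translators. Assumes trans-units are direct children of <body>; real content
# may nest them inside <group> elements.
import copy
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)

def split_xliff(path, units_per_chunk):
    tree = ET.parse(path)
    root = tree.getroot()
    body = root.find(f"{{{XLIFF_NS}}}file/{{{XLIFF_NS}}}body")
    units = body.findall(f"{{{XLIFF_NS}}}trans-unit")

    chunk_count = 0
    for start in range(0, len(units), units_per_chunk):
        chunk_root = copy.deepcopy(root)        # keeps <file> attributes and <header> resources
        chunk_body = chunk_root.find(f"{{{XLIFF_NS}}}file/{{{XLIFF_NS}}}body")
        for child in list(chunk_body):
            chunk_body.remove(child)            # empty the copied body...
        for unit in units[start:start + units_per_chunk]:
            chunk_body.append(copy.deepcopy(unit))   # ...and refill it with this chunk's units
        ET.ElementTree(chunk_root).write(
            f"chunk_{chunk_count:03d}.xlf", encoding="utf-8", xml_declaration=True)
        chunk_count += 1
    return chunk_count
```

Reassembly is the mirror operation: collect the returned chunks, match trans-units by their id attributes and write the targets back into the original file.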
Figure 3 shows the high-level processes of chunking and reassembling in a multilingual translation workflow; they are quite clearly the same for the traditional bulk scenario and for the next-generation, long tail coverage scenario. The profound difference lies in how frequently one-person chunks need to be created by actually subdividing the work presented, and in the resulting functional requirements. Where the need for actual subdivision is rare enough, the workflow can and should be instantiated by manual splitting and coordination; this is, however, not the case if you want to successfully address the long tail.
Generalized workflows of next-generation localization that are capable of supporting the long tail (and hence volunteerism) must be able to chunk at runtime into a previously unknown number of sub-job tokens (chunks). The reassembly step must be able to merge all the transformed chunks back, and moreover it must degrade gracefully if a human decision maker decides not to wait for all sub-job tokens. In fact, not waiting, with protracted and partial merging including preliminary publishing, should become the main success scenario.
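A minimal sketch of such graceful reassembly might look as follows; the segment and chunk data structures are hypothetical simplifications, not the format of any particular tool.

```python
# Sketch of graceful reassembly: merge whatever translated chunks have come back
# by the publishing deadline and fall back to the source text (or raw MT) for the
# rest, so preliminary publication never blocks on the slowest chunk.

def reassemble(segments, completed, fallback=lambda seg: seg["source"]):
    """segments: ordered list of {"id", "source"}; completed: {segment id -> target text}."""
    merged, missing = [], []
    for seg in segments:
        if seg["id"] in completed:
            merged.append(completed[seg["id"]])
        else:
            merged.append(fallback(seg))   # graceful degradation
            missing.append(seg["id"])
    return merged, missing                 # missing ids drive the follow-up (protracted) merge

segments = [{"id": "s1", "source": "Hello"}, {"id": "s2", "source": "World"}]
done = {"s1": "Hallo"}
text, todo = reassemble(segments, done)
print(text)   # ['Hallo', 'World'] -- publish preliminarily, re-merge when s2 arrives
print(todo)   # ['s2']
```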
Multisource setting
Simship was once revolutionary, but it is no longer a practical idea. Let me qualify this. Simship is still great if you are a big software publisher whose development happens in one source language only. But several trends play against the general usefulness of this scenario.
Content permeates software. Software tends to be tied to a specific technical platform, and apps (client web applications, increasingly mobile based, with small scope and very specifically targeted functionality) are needed to consume content effectively. The content/user interface boundary tends to blur on content platforms, particularly content management systems (CMS), as democratization on the web drives another convergence, that of CMS and wiki. Clearly, the multilingualism of social networking is not resolved by crowdsourcing the user interface.
Innovation no longer happens in Silicon Valley, as Sarah Lacy revealed in her book Brilliant, Crazy, Cocky: How the Top 1% of Entrepreneurs Profit from Global Chaos. Market growth and innovation happen instead in the emerging markets, and the old-fashioned US-based multinationals’ teach-them-all-English solution does not cut it any longer. There are some notorious paradoxes of innovation; in particular, it is well known that there are limits to the innovation that can be bought by throwing money at research and development centers. True radical innovation is not just developing cool new technology without a particular purpose, but combining it with new business models and previously unknown delivery methods. This kind of innovation more often than not originates where the rubber hits the road, which is rarely in the Silicon Valley headquarters. Therefore it is critical for multinationals that the knowledge flows in their organizations be multidirectional and even multidimensional. Any center that is not capable of processing innovation originating in the periphery will degrade. The Western world is no longer the center, but if it does not ignore the developments in the new centers (once peripheries) and in the peripheries, it stands a chance of remaining globally influential.
Service providers with limited or no intellectual property (IP) portfolios often entrench themselves in rampant incrementalism. A perfect example of an IP-poor startup that refused to settle for rampant incrementalism is Google. Despite the colossal success of its core search service, the company built an innovation machine that takes frontline ideas rapidly to prototype, labs and beta stages, with monetization instruments ready to kick in as soon as they are mature enough. Google recently proved to be the big challenger in the mobile land grab, as it bet everything on the purchase of Motorola (with its vast patent portfolio) to protect the Android platform and ecosystem. The Android ecosystem seems a good example of a future-proof platform that is capable of assimilating innovation coming from multiple centers and peripheries. Paraphrasing Common Sense Advisory, we are leaving the US zone. As a result, Western-based multinationals need to face the challenge of how to let their indigenous resources contribute to their success, or slowly perish.
Multilingual content management
In any multilingual content management scenario, you need to consider the options of parallel content development, direct translation, or translation over a pivot language. In practice, going over a pivot works as a sequence of two direct translations; the pivot-language result might be relevant on its own, or just a perishable by-product behind the scenes.
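In code terms, a pivot translation is simply the composition of two direct translations, as the short sketch below shows; the engine functions are placeholders, not real APIs.

```python
# Sketch only: pivoting is the composition of two direct translations.
# The callables passed in stand for whatever direct MT engines or human steps
# are actually available; they are hypothetical placeholders.

def pivot_translate(text, source_to_pivot, pivot_to_target, keep_pivot=False):
    pivot_text = source_to_pivot(text)           # e.g. Dutch -> English
    target_text = pivot_to_target(pivot_text)    # e.g. English -> Czech
    # The pivot result may be valuable on its own, or just a perishable by-product.
    return (target_text, pivot_text) if keep_pivot else target_text
```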
In the following, we use the terminology of central and peripheral languages coined by Abram de Swaan in his book Words of the World (2001), as it is influential and illustrative, although we do not agree with many of his specific classifications of languages, partly because he made them more than ten years ago. Looking at web statistics, we can agree that English is still the only hyper-central language, but Chinese will reach comparable status in less than a decade, judging by the run for resources in the developing world that the “rich” developed world is losing, ridden as it is by prolonged recession and an unmanageable public debt crisis. We will use the terms super-central, central and peripheral merely to describe some sort of hierarchy (with some significant horizontal relationships) without paying too much attention to de Swaan’s actual classification. Figure 4 illustrates the traditional pivot language use. In de Swaan’s terms, German does not play a sufficiently central role to reach the majority of target market languages via direct translation, so German-based multinationals need to address their communication and translation needs through the hyper-central language as a pivot.
With the increasingly prominent role of Germany within Europe and the weakening economic role of the United States, German gains centrality by building direct ties with advanced Asian economies, so that in the future we should see more direct translation between German and Asian languages such as Japanese and Chinese.
In the marketing campaigns of a leading UK car brand, Austrian marketers use the cultural closeness of the Czech market and customize Austrian copy, originally written in German (based on UK masters), to reach the Czech market (Figure 5). The cultural distance is smaller, and hence the cost and effort lower, than recreating the copy directly from English. Also, due to the relative centrality of German, there is a critical mass of Czech professional resources that can serve this non-mainstream language pair.
Japanese multinationals often run resource and development centers in Taiwan. Due to the super-central roles of both Japanese and Mandarin, there are enough qualified resources to fully localize the product and documentation into Japanese. Part of the information flow may go straight to English, but the bulk of the content is actually double pivoted (Figure 6). Content is decorated with locale filtering metadata either at the stage of development in Chinese or at the stage of translation into Japanese. The point is that no language occurring beyond English in the information flow will be localized “fully” in the Japanese sense. Admin and error messages will remain in English in the rest of the locales, and admin documentation will be filtered similarly.
The above are more or less established scenarios, and they are not the weird exceptions they might seem to a veteran Western-based practitioner. Next-generation localization is instead developing toward a fully generalized information flow that will support content comparison, filtering and flows over a generalized hierarchy of hyper-central, super-central, central and peripheral languages. This hierarchy may seem to add unnecessary complexity, but in fact it does not. The role of the hierarchy is to prevent the combinatorial explosion of language pairs that language technologies must cover each time a new language enters the game.
MT and multiple languages
Whenever a new language is added to a set of n languages, 2n new possible language pairs appear, so the total grows as n(n-1). This means 9,900 possible language pairs at 100 languages, 39,800 at 200 and 999,000 at 1,000. This is not insignificant, as Microsoft currently localizes its operating system into more than 200 languages and Wikipedia is past 1,000.
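The arithmetic behind these figures is easy to verify:

```python
# With n languages there are n * (n - 1) ordered (directed) language pairs;
# adding one more language adds 2 * n new pairs.
def pairs(n):
    return n * (n - 1)

for n in (100, 200, 1000):
    print(n, pairs(n))    # 100 9900, 200 39800, 1000 999000
```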
Pivoting is a well-known practice in machine translation (MT). MT providers do not tell the end user which language pairs their engines actually cover, unless they are proudly boasting about adding a non-mainstream pair. Instead, they offer to perform translation from Dutch to Czech, tacitly going over the English pivot behind the scenes. There is no need to stress that the use of pivots increases the likelihood of errors. Therefore, having a fine-tuned multilingual information flow model will be a major competitive advantage in the future, and failure to produce such a model will lead to significant losses and to sudden deaths of seemingly healthy service providers. Such a model should automatically result from long-term usage of a sufficiently open (in both the technical and the business sense) and generalized translation platform, as it must take into account the existence of parallel corpora and qualified human resources based on hierarchical and horizontal relations among many languages. Such a model and system will be characterized by the possibility to request content from any language node of a CMS, and it will be able to actively facilitate and determine viable scenarios, optimizing the route in terms of the number of pivots and the availability of reliable natural language processing and MT solutions.
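Such route optimization can be sketched as a shortest-path search over the graph of directly available capabilities (MT engines, parallel corpora, qualified human resources); the language graph below is invented purely for illustration.

```python
# Illustrative sketch: choose a translation route with the fewest pivots over an
# invented graph of directly available capabilities.
from collections import deque

direct = {                      # directed edges: source -> directly reachable languages
    "nl": ["en", "de"],
    "de": ["cs", "ja"],
    "en": ["ja", "zh"],
    "ja": ["zh"],
}

def best_route(src, tgt):
    """Breadth-first search: the route with the fewest intermediate pivots, or None."""
    queue, seen = deque([[src]]), {src}
    while queue:
        route = queue.popleft()
        if route[-1] == tgt:
            return route
        for nxt in direct.get(route[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(route + [nxt])
    return None

print(best_route("nl", "cs"))   # ['nl', 'de', 'cs'] -- a regional German pivot, no English needed
print(best_route("nl", "zh"))   # ['nl', 'en', 'zh'] -- English as the single pivot
```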
Figure 7 shows a possible path of a piece of content originating in one peripheral language (marked blue in the diagram) to all other languages included in the simplified model. It is worth noting that the akin peripheral language is reached directly (similar to the Austrian-Czech scenario), whereas unrelated peripheral languages are reached via a varying number of pivots of various degrees of centrality (similar to the Japanese multinational example).
As in the previous diagram, the Figure 8 workflow is kicked off by the production of a valuable piece of knowledge in a peripheral language. The diagram shows the mutual interactions between a largely automated professional support organization and a community support forum. The decision to publish raw MT, commission human translation (in the absence of a suitable MT engine) or post-edit is made largely automatically by a multilingual web analytics engine, which escalates the decision to a human decision maker in case of low self-reported confidence. The publication workflow instances approved by machine or human lead to publishing in new languages. The impact of the content in the new languages is in turn analyzed and can trigger continuation of the workflow.
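The confidence gate just described can be sketched as a simple decision function; the threshold values and action labels are assumptions for illustration, not values from any real analytics engine.

```python
# Sketch of the publication gate described above. The 0.85 threshold and the
# action labels are illustrative assumptions.

def publication_decision(mt_confidence, mt_available, threshold=0.85):
    """Decide what to do with a candidate piece of content in a new language."""
    if not mt_available:
        return "route to human translation"
    if mt_confidence >= threshold:
        return "publish raw MT automatically"
    if mt_confidence >= threshold / 2:
        return "route to post-editing, then publish"
    return "escalate to a human decision maker"

print(publication_decision(0.92, mt_available=True))   # publish raw MT automatically
print(publication_decision(0.50, mt_available=True))   # route to post-editing, then publish
print(publication_decision(0.30, mt_available=True))   # escalate to a human decision maker
```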
Takeaways for corporate and LSP practitioners
Be prepared for the need for generalized chunking. Even if you are prepared and willing to pay for the translation work, it might be tough or impossible to find suitable human resources working in long tail language pairs who could dedicate a full-time equivalent at either short or long notice. If you cannot chunk in a smart (user transparent), standards-based way, you must budget extra engineering resources (and time) for chunking and reassembling, or prepare the external or internal customer for a very long wait.
No matter what the present role of a new locale, take time to design its connection with other locales in terms of business process, metadata and exchange formats. Try to set up the information flows as at least bidirectional up front, as retrofitting bidirectionality later will cost you more.
Do not rely on English as the sole pivot. Regional solutions among akin languages might be significantly cheaper. Try to future-proof your workflow technology so that you are not forced into a single pivot by hardwired automation. This means not relying on specialized (and hence limited) TMS workflow engines.