The localization standards ecosystem

Real working standards, open or not, must be driven by representative industry consensus. In fact, industry representativeness is one of the main competitive characteristics of standards bodies (consortia) in general and in the localization industry in particular.
Let us start with the assumption that openness of standards is an intrinsically positive property and that the localization industry would do well to keep its standards open rather than proprietary. In order to argue for open standards, however, we must first explain the conceptual difference between open standards and proprietary standards, and also between open standards and open source, an important distinction that is commonly confused.
As seen in Figure 1, closed source solutions can still be implementations of open standards. Interoperability is truly barred where a closed source solution is at the same time proprietary — fully ad hoc or based on a proprietary standard. The best case from the point of view of interoperability is clearly an open source implementation of open standards; perfect examples of this are the International Components for Unicode (ICU) and the Okapi Framework. To make the whole difference entirely clear, we can say that the opposite of open in open source is closed, whereas the opposite of open in open standards is proprietary.
In order to explain the fine distinction between open and proprietary standards, two essential characteristics of open standards need to be examined: transparency and guaranteed royalty-free use (Figure 2).
Although there is unfortunately no general consensus on which standards can be called open, all current accounts do agree that open standards must have been created via a transparent formal process. The requirement of formality for a standard’s transparency might not seem obvious; still, any process that needs to be publicly verified as transparent had better be formal. We speak about transparency of the technical committee (TC) process. The process is usually codified on the consortium level, but various technical committees and types of standardization groups within a single consortium might differ meaningfully in their processes, which may be largely driven by their subject matter, objectives, scope and so on. All of them, however, need to create standards through representative consensus, and that consensus, in turn, needs to be driven by a transparent formal process. There are several implementation characteristics that should help you recognize whether a body’s TC process is transparent. At the very basic level, mailing lists (discussion groups, forums) and their archives, proceedings of meetings, requirements gathering, the review process and bug tracking should all be publicly visible — open to general input and specific feedback from the nonmember public. Work in progress should be marked by open and well-documented discussions, ideally with the help of collaborative authoring tools with “blame” assignment and an audit trail. Formal standard approval must be underpinned by verifiable (possibly open source) implementations. Fair quorum and ballot rules must be ensured at all levels. Broad and varied industry representation should be the result of the above.
Intellectual property rights (IPR) modes and policies are in fact the source of disagreement on what standards can actually be called open. All competing definitions would agree on the above-specified criterion of transparency; there are, however, at least three distinct levels of IPR liberty that can possibly be required to call a standard open. Therefore, I would say that the openness of standards comes in three levels: the weakest, which allows for (F)RAND licensing of essential patents (ISO, Unicode, some OASIS committees); relatively strong, which enforces royalty-free (RF) access to essential patents (W3C, some OASIS committees, including the XLIFF TC); and the strongest, which additionally requires an open source reference implementation. Clearly, RF is a prerequisite for the strongest level, as open source cannot deal with code making use of paid patents.

One standard or many?
Every now and again, people from various corners of the industry propose the intriguing idea of one superstandard, or the related one of a super-standardization body. I want to discuss the ideas of a superstandard and a super-body on a conceptual level, as it is hard to trace the idea to a single source, and it may well be that nobody champions it wholeheartedly enough to want to take credit.
A superstandard naturally sounds great at first. It seems to be the ultimate achievement in standardization. After all, standardization is about doing things in a uniform way and overcoming proprietary differences. However, as always, the devil is in the details; standards must be adopted and implemented to be true standards. The smartest specification cannot reasonably be called a standard if it is not being used throughout its target market. Standardization bodies sometimes inadvertently blur this difference. Such blurring is always connected with a standardization body failing or missing its goal. A published specification will fail if the producing technical committee was not representative of or otherwise significant for the targeted market, or if the reference implementations are feeble or extremely limited in number. Even important standards will collapse if the publishing standardization body fails to manage the specification’s later life cycle; failure to maintain once-influential specifications leads to a slow, unmanaged death of the standard.
Hence, a standard must address issues in a sufficiently discrete area so that it can be not only developed, but also implemented, maintained and updated.
As one of the main objectives of technical standards is interoperability and no area of standardization can be conceived as an isolated one, no true standard should aspire to become a superstandard. Where one standard starts assuming functions outside of its scope, it begins to violate not only the realm of its neighboring standards but, more importantly, the very constitutive principles of true open standards creation. Since it is impossible to gather in one TC all the expertise needed to produce a working superstandard, any aspiration to a superstandard would sooner or later lead to deficiencies. In other words, a super-specification cannot ever make a working standard; a superstandard is conceptually impossible. Therefore, setting out the proper scope — at the time of a TC’s creation, charter clarification or rechartering — is the first gating factor of standardization success.
It is no doubt tempting for standard developers and implementers to extend their own provisions into other standardization areas. Nonetheless, whenever they do this, they act in a proprietary way.
In more unified industries, the idea of one superstandard might sound less naïve than in our localization industry. Localization is a relatively new industry and academic topic, but it has always been tightly connected with translation and internationalization, since it is a service area mitigating the clash between human language and IT. There are thousands of languages on the planet that can be organized into many dozens of groups from different points of view, and Unicode is the formidable effort that currently does not miss its target of ensuring that the scripts needed to write all human natural languages are supported in IT infrastructures — that is, in IT infrastructures that support the Unicode 6.1 standard. Unicode 6.1 was released on January 31, 2012, and although a minor version, it added 732 new characters to the major revision published as 6.0 in October 2010. Seven new scripts were added, supporting additional languages in Asia and Africa.

Unicode
Unicode is of utmost importance for internationalization, the upstream activities that make the localizer’s life easier. The most explicit internationalization-related Unicode project is the Unicode Common Locale Data Repository (CLDR), which is not a standard in the classical sense, but rather, as the name suggests, a repository that is constantly updated and released on a rolling basis following its data release process. CLDR is a standard repository of internationalization building blocks, such as date, time and currency formats, sorting rules and so on. To illustrate the importance of Unicode and CLDR for upstream internationalization, it is good to mention ICU, an IBM-driven open source project and the most prominent reference implementation of both Unicode and CLDR. Spectacularly, ICU is the common denominator of both the Android and iPhone operating systems.
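To make the role of CLDR data concrete, here is a minimal sketch, assuming the ICU4J library (com.ibm.icu) is on the classpath. It formats the same date and monetary amount for three locales; the differing output comes straight from the CLDR locale data shipped with ICU.

```java
import com.ibm.icu.text.DateFormat;
import com.ibm.icu.text.NumberFormat;
import com.ibm.icu.util.ULocale;
import java.util.Date;

// Minimal illustration: ICU4J draws its date and currency patterns from CLDR data.
public class CldrFormattingSketch {
    public static void main(String[] args) {
        Date now = new Date();
        ULocale[] locales = { new ULocale("en_US"), new ULocale("de_DE"), new ULocale("ja_JP") };
        for (ULocale locale : locales) {
            String date = DateFormat.getDateInstance(DateFormat.FULL, locale).format(now);
            String amount = NumberFormat.getCurrencyInstance(locale).format(1234.56);
            System.out.println(locale + ": " + date + " | " + amount);
        }
    }
}
```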
It was only very recently (May 2011) that the Unicode Consortium formed a technical committee explicitly dedicated to localization interoperability, the Unicode Localization Interoperability (ULI) technical committee. ULI is at first glance a formidable buy-side-driven initiative, as its convenor is IBM’s Helena Shih Chapman and her co-chairs are Kevin Lenzo from Apple along with Uwe Stahlschmidt from Microsoft’s Windows team. The ULI vision is a pragmatic one driven by the personal shrewdness of its founding chair: looking at real-life friction points and fixing them in an effective yet robust way, as opposed to the quick and dirty fixes that have been all too common in our industry.
ULI applies itself first to the cluster of issues connected with segmentation. Segmentation is generally governed by UAX #29 (Unicode Standard Annex #29, Unicode Text Segmentation), and the Localization Industry Standards Association (LISA) once made an attempt to address this issue with SRX (Segmentation Rules eXchange). SRX’s relationship to UAX #29 is not a transparent one. SRX can be considered incomplete from an engineering point of view; however, its proclaimed goal was not to provide a set of segmentation rules for a number of languages, but rather to provide a mechanism to exchange such rules in order to improve Translation Memory eXchange (TMX) interoperability. TMX often fails to guarantee its proclaimed lossless transfer of translation memory (TM) data due to segmentation differences. Whoever revives work on SRX will need to face the new ULI developments.
ULI’s first practically achievable goal is to update the UAX #29 rules with real-life production data. The ULI message to SRX developers is that localization should take UAX #29 and its ICU implementation for granted, and use SRX only to exchange rules that differ from the standard UAX #29 behavior. Connected to the UAX #29 normalization is the ULI-driven proposal for a new character pair (segment separator and joiner). This new pair of characters should only be legal in a plain text environment and should facilitate pipelined execution of UAX #29 rules. After ULI fulfills its UAX #29 mandate, it should apply itself to one or more of the following localization interoperability topics. First, solid word counting (scoping) based on unambiguous character and word delimiting rules, which will depend on the status of GMX-V under ETSI ISG LIS, an industry specification group that oversees the former LISA OSCAR standards portfolio. Second, authoring and TM standardization, which will depend on the status of TMX under ETSI ISG LIS and the status of memory exchange within XLIFF 2.0. Third, XLIFF profiling for interoperability, especially with respect to segmentation and memory exchange. Fourth, lemmatization for CLDR and terminology exchange; lemmatization is language specific and thus a huge undertaking, so ULI might end up only scoping the project and preparing the groundwork for a successor technical committee. This ordering seems natural, since an unambiguous and reliable UAX #29 is to some extent a prerequisite of all of the above.
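As an illustration of what taking UAX #29 and its ICU implementation for granted means in practice, here is a minimal sketch, again assuming ICU4J, that applies the default UAX #29 sentence boundary rules. Depending on the ICU version and locale data, the default rules may over-segment at abbreviations such as “Dr.”, which is exactly the kind of real-life friction point ULI set out to fix with production data.

```java
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

// Sentence segmentation with ICU's default UAX #29 rules.
public class Uax29SentenceSketch {
    public static void main(String[] args) {
        String text = "Dr. Smith arrived at the meeting. It had already started. Was he too late?";
        BreakIterator boundaries = BreakIterator.getSentenceInstance(ULocale.ENGLISH);
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            // Each bracketed segment printed here is one sentence according to the default rules.
            System.out.println("[" + text.substring(start, end).trim() + "]");
        }
    }
}
```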

W3C
The World Wide Web Consortium (W3C) owns many key standards, including XML and HTML. HTML 5 is clearly the coolest standardization work currently going on, and probably the only standardization activity that makes it into the general media.
Since the web used to be largely a Western (or “first world”) phenomenon, W3C standards have suffered from all sorts of internationalization deficiencies, and this was the reason to set up the W3C Internationalization Activity in 2006. As the world wide web still has a lot of multilingualism issues to address, the original Internationalization Core Working Group Charter period has been extended three times so far, and its current mandate will expire by the end of 2013, having been prolonged for the third time by virtue of the MultilingualWeb-LT Charter approval. The Working Group and Activity continue to play a critical role in ensuring the internationalization readiness of other core W3C standards such as HTML 5, XML, CSS and so on.
The W3C Internationalization Tag Set (ITS) is a mechanism mainly for providing XML content with metadata that facilitates localization or cultural adaptation. The most important data categories include the translate flag, term identification mark-up and directionality information. The specification is currently maintained by the ITS Interest Group, which is, however, not mandated to produce a new normative version. New development should be expected from the MultilingualWeb-LT Working Group, which has the mandate to produce an ITS successor.
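As a minimal sketch of the translate flag data category, the hypothetical snippet below embeds a small XML fragment that marks one element as non-translatable with the local its:translate attribute (the namespace shown is the one published with ITS 1.0) and then lists the flagged elements with a plain namespace-aware DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Reading the ITS translate flag from XML content with a namespace-aware DOM parser.
public class ItsTranslateFlagSketch {
    private static final String ITS_NS = "http://www.w3.org/2005/11/its";

    public static void main(String[] args) throws Exception {
        // Hypothetical sample content: the <code> element must not be translated.
        String xml = "<doc xmlns:its='" + ITS_NS + "'>"
                + "<p>Press <code its:translate='no'>Ctrl+S</code> to save your work.</p>"
                + "</doc>";

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList elements = doc.getElementsByTagName("*");
        for (int i = 0; i < elements.getLength(); i++) {
            Element element = (Element) elements.item(i);
            if (element.hasAttributeNS(ITS_NS, "translate")) {
                System.out.println(element.getLocalName() + " -> its:translate="
                        + element.getAttributeNS(ITS_NS, "translate"));
            }
        }
    }
}
```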
DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz) and the Centre for Next Generation Localisation, along with other academic and industry partners, joined forces to forge a strong, representative European Commission (EC)-funded consortium with the goal of developing metadata categories that facilitate interoperability across domains (content management and localization) and between web layers — deep web and surface web. The EC-funded seed group has formed an open W3C working group called MultilingualWeb-LT that has in the meantime attracted the attention of many more W3C members. MultilingualWeb-LT has a chartered external dependency on XLIFF at the Organization for the Advancement of Structured Information Standards (OASIS) and from the start has had a strong working relationship with the XLIFF TC, although a formal liaison is yet to be formed. It is critical that the two groups work in sync to ensure semantic and functional matches among XML, HTML 5 and XLIFF 2.0 internationalization and localization metadata. On top of ITS 1.0, MultilingualWeb-LT will provide a normative recommendation for the implementation of ITS and other data categories in XML and HTML 5. The group concentrates on three main success scenarios and the exchange of ITS metadata categories in those scenarios: deep web content (such as DITA or DocBook) exchanging internationalization metadata with a generalized translation management system; surface web content (the generated HTML 5) exchanging internationalization metadata with a real-time machine translation (MT) service; and deep web content exchanging internationalization metadata with statistical MT (SMT) engine training.
The common denominator is clearly internationalization metadata that spans content creation and management on the one hand and bitext transformation standards on the other. In all cases, the vehicle in the bitext transformation area is supposed to be XLIFF.

OASIS XLIFF
Bitext is text in two natural languages organized as a succession of ordered and aligned source and target pairs. Such a structure is key for localization transformations; thus, bitext standardization is the key to localization interoperability. XLIFF is the state-of-the-art bitext standard format, superior in many respects (legal and technical) to proprietary and legacy bitext formats such as “unclean” RTF, TTX and gettext PO.
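To make the bitext structure tangible, here is a minimal hand-written XLIFF 1.2 sketch, not taken from any particular tool, with a single trans-unit pairing an aligned source and target, read back with a plain DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// A minimal XLIFF 1.2 bitext: each <trans-unit> pairs one source segment with its aligned target.
public class XliffBitextSketch {
    private static final String XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2";

    public static void main(String[] args) throws Exception {
        String xliff = "<?xml version='1.0' encoding='UTF-8'?>"
                + "<xliff version='1.2' xmlns='" + XLIFF_NS + "'>"
                + "  <file original='hello.properties' datatype='plaintext'"
                + "        source-language='en' target-language='de'>"
                + "    <body>"
                + "      <trans-unit id='1'>"
                + "        <source>Hello, world!</source>"
                + "        <target>Hallo, Welt!</target>"
                + "      </trans-unit>"
                + "    </body>"
                + "  </file>"
                + "</xliff>";

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xliff.getBytes(StandardCharsets.UTF_8)));

        NodeList units = doc.getElementsByTagNameNS(XLIFF_NS, "trans-unit");
        for (int i = 0; i < units.getLength(); i++) {
            Element unit = (Element) units.item(i);
            String source = unit.getElementsByTagNameNS(XLIFF_NS, "source").item(0).getTextContent();
            String target = unit.getElementsByTagNameNS(XLIFF_NS, "target").item(0).getTextContent();
            System.out.println(source + " => " + target);
        }
    }
}
```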
The current version, XLIFF 1.2, was published as an OASIS standard in 2008, and ever since then the technical committee has been thinking of a brave new major release numbered 2.0. XLIFF 1.2’s publication date, February 1, 2008, coincides in a significant way with an important acquisition that was publicly disclosed only ten days later: SDL’s acquisition of Idiom Technologies. This is significant for more than one reason. With this acquisition SDL proved to industry buyers, finally and beyond any reasonable doubt, that they cannot afford to rely on a de facto standard driven by a single technology supplier, even one independent of any services company. That dream should have been shattered already in 2005, when SDL acquired Trados, by that time the de facto standard computer-aided translation (CAT) workbench, along with its two associated (but not fully mutually compatible) bitext formats, the “unclean” RTF and the XML-based TTX.
Nevertheless, it seems that buy-side and toolmaker interest in XLIFF and its future incarnation has been steadily growing since the Idiom acquisition. It is natural that, after publishing a standard, a TC enters a prolonged calm period. For XLIFF, the calm period lasted until approximately 2010.
In 2010, the XLIFF TC started holding annual international symposia. The first symposium took place as a preconference activity of the Localisation Research Centre (LRC) XV annual conference in Limerick, where Oracle’s Niall Murphy explained how XLIFF is critical for the unification of localization processes throughout Oracle’s numerous acquisitions. One of the highlights of the Warsaw Symposium in September 2011 was the presentation by Uwe Stahlschmidt and Kevin O’Donnell of Microsoft, explaining how XLIFF plays a crucial role in their vision of operating system localization in which service providers freely choose tools that all fulfill descriptive (rather than prescriptive) engineering quality requirements, and how they are actually using the current version of the standard for a Windows release localization. Both symposia facilitated meaningful discussion between the TC and its customer base.
At the time of this paper’s publication, XLIFF 2.0 development is being driven by approximately 21 voting members. Voting members means people who regularly work on the standard; most standardization committees have far more inactive members than active ones. Upwards of 20 people actively working on a single standard’s development is significant momentum for any standardization area, not just our localization industry. Moreover, this small crowd is currently composed of a nicely representative mix, as you can see in Figure 4.
Currently, about a third of the committee represents large enterprise buyers (SAP, IBM, Oracle), and this sound proportion seems sustainable given our voting membership growth pipeline. Microsoft joined in late February and will soon have voting rights. Tool vendors are represented by just under a quarter of the committee (SDL, MultiCorpora, Maxprograms, ENLASO, Lionbridge). Service providers, with less than 10%, might seem underrepresented at first glance (SDL, Lionbridge, ENLASO); however, the associations represented on the committee in turn represent their service-side membership (the Polish Association of Translation Agencies and GALA more so than TAUS). Hence, it can be said that the service side is represented by about a fifth of the committee. Individuals and academics add just the right amount of independence to the mix with their 19% combined.
Apart from the technical development of the next generation standard, the TC does other things, such as nurturing liaison memberships and exploring the state of the art of XLIFF implementations.

ISO TC 37
ISO TC 37 is a co-host of two ex-OSCAR standards, and it is also looking into co-publishing the current XLIFF 1.2 and the upcoming XLIFF 2.0 with OASIS. OASIS has a special privileged relationship with ISO that allows fast-track co-publishing. ISO is important as a dissemination channel for many standardization areas. ISO adoption effectively leads to government enforcement of standards through ISO’s tight relationship with national standardization bodies such as ANSI in the United States or DIN in Germany, to name but a few.

Recent developments
Open standards have a long tradition in our industry. LISA was set up as early as 1990. In terms of IT standardization, this is ancient, to use the same word as Scott McGrath, COO of OASIS, the host of XLIFF. The XLIFF TC is the second oldest localization standardization body, officially set up under OASIS in 2001/2002 after a period of legal clarifications.
Unfortunately, as we all know, LISA died in February/March 2011, and we may say that it died after a protracted illness, or a long-standing failure to address its proclaimed goals. LISA produced its standards via its special interest group OSCAR. Now, what is the current situation of the OSCAR portfolio, and how heavy a blow was this death for the industry? I am going to argue that the blow was in fact not too heavy. One of the reasons is that XLIFF, the pivotal standard of the localization industry, has, in early 2012, significant momentum toward version 2.0 and excellent alliance-forming potential with other relevant standardization activities within W3C, Unicode and so on.
A snapshot of the OSCAR standards portfolio, as it was at the point when LISA died, was released under a Creative Commons License and is currently hosted on at least two reasonably independent servers: Alan Melby’s (www.ttt.org/oscarstandards) and the GALA Standards Initiative’s (www.gala-global.org/lisa-oscar-standards). However, the names of the standards, their logos and so on were transferred as separate intellectual property to a standardization body that LISA management selected after the insolvency was declared. The chosen standardization body is the European Telecommunications Standards Institute (ETSI), one of the world’s most influential producers of world-class (yet proprietary) telecommunications standards.
Specifically, the chosen standardization body is ETSI’s industry specification group for Localisation Industry Standards (ISG LIS). This basically means that although the OSCAR portfolio was technically released under a very liberal Creative Commons License, official successor versions of the standards can only be created by ETSI ISG LIS, or by someone who has concluded an agreement with ETSI about the specific intellectual property items (such as names and logos). It should be said that in September 2011, ETSI initiated the inclusion of ISG LIS and its standards in the Memorandum of Understanding (MOU) that has existed between ETSI and OASIS since 2007. This memorandum was expanded on April 20, 2011, to cover electronic signatures, emergency management and other areas. The Areas Mapping (an annex to the MOU) was further expanded in September 2011 to cover localization standards as well; explicitly mentioned are the OAXAL (Open Architecture for XML Authoring and Localization) Reference Model and XLIFF TCs for OASIS and ISG LIS for ETSI.
Two of the more important LISA OSCAR standards had become effectively co-owned by ISO TC 37 before LISA’s demise: LISA achieved co-publication of TBX 2.0 with ISO in 2007/2008 as ISO 30042:2008, and SRX is under development in ISO TC 37/SC 4 as ISO/CD 24621, although SRX has not been co-published with ISO so far. I assume that ETSI ISG LIS will be able to continue LISA’s collaboration on ISO/CD 24621, as the latest version of this item is dated after LISA died but before the initial ETSI ISG LIS meeting was held.
As the oldest localization standardization organization officially stopped working, our industry reacted with a flurry of activity, starting as early as the Danvers Standards Summit, which was called by the dying LISA and concluded its second and closing day only after LISA had been officially declared insolvent. A number of industry stakeholders gathered near Boston in Danvers, Massachusetts, including but not limited to Arle Lommel (LISA OSCAR), Kara Warburton (LISA OSCAR, ISO TC 37 chair), Henry Dotterer (ProZ.com), Jaap van der Meer (TAUS and TDA), Helena Shih Chapman (IBM), Andrzej Zydroń (XTM), Alan Melby (Brigham Young University), Smith Yewell (Welocalize), Joachim Schurig (Lionbridge) and a few dozen more. A number of delegates, including van der Meer, joined Chapman for an improvised meeting at IBM premises the day after to discuss the future of localization standardization. Chapman also disclosed that the Unicode Localization Interoperability Committee had been in the process of creation since late 2010 (it was finally kicked off in May 2011). This was also where van der Meer first pledged to become the industry’s standards watchdog. The ensuing power game produced an interesting and, in my opinion, positive side effect. Standardization is, after a long time, again perceived as something more than just a box to tick off on a complex request for proposal form. A number of stakeholders from across the industry, technical as well as business users, realized that standards are not imposed onto the community by some sort of godlike powers that be and that standards can only be as good as the community engagement that drives them.
The XLIFF TC, as a body with a working interchange standard and a credible vision of a next generation version, benefited nicely from this new interest. Apart from the XLIFF TC attracting IBM, Lionbridge, Oracle, Microsoft and others to join or rejoin the specification effort, all bigger-picture standardization efforts set up within the rolling 12 months before and after Danvers have dependencies on XLIFF, and for most of them XLIFF is instrumental for reaching their stated goals.
Interoperability Now! (IN!) is a quasi-standardization effort driven by Sven Andrä, Andrä/ONTRAM’s owner and CEO, that was announced at Localization World Seattle in October 2010. The goal was to produce an XLIFF profile with extensions to achieve real machine-to-machine interoperability. Although I sympathize with IN!’s goals and regard its XLIFF:doc profile as valuable input for XLIFF 2.0 development, this endeavor cannot be assessed as a complete success. Two of the three involved toolmakers (ONTRAM and Welocalize) were able to make a reference roundtrip implementation prototype of their interoperability package and protocol (including the mentioned XLIFF:doc profile as the payload of the package) within one year of the announcement of the initiative. Their goal was basically to create a standard without a standardization body; but in following the principles of collaboration-driven consensus and transparency (good principles per se), they ended up creating a mock standardization body with restricted participation and hence limited representativeness. XTM has recently joined the IN! effort as the first new entrant since its inception.
I must say that their time would have been better invested in the XLIFF TC than in their limited-representation meetings. The XLIFF Promotion and Liaison Subcommittee recently organized an “infocall” to allow for transparent, if informal, discussion, and IN! members have been repeatedly invited to join the committee and push their business-critical extensions into the standard proper.
Although TDA (TAUS Data Association) is legally distinct from TAUS, both activities are owned by industry veteran Jaap van der Meer and have closely related agendas. Since this distinction is largely obscure and not understood even by industry insiders, it might be good to explain it. TDA (the industry’s “super cloud”) is a not-for-profit association founded in the summer of 2008, whose members pool TM and financial resources to create an industry-relevant collection of TMs that can be used for SMT training, apart from a few other more or less related purposes, such as terminology search throughout the meta-corpus. TAUS, on the other hand, has been characterized as the industry “think tank,” lobbying for automation technologies within the translation and localization industry since November 2004. It added the new epithet of “Interoperability Watchdog” in the aftermath of the LISA dissolution.
Before TAUS became the “watchdog” and formed its Standards Advisory Board, van der Meer had been resisting the call of the joint memberships of both of his industry associations to join TCs and start influencing standardization. TDA bet on TMX, LISA OSCAR’s TM exchange format, without ever making an attempt to influence TMX development. But there are numerous issues with this standard, which is under-engineered and by now far behind industry developments. Level 1 is the lowest common denominator that is widely implemented, but this level only exchanges plain text segments and is therefore of very limited use in industry. Level 2 is being ignored by vendors, as it does not provide standardized inline mark-up semantics. TMX contains no standard mechanism for in-context matching information. Without SRX, which has issues of its own, interoperability is limited, leading to heavy leverage losses on tool migrations. Last but not least, there is no interoperability in the scoping area: the GMX-V specification cannot be considered a standard due to its lack of reference implementations.
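For readers unfamiliar with the levels, the hypothetical fragment below contrasts a Level 1 segment (plain text only) with a Level 2 segment, where paired bpt/ept inline codes carry the original formatting mark-up through the exchange; the content and semantics of those codes are exactly what tools disagree on. These segments are illustrative only, not excerpts from any real TM export.

```java
// Illustrative TMX segments only; not an excerpt from any real TM export.
public class TmxLevelSketch {
    public static void main(String[] args) {
        // Level 1: plain text, widely implemented, but all formatting is lost.
        String level1 = "<seg>Press the Save button.</seg>";

        // Level 2: inline codes wrap the original markup (here a <b> tag, escaped),
        // but their content and semantics are not standardized across tools.
        String level2 = "<seg>Press the <bpt i=\"1\">&lt;b&gt;</bpt>Save<ept i=\"1\">&lt;/b&gt;</ept> button.</seg>";

        System.out.println("Level 1: " + level1);
        System.out.println("Level 2: " + level2);
    }
}
```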
At one of the MultilingualWeb Workshops in Pisa, in April 2011, I was so insolent as to say that TMX is a legacy format (dead, in other words) that had been failing to develop to a next generation version since 2004. In the following discussion it was noted that TMX is not dead, just as compact discs (CDs) are not. I think this does offer a suitable comparison. TMX is as dead as CDs — not yet as dead as vinyl records, which already have that sort of vintage feeling; surviving legacy media lend a certain sense of artistry if they continue to be used.
TMX is a legacy format that still plays an important role in collecting aligned corpora and in migration and sanity-check types of projects. But the importance of the format will continue to drop because the industry has developed multiple differentiators that the format fails to capture properly, and hence TMX falls ever shorter of its postulated goal of lossless TM exchange.
In fact, LISA OSCAR did make an attempt to release a 2.0 version, but the specification unfortunately had several issues. I should like to name but a few of the gravest ones: the draft was effectively created by two active members of OSCAR, so the TC failed to represent industry consensus. OSCAR failed to draw industry attention to the review because there was no consensus to represent; in fact, the standard fell so far behind the standard features of CAT tools that even the industry did not care to form a consensus. Furthermore, the draft proposed to break backwards compatibility with no real business benefit in exchange.
While LISA OSCAR was working on the TMX 2.0 draft that later failed as described above, Lommel joined the XLIFF TC on behalf of LISA and proposed to the XLIFF TC a collaboration of the two bodies on inline markup standardization. The ensuing discussion led to the formation of the XLIFF Inline Markup Subcommittee (chaired by Yves Savourel). A few XLIFF TC members at that point (including Savourel, the editor of TMX 1.4b) were of the opinion that TMX was too far behind industry developments to catch up.
There had always been some basic goodwill between the two bodies, and in fact the lack of real and tight standardization of inline markup in XLIFF was largely due to the XLIFF TC’s willingness to support the inline codes specified by LISA OSCAR for TMX. The inline markup situation in both TMX and XLIFF has often been described by Rodolfo Raya as “markup salad.” Rightly so, but unfortunately the TMX 2.0 draft had not brought much of a breakthrough to the scene.
I conclude with an appeal to the industry to engage in open standards creation and to ensure that our standards remain truly open. It would seem unfortunate to me if localization standards became instruments for patent fee collection.