Standards

Localization Standards Reader 5.0

David Filip

David Filip is a researcher in next-generation localization project and process management and an interoperability standardization expert. He is a current member of the MultiLingual editorial board.

This reader is the fifth installment of MultiLingual’s encyclopedia-in-a-nutshell relating to localization standards. It does not purport to address everything around standards, and was last updated in late 2020. To be informative and of practical use, there must be a focus. This focus is provided via two limitations. First, we look at standards that affect multilingual transformations of content, and at general content life cycle standards only ad hoc, insofar as they have a bearing on multilingual transformations. Second, we look only at technical standards that target actual technical interoperability: file or data formats and/or communication protocols. Abstract metadata, quality or service level standards are discussed only marginally, and only as long as they have a bearing on real machine-to-machine interoperability in localization.

BCP 14 (previously also known as RFC 2119, and now includes RFC 8174)

Short Description: BCP stands for Best Current Practice. BCP 14 defines the standardization-specific meaning of normative keywords such as MUST, MUST NOT, OPTIONAL, REQUIRED, RECOMMENDED and so on. RFC stands for Request for Comments, and these are numbered sequentially. RFC 2119 is the most common normative reference in other specifications throughout information technology standardization bodies. RFC 8174 was added to address the ambiguity of uppercase versus lowercase in RFC 2119 keywords. Localization-related standards such as ITS and XLIFF use BCP 14 keywords to make the normative statements that form the basis of conformance statements, testing and verification.

Owner: Internet Engineering Task Force (IETF), a nonmembership standardization body. Contributors are individuals who implicitly commit themselves by contributing without signing any formal contract. IETF creates internet-related technical standards, protocols, processes and nonnormative informational content. IETF is backed by the Internet Society.

Intellectual Property Rights (IPR) Mode: Reasonable and Non-Discriminatory (RAND), an IPR mode that allows owners to charge for use of essential patents, provided that the charge is “reasonable” and “non-discriminatory.” This is a bit vague, but 2013 saw some groundbreaking legal development with regard to standards offered under RAND IPR policies. In the US District Court for the Western District of Washington, Judge James L. Robart ruled in a RAND case between Microsoft and Motorola that an initial licensing offer from Motorola had not been made in good faith and therefore constituted a breach of contract, in particular of the obligation to license essential patents under RAND conditions to all implementers. The methodology of setting the rates in this case was so well explained and logically constructed that it is being used as precedent not only by other US courts but also outside of the United States and by policy makers.

Current version and work in progress: For a very long time (20 years), BCP 14 had only RFC 2119 as its content, so many standard makers treated BCP 14 and RFC 2119 as synonyms. With the addition of RFC 8174 (RFC 2119 Clarification) it became important to distinguish whether a standardization work product references BCP 14 as a whole, or just RFC 2119. RFC 8174 makes it clear beyond any reasonable doubt that the IETF notion of normative keywords is typography-driven. For example, “MUST” or any other of the defined keywords has its normative meaning only when it is printed in uppercase, as opposed to the ISO notion of normative language, which is concept-driven and in fact forbids typographical distinction of normative keywords.
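The typography-driven rule can be illustrated with a minimal Python sketch. It assumes the keyword list from the RFC 8174 boilerplate; the function name `normative_keywords` is hypothetical and the code is not part of any IETF tooling.

```python
import re

# Keywords as listed in the RFC 8174 boilerplate. Longer phrases come first
# so that "MUST NOT" is not matched as a bare "MUST".
BCP14_KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT", "NOT RECOMMENDED",
    "MUST", "REQUIRED", "SHALL", "SHOULD", "RECOMMENDED", "MAY", "OPTIONAL",
]

# No re.IGNORECASE on purpose: only uppercase occurrences count.
_KEYWORD_RE = re.compile(r"\b(" + "|".join(BCP14_KEYWORDS) + r")\b")


def normative_keywords(text: str) -> list[str]:
    """Return the uppercase BCP 14 keywords found in text, in order (hypothetical helper)."""
    return _KEYWORD_RE.findall(text)


sentence = "Agents MUST NOT strip tags and should log; they MAY retry."
print(normative_keywords(sentence))  # ['MUST NOT', 'MAY']
```

The lowercase "should" in the sample sentence is ignored, which is exactly the behavior that turns a later switch from plain to uppercase into a material change.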

RFC 2119 was released for unlimited distribution in March 1997. RFC 8174 updates RFC 2119 and was released for unlimited distribution in May 2017.

However, it is undesirable to make changes to this BCP because so many normative texts across IETF, W3C, OASIS and so on depend on the meaning of the normative keywords as set out here. It is questionable whether adding RFC 8174 was a good thing, as it deepened the rift between the IETF and ISO notions of normative keywords. It is too early to tell what the impact of the clarification was. Before, if uppercasing was omitted on a keyword in a spec, you could fix it in later publishing stages as it was considered editorial. In theory, it sounds clear to say that only uppercased keywords are keywords, but in practice, it will make corrections difficult. Changing from plain to uppercase has been made into a material change by this supposed clarification. This clarification has some potential to cause interpretation damage. Until uppercasing was made mandatory, editors were dutifully aiming to avoid the keywords and use non-normative semantic variants. Now, many will stop doing that, citing RFC 8174, causing a lot of confusion among ordinary readers. It may well happen that this clarification will cause more changes to this BCP in the future.

BCP 47: Tags for Identifying Languages

Short Description: BCP 47 is a normative IETF track that compiles recommendations on how to create a unique language tag from codes defined in several other normative sources, including ISO codes. It is frequently referenced by OASIS, Unicode and W3C standards.

Owner: IETF.

IPR Mode: RAND.

Current version and work in progress: RFC 5646 was released for unlimited distribution in September 2009. RFC 4647, Matching of Language Tags, was released for unlimited distribution in September 2006. BCP 47 is a persistent name that always points to the latest release, no matter what the current RFC number.

Additions to the BCP 47 component standard ISO 639-3 are under periodic review at its registration authority (RA), SIL International. The status of 2020 changes was determined and applied in December 2020. It is important to note that ISO 639-3 does not contain the actual list of the alpha-3 codes as ISO 639-2 does. The reason is simple: ISO 639-2 only contains several hundred language codes for languages with “large bodies of literature”, and is maintained by the Library of Congress. The actual code lists in that standard are only updated at the time of the standard’s revision, while ISO 639-3 covers almost 8,000 languages following the annually published Ethnologue reports by SIL. The actual alpha-3 codes for ISO 639-3 are continually published at SIL International’s publicly accessible pages at https://iso639-3.sil.org/code_tables/639/data.

BCP 47 itself is stable, which is important for backward and forward compatibility. New component tags are being continuously registered via registration authorities specified in the standard. Other current developments are connected to Unicode extensions for BCP 47.

BCP 47 Extension T: Transformed Content

Short Description: Extension T is possible via the extensibility mechanism defined in BCP 47 (RFC 5646) itself. Extension T has normative status within the Unicode Consortium, as it is being maintained as part of CLDR, which is its major normative deliverable. This extension allows for additional tags specifying from which other language, locale or script the content at hand had been transformed. Extension T is not recommended for usage in structured environments such as XML, where this type of metadata can be specified using markup solutions rather than a single text field. Note that Extension T is appending the information about the originating language or locale with a leading “t,” which means that the BCP tag starts with the target locale and the source locale is appended. This makes sense given the structure of BCP 47 tags, but may be perceived as contrary to the customary listing order of source and target languages, so “EN-t-IT,” for example, actually means that the tagged content is English but was transformed from Italian, not the other way around.
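The target-first ordering is easy to get wrong in code, so here is a minimal Python sketch of reading such a tag. It is a simplification, not a full BCP 47 parser: it assumes the tag is well formed, lowercases it for comparison (tags are case-insensitive), and treats a letter-plus-digit subtag such as `m0` as the start of the extension's own fields. The function name `split_transformed_tag` is hypothetical.

```python
import re
from typing import Optional


def split_transformed_tag(tag: str) -> tuple[str, Optional[str]]:
    """Return (language of the content, language it was transformed from)."""
    subtags = tag.lower().split("-")
    if "t" not in subtags[1:]:
        return tag.lower(), None          # no Extension T present
    t_index = subtags.index("t", 1)
    content = "-".join(subtags[:t_index])
    source = []
    for sub in subtags[t_index + 1:]:
        # Stop at the next singleton or at a field key such as "m0".
        if len(sub) == 1 or re.fullmatch(r"[a-z][0-9]", sub):
            break
        source.append(sub)
    return content, "-".join(source) or None


print(split_transformed_tag("EN-t-IT"))  # ('en', 'it')
```

The first element is always the language of the content at hand; the second is where it came from, which is the reverse of the customary source-then-target order.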

Owner/Maintainer: IETF/Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Informational RFC 6497, published in February 2012. While RFC 6497 itself is stable, Extension T data and field definitions are regularly maintained as part of the CLDR release cycle.

BCP 47 Extension U: Unicode Locale Extension for BCP 47

Short Description: Extension U is possible via the extensibility mechanism defined in BCP 47 (RFC 5646) itself. It has normative status within the Unicode Consortium as it is being maintained as part of CLDR.

Owner/Maintainer: IETF/Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Informational RFC 6067 was published in December 2010, and is maintained by the Unicode Consortium as part of CLDR. Extension U is regularly maintained as part of the CLDR release cycle. The Unicode Localization Interoperability Technical Committee, known since early 2018 as the CLDR Technical Committee Subcommittee (CLDR TC/ULI SC), delivers the input exception data for sentence-breaking mechanisms for different locales for the periodic CLDR releases. As a result, sentence-breaking behaviors driven by different exception data can be specified through assigned keys under the Extension U mechanism.
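As an illustration of how such keys travel inside a language tag, the following Python sketch pulls the key and value pairs out of the U extension. It assumes a well-formed tag and ignores attributes and the canonicalization rules of UTS #35; the function name is hypothetical, and the `ss` key with the value `standard` is used here on the assumption that it is the key CLDR assigns to sentence break suppressions.

```python
def unicode_extension_keywords(tag: str) -> dict[str, str]:
    """Return the key/value pairs of the U extension in a language tag (simplified)."""
    subtags = tag.lower().split("-")
    if "u" not in subtags[1:]:
        return {}
    keywords: dict[str, list[str]] = {}
    key = None
    for sub in subtags[subtags.index("u", 1) + 1:]:
        if len(sub) == 1:        # the next singleton ends the U extension
            break
        if len(sub) == 2:        # a two-character subtag starts a new key
            key = sub
            keywords[key] = []
        elif key is not None:    # longer subtags are values of the current key
            keywords[key].append(sub)
    return {k: "-".join(v) for k, v in keywords.items()}


print(unicode_extension_keywords("de-u-co-phonebk-ss-standard"))
# {'co': 'phonebk', 'ss': 'standard'}
```

A segmentation component could read the value under `ss` to decide whether locale-specific exception data should be applied when breaking sentences.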

Notably, the canonicalization algorithm included in RFC 6067 is slightly out of date. The current provision is specified in UTS #35. The canonicalization algorithm in RFC 6067 should therefore be updated.

A Few Terms

bidirectional: a mixture of characters within a text where some are read from left to right and others from right to left. Bidirectional or bidi refers to an application that allows for this variance.

content management system (CMS): a system used to store and subsequently find and retrieve large amounts of data. CMSs were not originally designed to synchronize translation and localization of content, so many have been partnered with globalization management systems.

CHM: an extension for the Compiled HTML file format, most commonly used by Microsoft’s HTML-based help program.

Extensible Markup Language (XML): a markup language specification pared down from SGML, an international standard for the publication and delivery of electronic information, designed especially for web documents.

intellectual property rights (IPR): rights relating to creations of the human intellect, primarily encompassing copyrights, patents and trademarks.

Organization for the Advancement of Structured Information Standards (OASIS): an IT standardization consortium based in the state of Massachusetts. It works on the development, convergence and adoption of open standards for a variety of areas. Its foundational sponsors include IBM and Microsoft. Localization buy-side, toolmakers and service providers are also well represented.

Simple Object Access Protocol (SOAP): a messaging protocol that allows programs that run on disparate operating systems (such as Windows and Linux) to communicate using Hypertext Transfer Protocol (HTTP) and Extensible Markup Language (XML).

technical committee (TC): standardization bodies usually own, create, maintain and update technical standards through purpose-specific technical committees. In organizational structures such as OASIS, Unicode and ISO, they are called technical committees, while in others such as W3C they are not. They may also be referred to as an Industry Specification Group, Working Group, Special Interest Group and so on.

translation management system (TMS): sometimes also known as a globalization management system, a TMS automates localization workflow to reduce the time and money employed by manpower. It typically includes process management technology to automate the flow of work and linguistic technology to aid the translator.

Web Services Description Language (WSDL): an XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information.

World Wide Web Consortium (W3C): an international community that develops and owns many standards, including XML and HTML.

XML Schema Definition (XSD): a W3C recommendation that specifies how to formally describe the elements in an Extensible Markup Language (XML) document.

CLDR

Short Description: Unicode Common Locale Data Repository, http://cldr.unicode.org, is a standard repository of internationalization building blocks, such as date, time and currency formats, sorting (collation) rules and so on. CLDR is not a standard in a classical sense. It is, as the name suggests, a repository that is being constantly updated and released on a rolling basis following its data release process.

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Version 38 was released on October 28, 2020.

A limited winter submission period for version 39 started in November 2020, and the full summer submission period for version 40 will start in May 2021. CLDR is released on a regular semiannual schedule. The cycle starting in the fourth quarter of each year is focused on tooling and bug fixing, and usually skips the public data submission phase.

Common Translation Interface (COTI)

Short Description: The Association of German Manufacturers of Authoring and Content Management, shortened to DERCOM in German, created this standard.

DERCOM claims that it had to step up to protect and promote the interests of its membership and, at the same time, to let translation providers use the translation management systems (TMS) of their choice. DERCOM argues that no such interface exists, and it is true that a common interface is better than the present jungle of proprietary APIs on both ends that can only provide interoperability after some custom development on both ends for each new interface. Arguably this can be addressed by an Enterprise Service Bus or messaging architecture, but not every content management system (CMS) owner has the resources or technical muscle to do that.

COTI level 3 provides a sophisticated state machine that offers synchronous support for translation orders, cancellations and even updates of existing orders, which is notoriously tricky. While DERCOM is right that no translation web service interface has been standardized, it is surprising that COTI only standardized the business metadata interface and doesn’t say anything about payload standardization. Standardizing over a canonical data model such as XLIFF would be the obvious choice, but COTI does not mandate any data model restrictions with regard to the payload. So this specification only solves the CMS part of the equation. The bundle is thrown over the wall and it’s up to the language service provider to figure out what is localizable, what is not and so on.

While German CMS providers might be happy to implement always-synchronous Simple Object Access Protocol (SOAP) web services, the general trend seems to be rather toward RESTful interfaces and microservices, and it might hinder the adoption of COTI that level 3 strictly enforces synchronous SOAP calls. REST refers to Representational State Transfer, a widely known web services architectural best practice that has never been officially standardized. Unlike SOAP, which is a specific standardized protocol, there is no single REST standard or even a single best practice. REST is rather an architectural style for interoperability.

Level 1 is actually the specification of the payload ZIP package, its structure and manifest. Sound familiar? People just keep reinventing this very wheel, albeit each time with subtle differences that don’t allow for automation among those “standard” ZIP packages. Electronic Dossier, Linport, TIPP and now COTI level 1 — all of these just defined a folder structure that they think a translation request and response should contain without talking to each other. Alan Melby and others invested some diplomatic effort into making the Electronic Dossier people talk to the Linport people and merge the Linport project with TIPP. But even before this could bear fruit, DERCOM COTI brought another “standard” folder structure.

COTI level 2 mandates automated exchange of the structured ZIP packages over hot (watched) folders. Whatever is placed in a hot folder is up for grabs at the receiving end. The TMS watches the CMS’s hot folder for new projects and the CMS watches the TMS’s hot folder for responses. Sound like the 1990s?
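The hot folder pattern itself is simple enough to sketch in a few lines of Python. This is a generic polling illustration, not COTI-conformant code: the function name, the `*.zip` file pattern and the idea of tracking already seen file names in a set are all assumptions made for the example, and nothing here reads the package manifest.

```python
from pathlib import Path


def new_packages(hot_folder: Path, seen: set[str], pattern: str = "*.zip") -> list[Path]:
    """Return packages that appeared in hot_folder since the previous poll."""
    fresh = sorted(p for p in hot_folder.glob(pattern) if p.name not in seen)
    seen.update(p.name for p in fresh)   # remember them so they are picked up only once
    return fresh
```

A TMS-side process would call `new_packages` on the CMS's folder at a fixed interval, and the CMS would do the same on the TMS's folder for responses. Everything beyond that, such as partially written files or orders that need to be cancelled, is left to each implementer, which is where the 1990s feeling comes from.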

The interesting contribution to the topic of standardized translation web services is in the level 3 compliance, which requires SOAP and doesn’t allow or specify any other admissible bindings, although specifying such bindings would certainly be advisable.

Owner: DERCOM. DERCOM is not a standardization body but a German trade association. It nevertheless decided to provide a German-producer-centric standardization and, as it says, a “non-proprietary” solution to CMS and TMS interoperability.

IPR Mode: Unclear. The specification, documentation (CHM) and validation artifacts (WSDL, XSD) are available for free, but no restrictions seem to have been specified with regard to licensing of IPR essential to implementing the specification. The specification says to look up “the intellectual property rights section of the technical committee web” at www.dercom.de but there seems to be no such section. This unfortunately means that not even (F)RAND restrictions apply for licensing offers made for essential IPR by their respective owners. This is at least until DERCOM actually comes out with publicly visible IPR information.

Current version and work in progress: COTI version 1.1.1 was released on July 7, 2017, as a minor update from version 1.1, released on April 29, 2016. The 1.0 version was first approved on May 25, 2014, and was subject to only minor editorial fixes by September 2015. Version 1.1 brought some additional material such as the XML Schema, but also changes in the WSDL artifacts.

There doesn’t seem to be a specific plan to further develop the specification. Rather, it seems that DERCOM makes changes and fixes as its membership gradually progresses in fulfilling its obligation to implement COTI level 1 through COTI level 3 compliance. The WSDL changes in version 1.1 were most probably driven by the implementation experience of the DERCOM membership.

International Components for Unicode (ICU)

Short Description: Until May 2016, ICU was an IBM-driven open source project, and in fact the most important reference implementation of both Unicode and CLDR. As ICU provides internationalization libraries, it actually consists of two subprojects, ICU4C and ICU4J, that provide libraries for C and C++ and Java respectively. Spectacularly, ICU is the common denominator of both Android and iPhone operating systems. ICU 58 is the first ICU version released under the ICU TC governance.

The ICU project implicitly defines a “message format” (See Message Format). Since the ICU message format has been implemented in a number of internationalization libraries, and usage of those messages has been known to cause interoperability issues on the boundary with localization, a new Message Format Working Group (MFWG) has been formed under the Unicode CLDR TC to work on the message format successor.

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: ICU 67 for Unicode 13.0 and CLDR 37 is the current version. ICU4J 67.1 was released on April 22, 2020. The URL https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/ always points to the current official version. The legacy URL now redirects to the common cover page for both ICU4J and ICU4C.

ICU (both the Java and C projects) is being developed continually to address bug fixes and dependencies. Major releases are driven by the releases of the Unicode Standard and changes in CLDR data. New major releases were published in late 2020, just after stabilization of CLDR 38 at the end of October. Since all ICU infrastructure migrated from SVN to GitHub back in summer 2018, the project has transitioned its spec reference mechanism to unicode-org.github.io, that is, to documentation webpages enabled and hosted by GitHub.

Internationalization Tag Set (ITS)

Short Description: ITS 2.0 comprises 19 metadata categories compared to a mere seven in ITS 1.0. Additionally, ITS 1.0 metadata categories were primarily designed for internationalization of XML content. Nevertheless, as abstract data categories, ITS can be implemented in non-XML environments. Importantly, ITS 2.0 normatively specifies usage of the old and new ITS data categories for XML and HTML 5 content, and new categories have been introduced that explicitly address the localization roundtrip, such as the localization quality assurance related data categories Localization Quality Rating and Localization Quality Issue. Importantly, ITS 2.0 is listed in the JTC 1 Big Data Standards Roadmap (ISO/IEC TR 20547-5) as a key automation enabler for big data and analytics dealing with human language.

Owners of ITS decorated content want their internationalization and localization related metadata to inform the roundtrip and make it to the target content in a meaningfully processed state that allows for drilling down into the process and for reconstructing the audit trail. Localization workflow managers should pay attention to information flows directed by the ITS data categories introduced by their customers up in the tool chain. There is also potential to introduce automated or semiautomated ITS decoration steps before extraction, or to introduce relevant XLIFF mappings of ITS data categories recorded on translations during the localization roundtrip that can be imported back into ITS in XML or HTML on merging back localized content. Categories such as Translate (which had become a native HTML 5 attribute), Elements Within Text, Locale Filter, Target Pointer or External Resource should drive extraction and merging back of localizable content. Terminology and Text Analysis disambiguation markup should be passed on to human and machine translators. Proper interpretation of directionality markup is a must for sound handling of bidirectional content using Arabic or Hebrew scripts. Self-reported machine translation (MT) confidence should be passed on to the content recipients, possibly along with quality assurance related metadata (Localization Quality Issue and Localization Quality Rating).

These considerations are especially valid when hooking up existing localization workflows upward into the tool chain. Existing workflows should introduce mappings of ITS data categories used in source content, so that the metadata flow is not broken throughout the content life cycle.
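How a data category such as Translate can drive extraction may be sketched with Python's standard `html.parser`. This is a minimal illustration under stated assumptions: it handles only the native HTML 5 `translate="yes|no"` attribute, expects properly nested markup with explicit end tags, and ignores ITS global rules entirely. The class name `TranslatableText` and the list of void elements are choices made for the example.

```python
from html.parser import HTMLParser

# Elements without end tags; they must not affect the inheritance stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TranslatableText(HTMLParser):
    """Collect text chunks that are not excluded by translate="no"."""

    def __init__(self):
        super().__init__()
        self.stack = [True]   # translatability inherited from ancestors
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        value = dict(attrs).get("translate")
        if value == "no":
            state = False
        elif value == "yes":
            state = True
        else:
            state = self.stack[-1]   # no local setting: inherit from parent
        self.stack.append(state)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack[-1] and data.strip():
            self.chunks.append(data.strip())


parser = TranslatableText()
parser.feed('<p>Run <code translate="no">git push</code> now.</p>')
print(parser.chunks)  # ['Run', 'now.']
```

A real extractor would keep the protected span as an inline placeholder in the bitext rather than dropping it, so that the metadata survives the roundtrip and the content can be merged back.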

The current ITS 2.0 categories are Translate (flag indicating translatability or not); Localization Note (for alerts, hints, instructions); Terminology (to identify terms and nonterms and optionally provide definitions); Directionality (manages left to right/right to left display behaviors of content portions); Language Information (BCP 47 language tags on relevant content portions); Elements Within Text (shows which elements break flow or not to help encode segmentation); Domain (to identify content theme or topic for better choice of relevant services or creation of training corpora); Text Analysis (for inclusion of automatically provided disambiguation data and term candidates); Locale Filter (indicates which portions of text are relevant for which locales); Provenance (tracks agents, machine or human, involved in content transformations); External Resource (global pointers to external localizable resources such as images or other binary data); Target Pointer (identifies transformation targets in multilingual documents); ID Value (rule to provide unique content identifiers to be maintained throughout transformations); Preserve Space (specifies whitespace handling); Localization Quality Issue (provides a way to mark up and classify specific language quality assurance issues); Localization Quality Rating (quality rating expressed in a single 0-100 score); MT Confidence (provides self-reported MT quality score as a single 0-1 number); Allowed Characters (specifies restrictions on character data using simple regular expressions); and Storage Size (a simple encoding-based restriction mechanism to make sure that translations fit restricted database fields, forms and so on).
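The last two categories lend themselves to simple automated checks on a translation. The Python sketch below shows the idea under a few assumptions: the function names are hypothetical, the size limit is counted in bytes of the stated encoding, and the character restriction is taken to be a single character class that Python's `re` module interprets the same way as the regular expression flavor used by ITS.

```python
import re


def fits_storage(text: str, max_bytes: int, encoding: str = "utf-8") -> bool:
    """Check that text fits a field limited to max_bytes in the given encoding."""
    return len(text.encode(encoding)) <= max_bytes


def uses_allowed_characters(text: str, char_class: str) -> bool:
    """Check that every character of text matches the character class."""
    return re.fullmatch(f"(?:{char_class})*", text) is not None


# "Größe" has five characters but needs seven bytes in UTF-8.
print(fits_storage("Größe", 5))                 # False
print(fits_storage("Größe", 5, "iso-8859-1"))   # True
print(uses_allowed_characters("abc1", "[a-z]")) # False
```

The German example shows why the restriction is encoding-based rather than character-based: a translation can be no longer than its source on screen and still overflow the database field.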

Owner/Maintainer: MultilingualWeb-LT Working Group/ITS Interest Group (IG). The World Wide Web Consortium (W3C) ITS IG has been the informal maintainer of ITS 2.0 after the MultilingualWeb-LT Working Group mandate expired; however, interest groups in W3C cannot create normative deliverables. Therefore a new Working Group will have to be formed to commence work on the successor standard, as soon as the ITS IG identifies industry need for a successor version.

However, the status of the ITS IG is unclear, as its charter expired on December 31, 2018, and it is not clear at the moment if the charter can be extended. In case ITS IG lapses, stakeholders of ITS should initiate a W3C community group to look after the adoption and maintenance of ITS 2.0. The main purpose of such a group is to see if there’s a need and momentum to create a new feature or maintenance release of the Internationalization Tag Set.

IPR Mode: Royalty Free (RF), an IPR mode that mandates and guarantees royalty free use of essential patents in order to implement a standard.

Current version and work in progress: The current version is 2.0, published as a full W3C Recommendation on October 29, 2013. Work on another major version is not imminent. That said, ITS 2.0 covered a number of new areas with nonnormative mappings. These included Resource Description Framework (RDF) and XLIFF mappings. See XLIFF for inclusion of the ITS functionality as an XLIFF 2.1 module. Current collaborations and liaisons include OASIS DITA, DocBook, XLIFF and XLIFF OMOS TCs, as well as GALA TAPICC.

ITS IG is working on further elaborating and maintaining other informative mappings, such as NIF and MQM. However, the MQM mapping aspect transitioned to the MQM W3C Community Group. ITS IG also maintains ITS extensions that support categories that could not be normatively specified for various reasons within the ITS 2.0 publishing schedule; for instance, the Readiness data category.

JSON Localization Interchange Fragment Format (JLIFF)

Short Description: JLIFF aims to be compliant with the abstract object model defined in UML diagrams and a prose specification as XLIFF OM. The TC currently works on XLIFF unit JSON representation examples, JSON schema and JSON templates to represent the XLIFF OM UML diagrams and XLIFF unit XML examples.

Owner: OASIS XLIFF Object Model and Other Serializations (OMOS) TC. JLIFF is currently available as a JSON Schema that is reasonably stable but not yet officially published by OASIS.

JLIFF’s main intended use case is the real time interoperability of TMS and CAT tools that support the XLIFF 2 data model. Importantly and unlike XLIFF, JLIFF can support exchange of fragments at unit, group and file levels. For full XLIFF interoperability, it also supports exchange of whole XLIFF Document JLIFF equivalents, but this is not the main use case. It is important to stress that in order to preserve data integrity the lowest interchange level is that of logically separate bitext units and not arbitrary segments.

IPR Mode: Non-Assertion (RF).

Current work in progress: In summer 2018, JLIFF JSON 2.0 and 2.1 schemas were completed by the committee on the XLIFF OMOS TC GitHub repository. The TC started working on a prose specification that will hopefully be reviewed in early 2021. The JLIFF schema has been attracting implementers even before becoming stable. Notably, Vistatec released an open source implementation of core JLIFF in March 2018 at https://github.com/vistatec/JliffGraphTools. Reportedly, TDC has implemented a RESTful API that is capable of three-way transposition among JLIFF, XLIFF 2, and XLIFF 1.2. It seems that the capability of JLIFF to create fragments compatible with the XLIFF 2 data model is exactly what implementers need for real time API-based interoperability.

This is also the assumption behind the GALA TAPICC Track 2, which was chartered to specify a real time translation API (JRARTEBU – JLIFF REST API for Real Time Exchange of Bitext Units).

The XLIFF OMOS TC was chartered in December 2015 and started to develop a JSON serialization in parallel to the creation of the abstract object model for the XLIFF 2 family of standards that it is developing as its first priority. The objective is to create such a JSON serialization of XLIFF and XLIFF fragment data categories that would allow lossless interchange between JLIFF and XLIFF based tool stacks. The development of the JLIFF specification and artifacts is being conducted on this GitHub repository: https://github.com/oasis-tcs/xliff-omos-jliff. The repository is public but the contributors need to be or become OASIS XLIFF OMOS TC members. However, anyone can raise issues or comments via the associated issues or wiki at https://github.com/oasis-tcs/xliff-omos-jliff/wiki, as well as minor bug fixes via pull requests. It’s up to the repository maintainers if those pull requests will be merged or not, which is the usual GitHub collaboration workflow.

Message Format (MF)

Short Description: ICU message format has been in use for 20 years through ICU APIs (both in ICU4C and ICU4J). It is at the core of many internationalization libraries across operating systems and development frameworks. A couple of chief issues of the current format were identified as part of the initial requirements analysis. First, the design is not modular and allows neither deprecation of obsolete features nor addition of newly required ones. Second, the capability to use internal selectors within messages often leads to translatability issues, typically in cases where the target language is morphologically more complex than the source language and hence requires more message variants than the source message postulates. This causes situations where the linguist is not able to produce all the variants required in the target language, or to structure the translation so that it remains grammatical in the target language in all cases after the variables and selectors have been resolved at application runtime.
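The variant mismatch can be made concrete with plural selection. The Python sketch below is not ICU code: it hand-codes the CLDR plural categories for English and Polish, simplified to non-negative integers, and the function names and sample messages are made up for the illustration.

```python
def plural_category_en(n: int) -> str:
    """English, integers only: 'one' for 1, otherwise 'other'."""
    return "one" if n == 1 else "other"


def plural_category_pl(n: int) -> str:
    """Polish, integers only: 'one', 'few' or 'many'."""
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


# The source message postulates two variants ...
source_en = {"one": "{n} file", "other": "{n} files"}
# ... but a grammatical Polish translation needs three.
target_pl = {"one": "{n} plik", "few": "{n} pliki", "many": "{n} plików"}

for n in (1, 2, 5, 12, 22):
    print(source_en[plural_category_en(n)].format(n=n), "|",
          target_pl[plural_category_pl(n)].format(n=n))
```

If the exchange format or the translation tool only lets the linguist fill in the two slots that exist in the English source, there is no place to put the third Polish form, and the message will be ungrammatical for some values at runtime.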

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: ICU MessageFormat, implicitly defined as part of ICU (See ICU). Reacting to issues associated with ICU MessageFormat and to demand expressed by ECMA TC39 (JavaScript), in late 2019, Unicode Consortium formed a Working Group under its CLDR TC (Unicode CLDR TC/MFWG) to work on a Message Format successor, informally known as Message Format 2.0. The group was chartered to produce a Unicode Technical Standard (See UTS) that will describe the new format. The group further refined its goals within the mandate it received from the Unicode Consortium and ECMA to produce the format, related APIs, and importantly a mapping to and from XLIFF 2 to enable an interoperable localization roundtrip of those messages.

This work is still in the inception phase: the group has been chartered, has agreed on its goals and has also stated potential objectives that are considered out of its scope (non-goals). The design principles discussion is currently in full swing and will probably continue for some time as first proof of concept efforts are being explored by group members to validate the slowly growing set of design principles. The new message format group set the goal to be translation roundtrip friendly and to define an XLIFF 2 mapping for localization interoperability. Importantly, this group’s effort is driven by internationalization engineers and developers across major buyer organizations such as Amazon, Apple, CaixaBank, Dropbox, Expedia, Facebook, Google, Mozilla, Oracle, and PayPal.

Multidimensional Quality Metrics (MQM)

Short Description: MQM is primarily a set of error data categories that can be used in language quality assessment, especially in bilingual translation scenarios, but also in monolingual review or authoring scenarios. The aim of MQM is to cover the whole logical space of possible errors and nonconformities related to language and locale-specific presentation of content of any kind.

It is impossible (and, even if it were possible, unlikely to be useful) to use all the error data categories from all levels and sub-levels. Profiling is a key method defined by MQM: in any particular project or job, a particular profile or subset of relevant error data categories has to be defined based on stakeholder requirements and expectations. Thus MQM concentrates on quality in the sense of fitness for purpose. It specifically rejects the notion of some general or abstract quality without defining the purpose, related requirements and a resulting relevant set of possible error types. Also notable is that MQM doesn’t define any severity levels. There is an option for implementers to predefine severity levels and scoring methods based on any particular MQM subset.

Currently MQM is a proper superset of DQF. MQM and DQF had been developed separately by the QT21 EU project (coordinated by DFKI) and TAUS respectively. The EC funders effectively forced these two projects to reconcile their data models.

For effective exchange of MQM/DQF metadata within projects and jobs, inline capturing and encoding of the recorded errors is critical. This needs to be implemented as the ITS localization quality issue data category in native formats or via the same data category implemented as an XLIFF Version 2.1 module in bitext.

Owners: ASTM International (ASTM), founded in 1898 and formerly known as the American Society for Testing and Materials, as well as the World Wide Web Consortium (W3C).

IPR Mode: RAND at ASTM. No IPR mode under W3C, as it is not being developed under the recommendation track, just republishing the ASTM maintained structure of data categories.

Current version and work in progress: Version 1.0 was released December 30, 2015. This URL leads to the latest published version: www.qt21.eu/mqm-definition.

The W3C MQM Community Group has been reactivated to publish updates from ASTM F43 at www.astm.org/DATABASE.CART/WORKITEMS/WK46396.htm.

The W3C WG had so far published one full draft that explicitly did not take into account updates from ASTM F43. These updates were promised to be included in the next draft. To date, however, only two new working drafts (with unresolved editorial notes) have followed. Most recently (in June 2020), a draft of only the top-level categories was published, which superseded the April 2019 version.

A working list of MQM Terminology was published on September 21, 2020. This is presumably terminology adopted by ASTM F43.

As to the developments inside ASTM, which are normally only visible to members, the current target is to have ASTM WK46396 (MQM) ballot-ready in April 2021, with subsequent publication following the consensus-based ASTM process. In the interim, the project team within the F43 committee will participate in stakeholder engagement to improve the draft and ensure that the resulting document is ready for approval and adoption. In the opinion of standards expert Alan Melby, WK46396 has completed development and harmonization of the terminology in the standard and related documents. It has made substantial updates to the MQM error typology in response to feedback and to support harmonization with ISO 5060, which is in working draft stage at ISO TC 37/SC 5. The ISO 5060 and MQM teams have settled on consistent terminology for error types and are working toward a common basis for the categorization of errors.

The ASTM F43 committee has created the first draft of a root cause analysis and a typology of root causes for translation errors, and an effort was made to harmonize these with ASTM WK54884 as well. Further in the pipeline for MQM are reference quality scores that will allow comparison of quality evaluation across organizations and translation scenarios, as well as translation grades as a way of discussing expectations for translation.

Looking at the above described developments it seems that MQM continues to diverge from the Localization Quality Issue Types in W3C ITS (See ITS) and hence doesn’t make progress in making the MQM data and metadata practically usable inline within source text, bitext, and target text.

Open Lexicon Interchange Format (OLIF)

Short Description: OLIF is a stable and relatively widely used lexicon interchange format. It has a rich metadata structure and allows for the exchange of complex lexicon entries for various purposes, such as terminology management and MT. OLIF was designed for use in both monolingual and multilingual contexts via cross-linking of “mono” elements.

Owner: OLIF Consortium, an ad hoc industry consortium driven by SAP and set up in 2000.

IPR Mode: Unclear. The specifications and schemas are available for free, but no IPR mode seems to have been specified.

Current version and work in progress: Version 2.1 is current. Version 3 has been in beta since 2008; no current work seems to be under way.

Segmentation Rules eXchange (SRX)

Short Description: SRX is an XML vocabulary that facilitates the exchange of segmentation rules between TMX compliant systems. SRX’s relationship to Unicode is not a transparent one, and SRX can be considered incomplete from the engineering point of view. However, its proclaimed goal was not to provide a set of segmentation rules for a number of languages, but rather to provide a mechanism to exchange the rules to improve TMX interoperability. TMX often fails to guarantee its targeted lossless transfer of TM data due to segmentation differences, chief among other issues. The current SRX incarnation works on a closed world assumption, meaning it recreates (and adapts) UAX #29 rules. UAX #29 is referenced and its study encouraged, but the relationship is currently not a maintainable linkage.
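
How an exchanged rule set drives segmentation can be sketched as follows. The fragment uses the SRX 2.0 element and attribute names (`rule`, `break`, `beforebreak`, `afterbreak`) but is deliberately simplified: the header, namespace and language maps are omitted, the sample rules are made up for illustration, and Python regular expressions stand in for the regex flavor a real SRX processor would use. The sketch assumes that rules are tried in document order and that the first rule matching at a position decides whether to break there.

```python
import re
import xml.etree.ElementTree as ET

# Simplified SRX-style fragment (assumption: header, namespace, maprules omitted).
SRX_FRAGMENT = r"""
<languagerule languagerulename="English">
  <rule break="no">
    <beforebreak>\b(Dr|Mr|etc)\.</beforebreak>
    <afterbreak>\s</afterbreak>
  </rule>
  <rule break="yes">
    <beforebreak>[.?!]</beforebreak>
    <afterbreak>\s</afterbreak>
  </rule>
</languagerule>
"""

def load_rules(xml_text):
    """Return (is_break, before_regex, after_regex) tuples in document order."""
    rules = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        before = rule.findtext("beforebreak", default="")
        after = rule.findtext("afterbreak", default="")
        rules.append((
            rule.get("break", "yes") == "yes",
            re.compile("(?:" + before + r")\Z"),  # must end exactly at the position
            re.compile(after),                    # must start exactly at the position
        ))
    return rules

def segment(text, rules):
    """Split text at every position where the first matching rule says break."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides; later rules are ignored
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

rules = load_rules(SRX_FRAGMENT)
print(segment("Dr. Smith arrived. He sat down.", rules))
```

Dropping the no-break rule makes the same text split after "Dr.", which is the kind of segmentation difference between tools that SRX was meant to make exchangeable.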

According to former Unicode Localization Interoperability Technical Committee (ULI TC) chair Helena Chapman, SRX developers should take UAX #29 (and its ICU implementation) for granted and use SRX only for the exchange of rules that differ from the standard UAX #29 behavior. ULI TC has been collecting natural language exceptions to UAX #29 for several major languages and has included these in the CLDR release cycle. CLDR, however, uses LDML as its description language, not SRX.

Owner: ETSI ISG LIS. ISO/TC 37/SC 4 was intended as a co-owner of SRX. However, no memorandum of understanding between the groups was signed. This work item expired and was deleted on July 22, 2013, according to the ISO standards publishing policy.

Unicode ULI TC (now CLDR TC/ULI SC) agreed to host the current SRX version as a publicly available specification after ETSI ISG LIS disbanded. However, it is unclear if ULI SC would be legally allowed to produce a new version without formally negotiating IP transfer from ETSI.

IPR Mode: RF for the LISA version; FRAND in ETSI (but wasn’t republished).

Current version and work in progress: A copublication attempt was made with ISO TC37/SC 4 as ISO CD 24621, which successfully passed the committee draft ballot, but no work progressed and the project was deleted on July 22, 2013.

There is no current work in progress due to ETSI ISG LIS dissolution. Version 2.0 was released by LISA OSCAR on April 7, 2008. Based on the published executive summary from November 2011, ETSI ISG LIS scheduled an SRX meeting for March 2012. This meeting happened behind closed doors. There are or should be dependencies with ULI TC on UAX #29 segmentation behavior modifications. SRX has potential as an XML based exchange vehicle for segmentation rules, because there is no ultimate finite solution to the segmentation issue in natural languages. In 2018, ULI TC started a note on segmentation and wordcounting that would build on the principles of UAX #29. The development of this note was abandoned in 2019 due to lack of interest.

Simple Object Access Protocol (SOAP)

Short Description: SOAP is an XML-based web services protocol. Version 1.0 was submitted to IETF in autumn 1999 as an internet draft but never reached an RFC status, so it actually hadn’t become a standard at IETF. SOAP 1.1 had Note status at W3C and the only SOAP version that ever reached the standard status (W3C Recommendation) is SOAP 1.2.

SOAP consists of five layers: message format, transfer protocol bindings, message processing models, message exchange patterns and extensibility. Unlike REST, which in practice is almost always used over HTTP, SOAP is protocol-neutral and can work over several lower level protocols such as HTTP (Hypertext Transfer Protocol), SMTP (Simple Mail Transfer Protocol), TCP (Transmission Control Protocol), UDP (User Datagram Protocol) or JMS (Java Message Service).
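
The message format layer can be illustrated with a minimal sketch that builds a SOAP 1.2 envelope using only the Python standard library. The envelope namespace is the one defined for SOAP 1.2; the payload namespace `urn:example:translation` and the `GetJobStatus` and `jobId` names are hypothetical and do not come from any real service.

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
EXAMPLE_NS = "urn:example:translation"                 # hypothetical service namespace

def build_envelope(payload):
    """Wrap a payload element in a SOAP 1.2 Envelope with Header and Body."""
    ET.register_namespace("env", SOAP12_NS)
    envelope = ET.Element("{%s}Envelope" % SOAP12_NS)
    ET.SubElement(envelope, "{%s}Header" % SOAP12_NS)   # optional, left empty here
    body = ET.SubElement(envelope, "{%s}Body" % SOAP12_NS)
    body.append(payload)
    return ET.tostring(envelope, encoding="utf-8")

# Hypothetical request payload.
request = ET.Element("{%s}GetJobStatus" % EXAMPLE_NS)
ET.SubElement(request, "{%s}jobId" % EXAMPLE_NS).text = "42"

message = build_envelope(request)
print(message.decode("utf-8"))
```

The resulting bytes are independent of the transport. With the HTTP binding they would typically be sent as the body of a POST request with the `application/soap+xml` media type, while other bindings would carry the same envelope unchanged.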

Owner/ Maintainer: W3C XML Protocol Working Group (closed) / no current maintainer. The SOAP 1.2 multipart recommendation was produced by the XML Protocol Working Group at W3C, and the group was closed on July 10, 2009.

IPR Mode: RF.

Current version and work in progress: 1.2 (Second Edition) was published as a W3C Recommendation on April 27, 2007. The protocol as such is not being developed or maintained at W3C or elsewhere. We list SOAP because it was resurrected in our industry by the COTI level 3 conformance requirement to implement a SOAP-based web service automation.

TBX, also known as ISO 30042

Short Description: TBX is a family of XML-based terminology markup languages that should allow for lossless exchange of terminology-related data and metadata. So far, two more lightweight versions known as TBX Basic (published in 2008 by LISA Terminology Special Interest Group) and TBX-Min (published in 2013 by LTAC Global) have been developed. TBX-Min should be targeting the use case of exchanging terminology with translators in the form of simple glossaries mappable onto UTX. However, TBX Basic is more suitable for mapping between TBX and XLIFF 2.0 with glossary modules.

TBX has been criticized for industry disconnect, for being too heavy on one hand and being too restrictive and not very suitable for MT training on the other. One of the reasons might be that TBX is supposed to be both a representation and exchange format for terminology, but it has been struggling to define a minimum set of terminology metadata suitable for practical interchange in a localization context. Some industry implementations can hardly be considered in the spirit of the standard, as they have not enforced inclusion of even very basic metadata such as part-of-speech or they are not structurally compliant with any of the predefined data structures.

Owner: ISO TC 37/SC3. LTAC for all public and some private dialects; LISA (defunct as of 2011) and ETSI for legacy versions published by 2008.

IPR Mode: RAND within ISO; a BSD 3-Clause licensed open source project at LTAC/TerminOrgs.

Current version and work in progress: ISO 30042:2019 (informally TBX 3) is the current version. It is only the second ISO edition, and for that reason its core namespace is and must be urn:iso:std:iso:30042:ed-2. It is nevertheless the third major version of TBX if the developments in LISA OSCAR before the ISO co-publishing agreement are counted.

The TBX Steering Committee has agreed to focus promotion efforts on encouraging CAT tool vendors to implement a feature that allows import of files that comply with the version 3 (2019) TBX-Basic dialect, which can be found at https://ltac-global.github.io/TBX-Basic_dialect/ (always current).

Focus on just this one dialect should promote real interoperability. LTAC and the BYU TRG have been working together on a “Steamroller” application that will manipulate version 3 TBX files from dialects bigger than Basic into compliance with the Basic dialect. LTAC is also working with IATE to provide a version 3 TBX-Basic export feature. While there are a number of private dialects that are “bigger” than TBX-Basic, TBX-Basic should provide enough power in terms of external interoperability. On the other hand, TBX-Basic is bigger and more powerful than the other two public dialects, TBX-Core and TBX-Min. Serious L10n users (mature buyers and service providers) are well advised not to use dialects smaller than TBX-Basic.

The most important data model change that happened in ISO 30042:2019 was making the inline data model compliant with the XLIFF 2 inline data model. This was implemented by the LTAC/TerminOrgs TBX Steering Committee and TC 37/SC 3 with input from OASIS XLIFF and XLIFF OMOS TCs. The modular public dialects (TBX-Core, TBX-Min, and TBX-Basic) compatible with the second ISO edition (ISO 30042:2019), including the new inline data model, were released during 2018, as the CD and DIS versions of the second ISO edition stabilized towards FDIS. No material changes are possible after FDIS, so technical compatibility with the future standard was guaranteed after summer 2018.

The second ISO edition of TBX accumulated quite a number of breaking changes. But most of those changes are dealing with modernization of the XML tooling. Apart from the inline data model change and related introduction of explicit directionality support, it is important to stress that TBX joined other modern-day standards in leaving its former monolithic design for a new modular design.

This edition of TBX specifies a nonnegotiable core and prescribes that all compliant dialects must include that core. The standard then specifies a modularity mechanism. Outside of ISO, LTAC made sure that the most common dialects are extended from the core using a telescoping principle. The simplest possible dialect is TBX-Core; TBX-Min is composed of the Core and Min modules. TBX-Basic is TBX-Min plus the Basic module, and TBX-Linguist (considered a private dialect to allow flexibility for academic users) is TBX-Basic plus the Linguist module. The stakeholders have high hopes that this will vastly improve the so-called “blind” or plug and play interoperability. The second ISO TBX edition also defines TBX agents and provides a compliance clause targeting document compliance, as well as specific agents’ compliance. Notably, this edition introduces namespace-based modules, albeit only in the DCT (data category as tag name) style. This is a compromise that should not disturb those who prefer the DCA (data category as attribute name) style: implementers who are wary of implementing namespaces can stick to the old DCA style. Helpfully, both DCA and DCT are extended from the same core, and modules specified in either DCA or DCT are semantically equivalent (based on the same sets of additional data categories). DCT is easier to validate, and it is also easier to filter out unsupported modules in the DCT style, as those are in different namespaces.
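
The telescoping principle can be pictured as nested module lists. The sketch below only models the relationships stated above (each public dialect adds one module to the previous one). The function names are hypothetical, and the last helper merely illustrates what a down-conversion tool would have to deal with; it says nothing about how the “Steamroller” application actually works.

```python
# Module order taken from the telescoping description above.
TELESCOPE = ["Core", "Min", "Basic", "Linguist"]

def dialect_modules(dialect):
    """Modules that make up a dialect such as 'TBX-Basic'."""
    module = dialect.split("-", 1)[1]
    return TELESCOPE[: TELESCOPE.index(module) + 1]

def is_subset_dialect(smaller, bigger):
    """True if every module of `smaller` is also part of `bigger`."""
    return set(dialect_modules(smaller)) <= set(dialect_modules(bigger))

def modules_to_strip(source_dialect, target_dialect):
    """Modules a converter would have to remove or map when downsizing."""
    keep = set(dialect_modules(target_dialect))
    return [m for m in dialect_modules(source_dialect) if m not in keep]

print(dialect_modules("TBX-Basic"))
print(is_subset_dialect("TBX-Min", "TBX-Basic"))
print(modules_to_strip("TBX-Linguist", "TBX-Basic"))
```

Because every smaller dialect is contained in the bigger ones, a tool that supports TBX-Basic can read TBX-Core and TBX-Min files without any conversion.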

GALA and ttt.org still host the latest LISA OSCAR version including the legacy TBX-Basic. TBX-Basic 3.1 (the version maintained by LTAC/TerminOrgs, last updated September 12, 2014) is the latest version of TBX-Basic compatible with ISO 30042:2008.

Although ISO 30042:2008 (informally TBX 2) was superseded by ISO 30042:2019 (informally TBX 3), this TBX 2 based Basic dialect version is still widely used. It is somewhat confusing that this legacy version of the Basic dialect is versioned as 3.1, although it is not forwards compatible with the current TBX 3 version, which is ISO 30042:2019.

ISO 30042 will undergo another systematic review at ISO by 2024, but no plans regarding that have been formulated yet. As ISO 30042:2019 is fairly fresh, most of the current work concentrates on evangelizing the advantages of the new modernized and modular design to the implementers. ISO TC 37/SC 3 launched a couple of companion projects to ISO 30042 to provide additional information and guidance. These projects are ISO WD TR 24633 Building an RNG schema for TBX Core and ISO WD TS 24634 Management of terminology resources — Representation of concept relations and subject fields in TBX.

Work in liaison with the OASIS XLIFF TC and later with OASIS XLIFF OMOS TC resulted in development and testing of a TBX-Basic to XLIFF 2 with glossary module mapping. The mapping upgraded to the modular second ISO edition based on TBX-Basic — semantically equivalent with the old TBX-Basic — is now being specified on the standards track within OASIS XLIFF OMOS TC. See the latest editor’s draft at https://tools.oasis-open.org/version-control/browse/wsvn/xliff-omos/trunk/XLIFF-TBX/xliff-tbx-v1.0.pdf.

Translation API Cases and Classes (TAPICC)

Short Description: The first TAPICC symposium, on October 26, 2016, brought a four-way categorization of use cases that were considered in scope of the Translation API (TAPI) standardization: 1) Exchange of a payload blackbox accompanied with rich business or project metadata. In this area, TAPICC wants to abstract common business metadata used in other initiatives past and current including Linport, TIPP, COTI, XLIFF 1.2 project group, OASIS Translation Services TC (closed) and so on. 2) Bidirectional real-time exchange of XLIFF unit data in arbitrary serializations. 3) Enriching of XLIFF units in arbitrary serializations with terminology, translation suggestions, text analysis, process, quality assurance metadata and so on. 4) Exchange of data for layout representation purposes during the bitext management process.

Scenarios 2) and 3) were deemed to be dependent on work in progress within OASIS XLIFF OMOS TC and hence were put on the back burner at TAPICC until XLIFF OMOS TC produces the required abstract data models as well as JSON (JLIFF) and other non-XML serializations. Although scenario 4) was deemed in scope of the TAPICC project, neither the Globalization and Localization Association (GALA) nor XLIFF OMOS TC has resources to tackle the scenario at the moment; volunteers to engage in the API specification for this scenario are therefore sought. The groundwork for this has been done at OASIS XLIFF TC, and the provisions of the XLIFF 2 resource data module can be used to address this. Work on non-XML payload exchange would have a dependency on the XLIFF OMOS work in progress for scenarios 2) and 3).

Owner: GALA TAPICC Group.

IPR Mode: The steering committee originally targeted Non-Assertion (RF) but instead opted to be organized as an open source project licensed under the BSD 3 Clause license for code, and Creative Commons 2.0 BY for documentation. GALA decided not to mimic an SDO IPR mode but to donate stable deliverables to OASIS XLIFF or OASIS XLIFF OMOS technical committees, where these would be maintained under RF or Non-Assertion IPR modes respectively.

Current version and work in progress: XLIFF 2 Extraction and Merging Best Practice (XLIFF EMBP) is in version 1.0. https://galaglobal.github.io/TAPICC/T1/WG3/XLIFF-EM-BP-V1.0-LP.xhtml returns the latest published GALA TAPICC version.

Version 1.0 of the XLIFF EMBP was contributed by GALA to the OASIS XLIFF TC for publication as a TC note.

Several other deliverables (including the asynchronous API bringing together inputs from all T1 TAPICC groups) have not yet reached stability but have gone through a number of public consultations via GALA webinars and so on. Further progress stalled when GALA was unable to hold the planned spring 2020 TAPICC face-to-face meeting as a GALA 2020 pre-conference event in San Diego, California.

TAPICC is in a staged development phase, where Track 1, chartered on February 8, 2017, has four active working groups. T1/WG1 was chartered to develop a consensual set of business metadata that will enable efficient payload exchange. T1/WG2 was chartered to specify what kind of payload will be allowed to be exchanged via the TAPICC API, and how to resolve potential conflicts between payload-encoded metadata and the metadata imposed at the business level. T1/WG3 was chartered to work on best practices of creating and consuming XLIFF 2, including documenting public libraries and services. T1/WG4 was chartered to design an actual RESTful API that will work with data models specified by the other T1 working groups, most importantly T1/WG1.

The joint purpose of the T1 groups is to enable asynchronous API exchange within the complex localization supply chain, while relying on an XLIFF 2 based canonical data model.

In October and November 2018, TAPICC chartered T2/WG1 and launched a call for volunteers, which was widely publicized at tcworld 2018 in Stuttgart on November 14, 2018. T2/WG1, JLIFF REST API for Real Time Exchange of Bitext Units (JRARTEBU), has currently been assigned one Work Package: WP1 Unit Exchange. Batch or buffered exchange is out of scope of WP1, although not out of scope of Track 2 or T2/WG1. SOAP and any other protocols or bindings are out of scope of T2/WG1, although not out of scope for TAPICC Track 2. TAPICC Track 2 is going to reuse the OASIS XLIFF OMOS TC specified JLIFF format, the JSON serialization of the XLIFF 2 based object model. This synchronicity of data models will ensure semantic and behavioral interoperability between the asynchronous T1 and real-time transactional T2 bitext interchange.

Translation Memory eXchange (TMX)

Short Description: TMX has been arguably the most important and most widely implemented localization standard format. TMX is a simple XML vocabulary that was designed to provide lossless translation memory (TM) exchange. However, several obstacles prevented TMX from reaching the set goal. Level 1 implementations are too low a common denominator to actually secure lossless interoperability, because of segmentation differences (that should in theory be addressed by SRX) and because of absence of inline markup on Level 1. Level 2 stipulates lossless exchange of native inline codes that are however ignored by many tools and encoded as abstract placeholders. TMX is now far behind industry developments, but it will continue to be important for some time as a legacy format, mainly for collecting MT training corpora from legacy tools and repositories.
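
The difference between the two levels can be made concrete with a small sketch that reads a TMX 1.4 style document and reduces its segments to plain text, which is roughly what is left when inline markup is not exchanged. The element names (`tu`, `tuv`, `seg`, `bpt`, `ept`) are those of TMX 1.4; the sample content, tool name and function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample with Level 2 style inline codes (bpt/ept pairs).
TMX_SAMPLE = """
<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="html"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Press <bpt i="1">&lt;b&gt;</bpt>Start<ept i="1">&lt;/b&gt;</ept>.</seg></tuv>
      <tuv xml:lang="de"><seg><bpt i="1">&lt;b&gt;</bpt>Start<ept i="1">&lt;/b&gt;</ept> drücken.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

def plain_text(seg):
    """Text of a segment with the native inline codes dropped.

    Simplification: the content of every inline element is discarded, which
    is right for bpt/ept/ph/it but would also lose text wrapped in <hi>.
    """
    parts = [seg.text or ""]
    for inline in seg:
        parts.append(inline.tail or "")  # keep only the text after each code
    return "".join(parts)

def read_units(tmx_text):
    """Return one {language: plain text} dictionary per translation unit."""
    units = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        units.append({tuv.get(XML_LANG): plain_text(tuv.find("seg"))
                      for tuv in tu.iter("tuv")})
    return units

print(read_units(TMX_SAMPLE))
```

The plain-text view is sufficient for collecting MT training corpora, but it loses the position of the bold markup, which is exactly the information a lossless exchange would need to preserve.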

Owner: ETSI ISG LIS.

IPR Mode: RF in the LISA published versions, and FRAND in the ETSI ISG LIS version.

Current version and work in progress: There is no current work in progress, as the ETSI ISG LIS group was closed. Unfortunately, OASIS and ETSI were not able to agree on transferring the TMX IP to the OASIS XLIFF OMOS TC that had been chartered to take over the ownership of TMX.

1.4b is the current version with LISA numbering. It is also referred to as ETSI GS LIS 002 V1.4.2 (2013-02). Based on the published executive summary from November 2011, ETSI ISG LIS wished to coordinate TMX 2.0 development with the XLIFF 2.0 definition of inline markup codes. During the existence of ETSI ISG LIS there was no technical development on the TMX front. The group only republished the latest LISA version on the ETSI Group Specification template. At FEISGILTT in June 2013 in London, the ETSI ISG LIS chair held a public consultation on a possible related work item that ETSI ISG LIS wished to undertake: the standardization of fuzzy matching calculations for text segments. At the time of ETSI closing ISG LIS (September 2015), there were informal talks about OASIS XLIFF OMOS TC taking over the TMX ownership, maintenance and development. The XLIFF OMOS TC was chartered and scoped to be able to take over TMX ownership. However, formal IP transfer between ETSI and OASIS has not been negotiated to the present day. Because of the legal matters being stuck, delegates of the June 2016 FEISGILTT and XLIFF Symposium discussed developing an XLIFF 2 profile with mandatory usage of the Translation Candidates module as a maintainable replacement of the TMX functionality.

In 2017, Andrzej Zydroń proposed developing a note within OASIS XLIFF TC that would describe how to use an XLIFF 2 profile instead of the obsolete and unmaintained TMX format. However, this work hasn’t progressed due to lack of interest and technical consensus.

Unicode Bidirectional Algorithm (UAX #9)

Short Description: Default text flow of Arabic and Hebrew scripts is right to left. However, text written in these scripts often contains portions with left-to-right directionality, such as names of companies or products. That is why such text is called bidirectional (bidi). Many characters have strong directionality properties, but there are also characters with weak directionality behavior and neutral characters whose directionality depends on context. In practice, normally invisible control characters (markers) need to be used in order to encode bidi in plain text. Simply put, UAX #9 is a detailed normative account of Unicode bidi behavior (mainly) in plain text.
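
The directionality properties mentioned above can be inspected with Python's standard `unicodedata` module, which exposes the Bidi_Class property of each character. The grouping into strong, weak, neutral and explicit formatting types follows the classification used in UAX #9; the helper name `bidi_kind` is just an illustration.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def bidi_kind(ch):
    """Return (Bidi_Class, coarse category) for a single character."""
    cls = unicodedata.bidirectional(ch)
    for label, members in (("strong", STRONG), ("weak", WEAK),
                           ("neutral", NEUTRAL), ("explicit formatting", EXPLICIT)):
        if cls in members:
            return cls, label
    return cls, "unassigned"

samples = {
    "Latin letter a": "a",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter alef": "\u0627",
    "European digit 7": "7",
    "space": " ",
    "RIGHT-TO-LEFT MARK": "\u200f",
    "RIGHT-TO-LEFT EMBEDDING": "\u202b",
}
for name, ch in samples.items():
    print(name, bidi_kind(ch))
```

The invisible RIGHT-TO-LEFT MARK behaves like a strong right-to-left character, which is why it can be used in plain text to pull neighboring neutral characters into the intended direction.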

In theory, the characters that the Unicode Bidirectional Algorithm makes use of to explicitly set text flow direction should not be used within markup context. Instead, the bidi flow control characters should be replaced with appropriate markup controlling the text flow. In practice, many tools ignore directionality markup and apply UAX #9 in full (including the control characters) even in structured and markup environments. This may be due to the fact that UAX #9 has a long tradition (since Unicode 2 in 1996). There is also a standardization gap, as Unicode control and stateful characters are not provided with clear processing requirements to be applied on entering markup environments. UTR #20 unfortunately provides only abstract guidance rather than clear and unambiguous processing requirements. While XLIFF 2.0 has its own directionality attributes, it does not have attributes corresponding to inline bidi overrides or embeddings, so these are allowed as UAX #9 control characters if needed.

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Revision 42 was released for Unicode 13.0 on February 2, 2020. UAX #9 is being constantly revised to be up to date with the current Unicode release, and www.unicode.org/reports/tr9 links to the current official version. Additionally, the link www.unicode.org/reports/tr9/proposed.html always points to the latest proposed version if such a proposal exists. Major changes happened between Revision 27 and Revision 29. Revision 29 sent ripples that profoundly influenced handling of bidi text in the world of markup languages. It led to the introduction of new direction handling elements and attributes in HTML and XML vocabularies. Because directionality handling in HTML and several XML vocabularies was in flux in 2013, ITS 2.0 does not contain normative provisions for directionality handling. XLIFF 2 does contain up-to-date directionality markup and valid guidance on how to combine it with directionality control characters, if necessary when modifying segmentation, in compliance with Revision 35 of UAX #9.

Unicode in XML and other Markup Languages (UTR #20) [Withdrawn]

Short Description: Unicode Technical Reports (UTR) are persistent names that always point to the current revision. Because its main target is plain text, Unicode contains many control, formatting and other characters. This document gives a normative overview and general guidelines of which characters should and should not be used in markup context. In general, any Unicode character that is XML illegal or would require additional metadata for interpretation should come with a markup handling/replacement recommendation, or processing requirement. Authoring tools, XML editors and browsers are generally encouraged to ignore inappropriate or deprecated Unicode characters, so their preservation on crossing of the plain text/markup boundary will often lead to harmful loss of data or metadata. In general, plain text is linear and requires special control characters or specific application behavior to encode metadata and/or styling information that can be handled with structured markup in XML or HTML environments.

Owner: W3C (Internationalization Core Working Group). UTR #20 was withdrawn at Unicode in March 2016, and maintenance of the document transferred to W3C.

IPR Mode: W3C (RF).

Current version and work in progress: Now withdrawn both at Unicode Consortium and W3C. Last released as a W3C Working Group Note on July 13, 2017. Albeit dated, this is historically an influential document.

No current work is in progress, although W3C still collects GitHub issues against the latest editor’s draft http://w3c.github.io/unicode-xml/, currently dated July 8, 2017, and identical with the last published version. This important note deserves a refresh to keep up with the rapid pace of new Unicode releases.

Unicode Locale Data Markup Language (LDML or UTS #35)

Short Description: This specifies an XML vocabulary for encoding locale specific generic data categories — dates, amounts, decimals, units of measure, currency symbols and so on. Its main purpose is to enable the creation and maintenance of CLDR but is also used directly in programming frameworks such as .NET.

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Version 37, Revision 59, was released on April 15, 2020. Version 38, Revision 60 is proposed here: https://unicode-org.github.io/cldr/ldml/tr35.html.

Unicode Regular Expressions (UTS #18)

Short Description: Unicode Technical Standards (UTS) are persistent names that always point to the current revision. UTS #18 gives general guidelines for regular expression engines on how to comply with the Unicode Standard. Three levels are specified, of which two are default (one the minimum feasible for programmers, the other more end-user friendly) and the finest is language specific. Algorithms conformant to this specification will give different results for different versions of Unicode.

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Version 21 was released on June 17, 2020. No new revisions have been proposed at the time of writing.

The URL www.unicode.org/reports/tr18 always points to the current official version. Once a new draft is proposed, www.unicode.org/reports/tr18/proposed.html will link to it.

Unicode Standard

Short Description: Unicode is the core standard that allows humanity to encode all written human languages for computer use. Well over a hundred thousand characters covering alphabetic, syllabic and ideographic scripts and more are ordered in planes along with punctuation, control and private use characters. The Unicode standard has been published since October 1991.
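
The ordering in planes can be shown with a few lines. Each plane holds 65,536 code points, so the plane number is simply the code point divided by 0x10000. The helper name is hypothetical, and only the named planes are listed in the lookup table.

```python
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    3: "Tertiary Ideographic Plane",
    14: "Supplementary Special-purpose Plane",
    15: "Supplementary Private Use Area-A",
    16: "Supplementary Private Use Area-B",
}

def plane_of(ch):
    """Plane number (0-16) of a single character."""
    return ord(ch) >> 16  # 0x10000 code points per plane

for ch in ("A", "\u05d0", "\U0001F600", "\U00020000"):
    plane = plane_of(ch)
    print("U+%04X" % ord(ch), plane, PLANE_NAMES.get(plane, "unassigned plane"))
```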

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Unicode 13.0 was published on March 10, 2020. As of Unicode 7.0, there is an aggressive yearly schedule for major releases. Unicode 8.0 appeared in June 2015, Unicode 9.0 was released June 2016, and so on. Unicode 14 is on track for the first half of 2021.

Unicode 13 added four new scripts (for a total of 154) and 55 new emoji characters. With 5,930 characters added in this release, the new total of Unicode characters stands at 143,859. Character groups and whole new scripts continue to be added as per worldwide communities’ requirements. A relatively heavy editorial reshuffle in 7.0 enabled the rapid cycle for major releases. The Unicode core had to adapt to the significant changes introduced in Revision 29 of the Unicode Bidirectional Algorithm (See UAX #9). Unicode 11 brought updates to UTS #10 Unicode Collation Algorithm, UTS #39 Unicode Security Mechanisms, UTS #46 Unicode IDNA Compatibility Processing, and UTS #51 Unicode Emoji.

Each major release of the Unicode standard triggers synchronization of four standalone Unicode technical standards: UTS #10 (Unicode Collation Algorithm), UTS #39 (Unicode Security Mechanisms), UTS #46 (Unicode IDNA Compatibility Processing, which is compatible processing of non-ASCII URLs), and UTS #51 (Unicode Emoji).

Unicode Text Segmentation (UAX #29)

Short Description: Unicode Standard Annexes (UAX) are persistent names that always point to the current revision. UAX #29 is the key normative source of segmentation rules. Apart from sentence boundaries, which are most relevant for the interoperability of computer-aided translation tools, it defines the more basic grapheme cluster and word boundaries. The segmentation rules are given in more or less natural language as an inductive succession of rules. The specification itself states that the same set of rules can be given using regular expressions. Unfortunately, no finite set of regular expression-based rules can ensure 100% successful sentence segmentation of English text, the main reason being the semantic ambiguity of the full stop. Apart from closing sentences, the same character is used to close abbreviations, as a decimal point and so on. Interestingly, in Hebrew this problem virtually does not exist, as Hebrew does not overload the full stop with the abbreviation function.

Although UAX #29 cannot possibly achieve completeness, it is still beneficial to implement it as the basic set of rules, and apply more fine-grained exception rules on top of it. The ULI TC does not plan to influence the default UAX #29 segmentation behavior and releases the locale-specific segmentation exceptions as part of CLDR.
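
The layering described above, a baseline rule set with exceptions applied on top, can be sketched in Python. This is only an illustration under stated assumptions: the regular expression is a deliberately naive stand-in for the UAX #29 sentence rules, not an implementation of them, and the abbreviation set is a hypothetical exception list, not CLDR data.

```python
import re

# Naive baseline: a sentence ends at ., ! or ? followed by whitespace.
# This is a stand-in for the UAX #29 sentence rules, not an implementation of them.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# Hypothetical exception list; real locale-specific exceptions are released with CLDR.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "approx."}

def split_baseline(text: str) -> list[str]:
    """Split on every boundary candidate, with no exception handling."""
    return [s for s in BOUNDARY.split(text.strip()) if s]

def split_with_exceptions(text: str) -> list[str]:
    """Re-join baseline pieces whose break follows a known abbreviation."""
    sentences: list[str] = []
    carry = ""
    for piece in split_baseline(text):
        candidate = f"{carry} {piece}" if carry else piece
        # A piece ending in a listed abbreviation was split spuriously: keep accumulating.
        if candidate.split()[-1] in ABBREVIATIONS:
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

text = "Dr. Smith paid approx. 3.5 euros. She left."
print(split_baseline(text))
# ['Dr.', 'Smith paid approx.', '3.5 euros.', 'She left.']
print(split_with_exceptions(text))
# ['Dr. Smith paid approx. 3.5 euros.', 'She left.']
```

The exception list repairs the spurious breaks, but it would also suppress a genuine break whenever a listed abbreviation happens to end a sentence, which is the residual ambiguity of the full stop described above.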

Owner: Unicode Consortium.

IPR Mode: RAND.

Current version and work in progress: Revision 33 was released for Unicode 11.0 on May 22, 2018, and www.unicode.org/reports/tr29 always links to the current official version. Additionally, www.unicode.org/reports/tr29/proposed.html always points to the latest proposed version, as long as one exists. Revision 34 for Unicode 12.0 had not been proposed at the time of writing.

ULI exceptions use the UAX #29 behavior as a baseline. UAX #29 contains an informative pointer to the CLDR-released segmentation exceptions developed by the ULI TC, which can be used to modify the locale-independent baseline segmentation behavior described in UAX #29.

Universal Terminology Exchange (UTX)

Short Description: UTX is a simple bilingual glossary format that originally targeted MT training. It is a simple tab-delimited format, and an XML version also exists, prompted by criticism from the localization standardization community. UTX embedding has been specified for XLIFF:doc. TBX-Min mapping was proposed by LTAC Global. Although rather minimal, this tiny terminology exchange standard specifies the part-of-speech field as mandatory and also provides an optional field for term status tracking with predefined values.
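
To make the shape of such a glossary concrete, here is a minimal Python sketch that reads a tab-delimited glossary with a mandatory part-of-speech column and an optional term status column. The sample layout is only loosely modeled on UTX: the header line, the column names and the status value are assumptions for illustration and should be checked against the UTX specification before use.

```python
import csv

# Simplified sample loosely modeled on UTX; header and column names are assumptions.
SAMPLE = (
    "#src\ttgt\tsrc:pos\tterm status\n"
    "translation memory\tmémoire de traduction\tnoun\tapproved\n"
    "localize\tlocaliser\tverb\t\n"
)

def read_glossary(text: str) -> list[dict[str, str]]:
    """Parse the tab-delimited sample into one dictionary per glossary entry."""
    lines = text.splitlines()
    # The first line names the columns; strip the leading comment marker.
    columns = lines[0].lstrip("#").split("\t")
    reader = csv.reader(lines[1:], delimiter="\t", quoting=csv.QUOTE_NONE)
    entries = []
    for row in reader:
        entry = dict(zip(columns, row))
        # Part of speech is treated as mandatory, term status as optional.
        if not entry.get("src:pos"):
            raise ValueError(f"missing part of speech for {entry.get('src')!r}")
        entries.append(entry)
    return entries

for entry in read_glossary(SAMPLE):
    print(entry)
```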

Owner: The Asia-Pacific Association for Machine Translation (AAMT). Although the AAMT is not a traditional standardization body, their standardization working groups own and maintain UTX.

IPR Mode: The documentation is available under Creative Commons 4.0. No IPR mode has been specified.

Current version and work in progress: UTX 1.2 Minimal Specification was released February 18, 2018. There are some header changes, and the 1.2 version supports multiple sub-glossaries as part of one glossary. Editing in spreadsheet software became easier. As for work in progress, the latest development was the TBX-Min mapping by LTAC Global (see TBX).

XLIFF Object Model (XLIFF OM)

Short Description: An abstract object model defined in UML diagrams and a prose specification. It defines an XLIFF-equivalent data model, but also interoperable file, subfile and unit fragments. This requires a detailed specification of data integrity dependencies that was not needed in XLIFF 2, which is always exchanged as a complete XML file.

The main goal is to be able to transfer the XLIFF inline data model into arbitrary XML and non-XML serializations. As such this is groundwork on an interchange data model that is needed to ground any messaging, bus or webservices architecture or a standard Translation API (see TAPICC).
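
As a rough illustration of what a non-XML serialization of a unit fragment could look like, the Python sketch below reads a tiny XLIFF 2.0 document and emits its units as JSON. The JSON key names (kind, id, segments, source, target) are hypothetical and are not taken from any published serialization. The sketch also handles plain text only, so it ignores inline markup, which is the hard part of the actual work.

```python
import json
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en" trgLang="fr">
  <file id="f1">
    <unit id="u1">
      <segment>
        <source>Hello world</source>
        <target>Bonjour le monde</target>
      </segment>
    </unit>
  </file>
</xliff>"""

def unit_to_dict(unit: ET.Element) -> dict:
    """Map one <unit> to a plain dictionary; the key names are hypothetical."""
    segments = []
    for seg in unit.findall("x:segment", NS):
        segments.append({
            "source": seg.findtext("x:source", default="", namespaces=NS),
            "target": seg.findtext("x:target", default="", namespaces=NS),
        })
    return {"kind": "unit", "id": unit.get("id"), "segments": segments}

root = ET.fromstring(XLIFF)
units = [unit_to_dict(u) for u in root.iterfind(".//x:unit", NS)]
print(json.dumps(units, ensure_ascii=False, indent=2))
```

Once a unit travels on its own like this, information that the complete XML file supplied implicitly, such as the source and target languages declared on the root element, has to be carried or referenced explicitly. That is one example of the data integrity dependencies mentioned above.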

Owner: OASIS XLIFF OMOS TC.

IPR Mode: Non-Assertion (RF).

Current work in progress: XLIFF OMOS TC was chartered in December 2015 to address the industry need for an abstract object model for the XLIFF 2 family of standards. The objective is to provide a serialization-independent description of XLIFF and XLIFF fragment data categories that would allow development of non-XML serializations of the same data model. The goal is to make sure that arbitrary serializations of localization interchange data will be interoperable with XLIFF 2 and with each other. The object model is being developed on a GitHub repository at https://github.com/oasis-tcs/xliff-omos-om. The repository is public, but contributors need to be or become OASIS XLIFF OMOS TC members. Anyone can raise issues or comments via the associated issue tracker or wiki, and can submit minor bug fixes via pull requests. It is up to the repository maintainers whether those pull requests are merged, which is the usual GitHub collaboration workflow. UML diagram development on that repository is powered by the open source Papyrus project.

XML-based Text Memory (xml:tm)

Short Description: xml:tm is a namespace application, which means it is not designed to form an independent document that could exist on its own. Instead, it is designed to be injected as a relatively heavy explicit internationalization apparatus into any well-formed XML document containing human readable language. Unfortunately, it is hardly possible to call this specification a standard due to a very low number of implementations (two, to be exact). The standard is being pushed by only one company without wider industry consensus. It was developed by XTM’s Andrzej Zydroń and donated to LISA, which published it as an OSCAR standard in early 2007. Its failure to become an actual standard should serve as a reminder of the importance of broad consensus building when creating industry standards.

Owner: ETSI ISG LIS.

IPR Mode: RF in LISA published versions and FRAND.

Current version and work in progress: Zydroń has exposed the 2.0 version for public comment on XTM International’s web page. Its ETSI status is unclear from publicly available sources. The last version by LISA, 1.0, was released on February 26, 2007.

XML Localization Interchange File Format (XLIFF)

Short Description: In February 2018, XLIFF TC delivered the first “dot” release of XLIFF 2, which is backwards compatible with XLIFF 2.0. XLIFF 2.1 provides full ITS 2.0 support and advanced validation support. The 2.1 version also deprecated the Change Tracking Module and provided a number of bug fixes in both its core and modules. Interestingly and usefully, XLIFF 2.0 documents are forward-compatible with the XLIFF 2.1 advanced validation artifacts.

The XLIFF 2.0 core covers about 20% of XLIFF 1.2, because in 1.2 everything was core and hence the core was too big to be implemented across the industry.

XLIFF 2.0 and 2.1 were created and approved by a representative group of industry and academic standardizers: big enterprise translation buyers (such as IBM, Microsoft and Oracle), large language service providers (such as SDL and Lionbridge), toolmakers (such as MultiCorpora and ENLASO), industry associations (PSBT, GALA, TAUS), academics (LRC at the University of Limerick) and individuals (notably chair Bryan Schnabel). Many more took interest in the final OASIS organizational approval rounds and during the extensive public reviews.

XLIFF 2 is a lean modular standard that allows for plug and play interoperability in the areas of inline markup, segmentation, glossary exchange, translation and review, engineering quality assurance and so on, effectively catering for an end-to-end mashup of best-of-breed solutions throughout the content value chain.

The translation candidates module allows for local inclusion of translation candidates coming from various sources including translation memories, MT services and various crowdsourcing scenarios. The glossary module allows for local inclusion of relevant terminology in both source and target languages. It supports both monolingual and bilingual scenarios, including terminology life cycle management scenarios, and one-to-one mapping with TBX-Basic and UTX is possible. The formatting style module provides two attributes for embedding HTML-encoded preview information that can be used by agents for on-the-fly preview generation to provide context for human translators. The metadata module is a non-namespaced private extensibility mechanism. Custom metadata can be grouped and presented as key-value pairs, which is a predictable way of facilitating interoperable display. The resource data module replaces the binary module capability from 1.2. The change tracking module allows users to log change history, including the provenance of changes. The size and length restriction module provides a generalized way of specifying text size and volume restrictions. This module can cater to simple use cases, such as fitting a database field based on Unicode code point or byte count, and it also caters to complex scenarios, such as fitting specific hardware display restrictions using a specific font. The validation module provides a mechanism for specifying simple localization rules to be checked against source and target, for example to ensure that a brand name is included in the target text.
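
The two simplest checks named above, a length restriction counted in code points or bytes and a rule that a brand name must appear in the target, can be sketched in Python as follows. The function names, the unit labels and the brand name Acme Cloud are hypothetical and do not reproduce the attribute names or profiles defined by the XLIFF modules.

```python
def fits(text: str, limit: int, unit: str = "codepoints") -> bool:
    """Check a length restriction counted in code points or in UTF-8 bytes."""
    if unit == "codepoints":
        size = len(text)  # Python strings are sequences of code points
    elif unit == "utf8-bytes":
        size = len(text.encode("utf-8"))
    else:
        raise ValueError(f"unknown unit: {unit}")
    return size <= limit

def must_contain(target: str, required: str) -> bool:
    """Simple validation rule: the required string must appear in the target."""
    return required in target

# 21 code points, but 22 bytes in UTF-8 because U+00D6 takes two bytes.
target = "\u00D6ffnen Sie Acme Cloud"
print(fits(target, 21))                 # True
print(fits(target, 21, "utf8-bytes"))   # False
print(must_contain(target, "Acme Cloud"))  # True
```

The same string passes or fails depending on the counting unit, which is why a restriction has to state how size is measured rather than give a bare number.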

Owner: OASIS XLIFF TC.

IPR Mode: RF on RAND terms in OASIS (freely available). RAND in ISO (sold).

Current version and work in progress: Version 2.1 was published as an OASIS standard on February 13, 2018.

ISO 21720:2017 – XLIFF (XML Localization interchange file format) is identical to the OASIS XLIFF Version 2.0 from August 5, 2014.

After successfully delivering the ITS Module and advanced validation capabilities as part of the 2.1 version, the TC started publicly tracking feature development for Version 2.2. Development of XLIFF Version 2.2 has been transferred to GitHub at https://github.com/oasis-tcs/xliff-xliff-22.

The biggest and most interesting feature proposed for Version 2.2 is Rendering Requirements. A detailed proposal of XLIFF Rendering Requirements was presented at the 40th edition of the ASLING Translating and the Computer Conference on November 16, 2018. The idea is to give possibly normative guidance to CAT tool developers on how to display the XLIFF bitext and its rich metadata, but also to lower the technology complexity of rendering bitext and make interactive rendering of XLIFF possible in web browsers. Other proposed features include specification of semantic domains for placeholders, and roundtrip of segmentation metadata.

The XLIFF TC is also wrestling with the question of how to use XLIFF in place of TMX, since TMX is in legal limbo and cannot be maintained or further developed. Three technical options are under consideration: 1) informatively describe an XLIFF 2.0 profile that uses only XLIFF 2 Core and the Translation Candidates Module; 2) describe how to use XLIFF Core storage as a translation memory (TM); 3) develop a new grammar that would reuse the XLIFF 2.0 vocabularies but define a different structure that would be more suitable for bulk exchange of TMs, including in multilingual scenarios.
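
To give a feel for the first two options, the Python sketch below harvests source and target pairs from an XLIFF 2.0 document that uses only Core and the Translation Candidates Module. It is an illustration under assumptions, not a profile proposed by the TC: the sample document has not been validated against the XLIFF schemas, the helper names are hypothetical, and inline markup is flattened to plain text.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "urn:oasis:names:tc:xliff:document:2.0",
    "mtc": "urn:oasis:names:tc:xliff:matches:2.0",
}

# Hand-written sample for illustration; not validated against the XLIFF schemas.
XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0"
       version="2.0" srcLang="en" trgLang="de">
  <file id="f1">
    <unit id="u1">
      <mtc:matches>
        <mtc:match ref="#m1" similarity="85.0">
          <source>Save the file</source>
          <target>Datei speichern</target>
        </mtc:match>
      </mtc:matches>
      <segment>
        <source><mrk id="m1" type="mtc:match">Save the document</mrk></source>
        <target>Dokument speichern</target>
      </segment>
    </unit>
  </file>
</xliff>"""

def text_of(element) -> str:
    """Flatten an element to plain text, dropping any inline markup."""
    return "".join(element.itertext()) if element is not None else ""

def tm_pairs(xliff_text: str) -> list[tuple[str, str, str]]:
    """Collect (source, target, origin) triples from segments and match candidates."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for unit in root.iterfind(".//x:unit", NS):
        for seg in unit.findall("x:segment", NS):
            pairs.append((text_of(seg.find("x:source", NS)),
                          text_of(seg.find("x:target", NS)), "segment"))
        for match in unit.iterfind("mtc:matches/mtc:match", NS):
            pairs.append((text_of(match.find("x:source", NS)),
                          text_of(match.find("x:target", NS)), "candidate"))
    return pairs

for pair in tm_pairs(XLIFF):
    print(pair)
```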

The work on TBX Basic mapping both to and from XLIFF 2.0 (and higher) was transferred to the XLIFF OMOS TC.

The underlying research for this Reader was supported by the Science Foundation Ireland as part of the ADAPT Centre at Trinity College Dublin.
