GMX-V: Slaying the word count dragon

By Andrzej Zydroń June 29, 2014

One of the most enduring features of the localization industry has been the inconsistency of word counts, not only between rival products, but also sometimes between different versions of the same product. Trying to establish a measure for the size of a given localization task is not unlike trying to fight a many-headed dragon — with Asian languages that use different writing systems providing additional challenges.

The havoc that the lack of a uniform system of measurement can cause was exemplified in 1999 when the Mars Climate Orbiter Spacecraft was lost because one NASA team used imperial units for a key spacecraft operation, while another used metric units. The total cost of this error was $125 million. Trying to cope with a lack of a common definition for estimating the size of a localization task can be equally catastrophic.

This lack of a unified count resembles the state of general measurements before the French Revolution. A French foot (pied du roi, 12.79 inches) differed from an English foot (12 inches), as did the Welsh foot (9 inches). Certainly the French appendage was the larger. The basis of the current imperial linear measures in England was unified in 1308 by Edward I, who ordained (in a highly scientific manner for the fourteenth century) that an inch was to be three grains of barley, dry and round, taken from the middle of the ear, and that twelve inches were to make a foot. I have often suspected that many of the metrics produced by current computer-aided translation tools use similar formulas based on their output. It took the French Revolution to provide a (mostly) logical approach to establishing general units of measure based on a decimal scale, although somehow the ten-day week did not catch on.

Some ask why we don’t just use Microsoft Word as the basis for word and character counts. However, this approach is deeply flawed. Microsoft does not, to the best of my knowledge, publish the basis of its word counts, so there is no way of independently verifying them. Even within Word, the counts that are produced do not reflect the actual workload for the translation of a document: Word includes automatically generated text, such as tables of contents and indexes, in the word counts. It does not include header and footer text. Word also counts automatically generated numeric list items (such as 1, 2 and 3) as words.

Additionally, the basis of Microsoft Word counts has, in the past, changed between versions. This obviously results in a lack of consistency and continuity, even between the latest versions of Word.

The broader question becomes how you conduct a word count for non-Word documents, such as complex XML, HTML or FrameMaker documents. This leads to more questions: how do you count hyphenated words? How do you count aujourd’hui or quelqu’un m’a dit in French? And last but not least, consider how you would count <g id="g1">exa<x id="x1"/>mple</g> in Word.

Standards to the rescue

GMX-V (Global Information Management Metrics – Volume) is a European Telecommunications Standards Institute Localization Industry Standards (ETSI LIS) specification. Originally developed within LISA OSCAR, it has been incorporated along with the other LISA OSCAR standards within ETSI LIS, where it has been developed further. GMX-V version 2.0 was published in 2012 and includes factors for converting Chinese, Japanese, Korean and Thai character counts to word counts. Full details of GMX-V are available from ETSI LIS at http://goo.gl/U1IIdQ.

GMX was always intended to be a family of standards providing key metrics, such as P for percentage fuzzy match, C for complexity and Q for required quality. Using GMX-V/P/Q/C, you can quantify and automate the quoting for a localization task. GMX-V provides a clear and unambiguous way of counting as well as categorizing word and character counts for all languages and scripts. It also offers an XML vocabulary for exchanging localization metrics data between computer systems.

GMX-V addresses two very important issues: how you unambiguously and verifiably count words and characters for a given localization task, and how you exchange word and character counts in a uniform and rigorous form between systems. Interestingly, for a document containing only text, without any header, footer, table of contents and so on, GMX-V produces word counts that are not dissimilar to Microsoft Word, but it does so in a documented and verifiable form.

GMX-V mandates both word and character counts. Character counts convey the most precise definition of a translation task, whereas word counts are the most commonly used metric in the translation industry. GMX-V encompasses both measurements, giving translation suppliers and customers a choice as to which measurement most adequately reflects the translation task in question.

One of the main problems with calculating word and character counts is the plethora of differing proprietary file formats, which can contain a mix of form and content data. Trying to establish a standard that addresses all of these formats is impossible — the word count dragon has too many heads to attempt to cut them all off with one swipe. As soon as one head is cut off, a new one will appear somewhere else. A better approach is to force the dragon to enter a narrow passage where the heads are all forced together. Enter the XLIFF knight riding in on a charger called Unicode.

XLIFF (XML Localization Interchange File Format) is the OASIS standard for exchanging translatable data in an XML format. GMX-V relies on the XLIFF representation as the canonical form for word and character counts. GMX-V mandates that all characters are counted in their Unicode representation and that each run of multiple space characters is reduced to a single space. In addition, word boundaries are defined with reference to Unicode Technical Report 29 (Text Boundaries), also known as Unicode TR29. This provides an unambiguous definition of what constitutes a word. By using XLIFF as the canonical form for counting the source language text, GMX-V establishes a common and well-defined format for word and character counts. GMX-V uses the XLIFF source element for the canonical form. Example:

<source>An example of the canonical form of a text unit.</source>
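The space reduction and word segmentation described above can be sketched in Python. This is a minimal illustration, not the GMX-V reference algorithm: the regular expression is only a rough stand-in for Unicode TR29 word boundaries (a conforming implementation would use a TR29-aware library), and the function names, as well as the choice to report characters both with and without spaces, are assumptions made for this example.

```python
import re

# Rough approximation of a "word": a run of letters/digits, optionally
# continued across an internal apostrophe (so aujourd’hui stays one word).
# ASSUMPTION: this is NOT a full Unicode TR29 implementation.
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")


def normalize_spaces(text: str) -> str:
    """Reduce every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def count_text_unit(text: str) -> dict:
    """Return illustrative word and character counts for one text unit."""
    normalized = normalize_spaces(text)
    return {
        "words": len(WORD_RE.findall(normalized)),
        # ASSUMPTION: both figures are reported because this sketch does not
        # take a position on whether single spaces belong in the count.
        "characters_with_spaces": len(normalized),
        "characters_without_spaces": len(normalized.replace(" ", "")),
    }


if __name__ == "__main__":
    print(count_text_unit("An  example of the canonical   form of a text unit."))
```

Applied to the text of the source element above, this sketch finds 10 words, regardless of how many extra spaces separate them.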

Within XLIFF, inline codes are interpreted as inline XML elements. The inline elements are not included in the word and character counts, but form a separate inline element count of their own. The frequency of inline elements can have an impact on the translation workload, so a separate count is useful when sizing up a job. For the canonical form, only g (inline elements with content) and x (inline elements with no content) inline elements are used. Example:

<source>In this <g id="g1">example</g> the in-line codes do not feature in the word and character counts.</source>

<source>In this <g id="g1">exa<x id="x1"/>mple</g> the in-line codes do not feature in the word and character counts.</source>
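One way to separate the countable text from the inline codes is sketched below using Python's standard xml.etree.ElementTree module. The helper name split_text_and_inline is hypothetical, and the sketch assumes a single well-formed source element passed in as a string; it illustrates the idea rather than reproducing any GMX-V tool.

```python
import xml.etree.ElementTree as ET

INLINE_TAGS = {"g", "x"}  # the two inline elements used in the canonical form


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, which real XLIFF files usually carry."""
    return tag.rsplit("}", 1)[-1]


def split_text_and_inline(source_xml: str) -> tuple[str, int]:
    """Return (text with inline codes removed, number of inline elements)."""
    root = ET.fromstring(source_xml)
    # itertext() walks the element in document order and yields only text,
    # so "exa<x/>mple" comes back joined as "example".
    text = "".join(root.itertext())
    inline_count = sum(
        1 for el in root.iter() if local_name(el.tag) in INLINE_TAGS
    )
    return text, inline_count


if __name__ == "__main__":
    sample = (
        '<source>In this <g id="g1">exa<x id="x1"/>mple</g> the in-line '
        "codes do not feature in the word and character counts.</source>"
    )
    print(split_text_and_inline(sample))
```

For the second example above, the helper returns the sentence with "example" intact and an inline element count of 2; the word and character counts would then be taken from the returned text only.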

Standalone punctuation characters also feature as an additional category in both word and character counts. They are included in the main count, but can be deducted from both by mutual consent if they do not increase the translation workload.

GMX-V addresses all of the issues of how to count words and characters in the XLIFF canonical format. GMX-V proposes a sentence level of granularity for counting purposes within XLIFF. The sentence is the commonly accepted atomic unit for translation.

GMX-V does not preclude producing metrics directly from non-XLIFF format files, as long as the counting is based on the XLIFF canonical form for each text unit being counted. This conversion can be done on the fly. In these instances an audit file will be necessary for verification purposes.

The main goal of GMX-V is to provide a detailed count for words and characters based on the characteristics of individual sentences. The aim is to provide sufficient detail to enable an accurate definition of the scale of the translation task. The customer and supplier can then decide which of the statistics to use or not when costing the translation task for a given file.

Another aspect to the symbiotic relationship between GMX-V and XLIFF is that GMX-V counts can be embedded in an XLIFF document, thereby exchanging not only the text to be translated, but also the word and character counts.

As previously mentioned, version 2.0 of GMX-V added word count support for Japanese, Chinese, Korean and Thai. Word counts for these languages are based on factors applied to character counts. These factors are well established in the localization industry and have been used over many years. You divide the character counts by the following factor for each script to obtain the word counts:

Chinese (all forms): 2.8

Japanese: 3.0

Korean: 3.3

Thai: 6.0

For instance, if a Chinese document contains 13,456 characters (counted according to the GMX-V specification), dividing by 2.8 gives approximately 4,805.7, so the resulting count is 4,806 words.
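The conversion is simple arithmetic and can be sketched in a few lines of Python. The factors are the ones listed above; the language-code keys, the function name and the round-half-up rule are assumptions for illustration (the worked example is consistent with rounding to the nearest whole word, but the rounding rule is not stated here).

```python
from decimal import Decimal, ROUND_HALF_UP

# Divisors from the list above, keyed by language code (keys are an assumption).
CHARS_PER_WORD = {
    "zh": Decimal("2.8"),  # Chinese (all forms)
    "ja": Decimal("3.0"),  # Japanese
    "ko": Decimal("3.3"),  # Korean
    "th": Decimal("6.0"),  # Thai
}


def chars_to_words(char_count: int, lang: str) -> int:
    """Convert a character count to a word count using the script's factor."""
    quotient = Decimal(char_count) / CHARS_PER_WORD[lang]
    # ASSUMPTION: round to the nearest whole word, halves rounding up.
    return int(quotient.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    print(chars_to_words(13456, "zh"))  # 13,456 / 2.8 = 4,805.71..., prints 4806
```

Decimal arithmetic is used instead of floats so that the factors 2.8 and 3.3 are represented exactly.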

Quantitative and qualitative measurements

GMX-V counts fall into two categories — how many, and what type. The primary count will always be unqualified — how many characters and words there are in the file. This is the minimal conformance level proposed for GMX-V.

A typical translatable document will contain a variety of text elements. Some of these elements will contain non-translatable text, some will have been matched from translation memory and some will have been fuzzy matched by the customer. It is therefore important to be able to categorize the word and character counts according to type in order to provide a figure in words and characters for the specific localization task.

GMX-V recommends the following count categories:

Total Count: the overall count.

Exact Matched Count: this is an accumulation of the word and character count for text units that have been matched unambiguously with a prior translation and require no translator input.

Leveraged Matched Count: this is an accumulation of the word and character count for text units that have been matched against a leveraged translation memory database.

Fuzzy Matched Count: this is an accumulation of the word and character count for text units that have been fuzzy matched against a leveraged translation memory database.

Alphanumeric Only Text Unit Count: this is an accumulation of the word and character count for text units that have been identified as containing only alphanumeric words.

Numeric Only Text Unit Count: this is an accumulation of the word and character count for text units that have been identified as containing only numeric words.

Punctuation Only Text Unit Count: this is an accumulation of the word and character count for text units that have been identified as containing only punctuation.

Stand Alone Punctuation Count: this is an accumulation of the standalone punctuation word and character counts from the individual text units that make up the document.

Measurement Only Count: this is an accumulation of the word and character count from measurement only text units.

Other Non-Translatable Word Count: as the title suggests, this relates to other nontranslatable word and character counts.
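As a rough illustration of how these categories accumulate, the following Python sketch sums word and character counts per category while maintaining the overall total. The category keys, the Count class and the sample figures are all hypothetical; they are not the element names of the GMX-V XML vocabulary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Count:
    """Running word and character totals for one category."""
    words: int = 0
    characters: int = 0

    def add(self, words: int, characters: int) -> None:
        self.words += words
        self.characters += characters


def accumulate(units):
    """Sum counts per category.

    `units` is an iterable of (category, words, characters) tuples, one per
    text unit. Every unit also contributes to the "total" entry.
    """
    totals = defaultdict(Count)
    for category, words, characters in units:
        totals["total"].add(words, characters)
        totals[category].add(words, characters)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical per-sentence figures, for illustration only.
    sample_units = [
        ("exact_matched", 10, 48),
        ("fuzzy_matched", 14, 70),
        ("fuzzy_matched", 6, 31),
        ("numeric_only", 2, 8),
    ]
    for name, count in accumulate(sample_units).items():
        print(name, count.words, count.characters)
```

With totals broken out this way, customer and supplier can agree which categories to include, discount or exclude when costing a job.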

Similar counts exist for characters. In summary, GMX-V provides unambiguous and verifiable counts for words and characters, standalone punctuation, inline code and references for all languages and scripts. It also provides additional qualitative counts for the text element categories detailed above.

GMX-V is based on well-defined standards: XLIFF, Unicode (ISO/IEC 10646) and Unicode TR29. This detail allows a precise and unambiguous definition of the localization task for a given electronic file, so that suppliers and customers can measure the task in hand precisely. This must surely be a good thing for the localization industry as a whole. In addition, GMX-V provides a way of electronically exchanging counts between different systems.