How to create glossaries in UTX

By Yuji Yamamoto May 24, 2011

A well-made glossary is useful whatever your role in a translation project. Have you ever wanted a simple, useful glossary for your translation project? If so, UTX may be the answer. UTX (Universal Terminology eXchange) is a simple, open glossary/dictionary format developed by the Asia-Pacific Association for Machine Translation (AAMT). It is written as tab-delimited UTF-8 text, which allows casual, rapid compilation of glossaries. Some might consider this a step backward, when XML is much more powerful. But consider this: do we always have time to create an XML glossary for a translation project of fewer than 50,000 words? We still need a glossary for a project of this size.

There are clear and rational reasons to use tab-delimited UTF-8 text. While TBX and TBX-Basic are rich and powerful, preparing data for these formats may require experts with a deep understanding of these specifications, XML, linguistics and so on. UTX can be created in the field by any translator using a text editor or a spreadsheet, with minimal knowledge of the specification and no knowledge of XML. With its built-in terminology management mechanism, UTX leverages the power of customer-generated media, that is, the collective effort to build a domain-specific glossary by accumulating contributed terminological entries.

While UTX serves as a practical human-readable glossary, it was primarily developed in Japan as a machine-readable dictionary. The translation industry in Japan faces many tough institutional, technical and moral issues. Despite its unique variety of writing systems (hiragana, katakana, kanji and roman characters), the Japanese language lacks a well-established writing standard comparable to the Chicago Manual of Style. There is no national translation certification. Awareness of the value of reusing and sharing translation assets is generally low. People are reluctant to try new technologies in business.

Interestingly, however, almost all major electronics brands in Japan have developed machine translation (MT) systems. In fact, the Japanese might currently have some of the best translation software on the market. All high-end Japanese translation software comes with a wide variety of options to assist translators in commercial translation. Unfortunately, this does not mean that these systems are perfectly ready for translators, or that they are widely used by translators of Japanese. In fact, only a few translators use them, and not always in the most effective way.

In an effort to streamline the translation process, integration of MT and translation memory (TM) has been attempted in many forms so far. However, many of these attempts still have room for improvement. I personally have been using MT for my translation work for 11 years. In an environment assisted by MT alone, I am able to translate from English into Japanese at a rate of 650 source words per hour, without using any TM segments or repetitions. This is about twice as fast as those who translate without MT. Furthermore, if I have a TM available, I get all the benefits of the TM tool as well. Many think that statistical MT (SMT) is the next big thing, but I am a firm proponent of rule-based MT (RBMT). In RBMT, translation software produces a translation based on terminology (glossaries/dictionaries) and grammar-based rules. I have concerns about the prevalence of SMT in recent years. With statistical patterns, some language pairs show remarkable fluency on the surface. SMT, however, gives you extra post-editing work, because its errors follow no apparent pattern. Terminology management must exist outside of an SMT system as an addition. In RBMT, on the other hand, terminology management is an integral part of the system. Even when it makes errors, RBMT gives you usable and controllable results. You can then quickly improve them by feeding corrections back into the user dictionary.

Background of UTX

The latest specification of UTX (formerly known as UTX-Simple), version 1.11, is the result of collective work by the UTX team. When rule-based translation software is used in a computer-aided translation workflow, specialized terminology and names of people and places in the source document are often not included in the basic system dictionaries, and they are not translated as well as one would expect. It is well established, however, that if terms are well-chosen and appropriate for a specific domain, core terminological information is sufficient to increase the adequacy and accuracy of MT. Unfortunately, user-created dictionaries are often incompatible across different MT systems, which wastes the effort spent creating them. To address this issue, AAMT has undertaken to establish a set of specifications for sharable dictionaries that can be used across different MT systems. In 1995, AAMT created its first specification, UPF (Universal PlatForm), with support from the Information-technology Promotion Agency, an institute in Japan. In 2006, AAMT started to create new specifications to reflect and incorporate subsequent advances in technology and the changing usage of MT. In 2007, the new format received the name UTX, short for Universal Terminology eXchange. UTX is an open standard. In 2009, AAMT established UTX-Simple, which was intended to be the simple, tab-delimited version of UTX. As AAMT learned that more complex functions can be achieved by TBX and TBX-Basic, it has focused on developing UTX-Simple. In April 2011, UTX-Simple changed its name to UTX, dropping "-Simple."

AAMT also produces and collects open user dictionary data for specialized domains. AAMT also hopes to create a user community for generating, sharing and accumulating user dictionaries in a sustainable way.

The goal of UTX is to create a simple, easy-to-make, easy-to-use dictionary that can be used by MT systems. UTX places emphasis on usability and simplicity over advanced manageability and lossless conversion.

The same UTX dictionary can be used by different manufacturers’ translation software. In addition, a UTX dictionary is human-readable, and can be used as a glossary that does not involve translation software at all.

When users of translation software make the effort to prepare user dictionaries, the resulting dictionaries are fragmented and dispersed, and thus not effective. Also, even a simple plain text file is difficult to share or reuse unless its format is standardized. However, if a single UTX standard is adopted, shared dictionaries can be used widely across various tools, such as translation software from different manufacturers, and are also highly reusable.

UTX is suitable, for example, for the rapid compilation of glossaries from multiple resources, or for the distribution and reuse of glossaries across a wide range of applications. For more complete, long-term terminology management, TBX may be more suitable.

UTX is specifically designed to be used by end users of translation software. UTX does not require any advanced technical knowledge of linguistics, grammar, XML or MT software to create, edit or use. UTX can be made from a minimum of data for both source and target language.

UTX can be used in any domain of translation, but should be specific to some subject or topic, such as ICT, medicine, law or engineering. Ideally, a domain should be highly specific, such as “cardiovascular surgery.” If a dictionary contains distinctly different domains, each domain should form a separate dictionary. It is easier to manage dictionaries this way, and dictionaries can be easily combined.

UTX may not be suitable for translation of non-specialized, general contents. When used for general contents, the benefit of UTX is limited. The framework of UTX assumes that the target MT system already has a well-developed system dictionary. UTX can increase the adequacy of translation where the existing system dictionary cannot. Non-technical terms can only be included in a UTX dictionary when they have specific meanings and target terms within the domain.

Dictionary information

The first line of a UTX file consists of necessary information about the UTX file, delimited by semicolons. It is specified as follows: #UTX <version>; <source language>/<target language>; <date created>; <creator>; <license>; <bidirectional (optional)>; <dictionary ID (optional)>; <other optional fields>

Source language/target language: ISO 639 and 3166 formats. For example, American English would be en-US and Japanese would be ja-JP. This is similar to language tags in HTML and XML: www.w3.org/International/articles/language-tags/Overview.en.php.

Date created: ISO 8601 format. For example, 14:28 (00 seconds), April 10, 2011, in Japan Standard Time (GMT plus 9 hours) is represented as:

2011-04-10T14:28:00Z+09:00
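As a rough illustration, a timestamp written this way could be read in Python as sketched below. The function name parse_utx_timestamp is hypothetical, and the handling of the "Z" is an assumption made for this sketch: when "Z" is directly followed by a numeric offset, the offset is taken as the time zone.

```python
from datetime import datetime


def parse_utx_timestamp(text: str) -> datetime:
    """Parse a timestamp such as 2011-04-10T14:28:00Z+09:00 (hypothetical helper)."""
    cleaned = text.strip()
    # Assumption: a "Z" directly followed by a numeric offset is dropped,
    # and the offset is treated as the time zone.
    for sign in ("+", "-"):
        cleaned = cleaned.replace("Z" + sign, sign)
    # A bare trailing "Z" is read as UTC.
    if cleaned.endswith("Z"):
        cleaned = cleaned[:-1] + "+00:00"
    return datetime.fromisoformat(cleaned)


created = parse_utx_timestamp("2011-04-10T14:28:00Z+09:00")
print(created.isoformat())   # 2011-04-10T14:28:00+09:00
print(created.utcoffset())   # 9:00:00
```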

The license of the dictionary can be declared in the form of Creative Commons, public domain or other forms of license. It is strongly recommended to clarify how the dictionary can be shared and used. When all terms of a dictionary are known to be used in the opposite translation direction, the dictionary can have a “bidirectional” flag in its header. In this case, individual term status does not need to be indicated, because all terms are assumed to be approved.

The following is an example of a header: #UTX 1.11; en-US/ja-JP; 2011-04-15T10:00:00Z+09:00; copyright: AAMT (2011); license: CC-by 3.0; bidirectional
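A header line like this one can be split into its parts with a few lines of Python. The following is only a minimal sketch: the function name parse_utx_header and the dictionary keys are hypothetical, and reading fields such as "copyright: AAMT (2011)" as key/value pairs is an assumption based on the example header rather than on the full specification.

```python
def parse_utx_header(line: str) -> dict:
    """Split the first line of a UTX file into its fields (hypothetical helper)."""
    if not line.startswith("#UTX"):
        raise ValueError("not a UTX header line")
    fields = [field.strip() for field in line[1:].split(";")]
    version = fields[0].split()[1]            # "UTX 1.11" -> "1.11"
    source, target = fields[1].split("/")     # "en-US/ja-JP"
    header = {
        "version": version,
        "source": source,
        "target": target,
        "created": fields[2],                 # kept as text; contains colons
        "bidirectional": False,
        "other": {},
    }
    # Assumption: remaining fields are either the "bidirectional" flag
    # or "key: value" pairs, as in the example header.
    for field in fields[3:]:
        if field == "bidirectional":
            header["bidirectional"] = True
        elif ":" in field:
            key, value = field.split(":", 1)
            header["other"][key.strip()] = value.strip()
        elif field:
            header["other"][field] = ""
    return header


header = parse_utx_header(
    "#UTX 1.11; en-US/ja-JP; 2011-04-15T10:00:00Z+09:00; "
    "copyright: AAMT (2011); license: CC-by 3.0; bidirectional"
)
print(header["source"], header["target"], header["bidirectional"])
```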

A dictionary ID is a unique identifier for a dictionary. It consists of four case-insensitive alphanumeric characters chosen by the dictionary administrator. It may be required to distinguish dictionaries when multiple dictionaries are merged. If two dictionaries happened to share entries with the same concept ID, unrelated entries could be grouped into one group of entries. Unique dictionary IDs can help avoid such a situation. A dictionary ID is not mandatory. It can be added when multiple dictionaries need to be merged (Table 1).

Column definitions and body

The column definitions appear on the second line of the file or, if the header has additional descriptive lines, on the last line of the header. Like the other header lines, this line begins with #. Column definitions consist of three mandatory columns, followed by optional user-defined columns, separated by tab characters. The column definitions and the body are closely related.

The body of a UTX file consists of entries, one per logical line. The first column, marked src, contains source terms, that is, terms in the source language.

The second column, tgt, contains target terms, that is, terms in the target language. The target language column is mandatory; however, in the case of a monolingual dictionary, it can be left blank.

The third column, src:pos, contains parts of speech for the source term. UTX has the following parts of speech: noun, proper noun, verb, adjective, adverb and sentence. If the part of speech is unknown then leave it blank. Sentence should only be used when necessary. Entries of translated sentence pairs should be stored in a TM rather than a dictionary.
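The layout described above is simple enough to read without special tools. The sketch below is one possible reader, not a reference implementation: the function name read_utx and the sample entries are made up for illustration, and it assumes that the header is the leading run of lines beginning with # and that later # lines are comments.

```python
SAMPLE = (
    "#UTX 1.11; en-US/ja-JP; 2011-04-15T10:00:00Z+09:00; "
    "copyright: AAMT (2011); license: CC-by 3.0\n"
    "#src\ttgt\tsrc:pos\n"
    "XML declaration\tXML 宣言\tnoun\n"
    "# a comment line inside the body\n"
    "user dictionary\tユーザー辞書\t\n"
)


def read_utx(text):
    """Return (column names, entries) from UTX text (hypothetical helper)."""
    lines = text.splitlines()
    # The header is the leading run of lines that begin with "#".
    header_end = 0
    while header_end < len(lines) and lines[header_end].startswith("#"):
        header_end += 1
    if header_end < 2:
        raise ValueError("expected a header line and a column definition line")
    # The last header line holds the tab-separated column definitions.
    columns = lines[header_end - 1][1:].split("\t")
    entries = []
    for line in lines[header_end:]:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        values = line.split("\t")
        values += [""] * (len(columns) - len(values))  # pad missing cells
        entries.append(dict(zip(columns, values)))
    return columns, entries


columns, entries = read_utx(SAMPLE)
for entry in entries:
    print(entry["src"], "->", entry["tgt"], "|", entry["src:pos"] or "(unknown)")
```

Because the format is plain tab-delimited text, the same file also opens directly in a spreadsheet.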

An entry can optionally have one of four term statuses (provisional, approved, non-standard or forbidden) to indicate the terminological state of the term. Term status is only managed for the primary translation direction of a dictionary. If term status needs to be managed for the reversed translation direction, a separate dictionary should be compiled.

The term status provisional means that an entry is entered into a dictionary by a contributor but not yet checked by the dictionary administrator. As a provisional status is temporary, the dictionary administrator is expected to promptly decide if the term should be any of approved, non-standard or forbidden. The dictionary administrator may also choose to exclude (delete) the term from the dictionary.

The term status approved means that an entry has been approved by the dictionary administrator. An approved status indicates that the term must be used. The rationale may vary, but usually the term is a technical term within a specific domain or belongs to the glossary of an organization. If the word form of the term has variations, such as plug-in and plugin, only the approved form should be used.

When there is a clear reason, a source term can have multiple target terms (and thus multiple entries), but only one of those entries can have the approved status.

If its term status is approved, a term can be reversed to be used for the opposite of the translation direction that is defined in a dictionary.

An approved term is always bidirectional. When there are multiple translations to a source term, and a user chooses to use them for a reversed translation direction, an approved term will be the only valid term.
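The reversal rule can be sketched in a few lines of Python. This is an illustration only: the function name reverse_entries, the "term status" key and the sample entries are assumptions made for the sketch, and the bidirectional parameter stands for the header flag under which all terms are assumed to be approved.

```python
def reverse_entries(entries, bidirectional=False):
    """Build entries for the opposite translation direction (hypothetical helper)."""
    reversed_entries = []
    for entry in entries:
        # Only approved terms are valid in the reversed direction, unless the
        # whole dictionary is flagged as bidirectional.
        if bidirectional or entry["term status"] == "approved":
            reversed_entries.append(
                {"src": entry["tgt"], "tgt": entry["src"], "term status": "approved"}
            )
    return reversed_entries


entries = [
    {"src": "window", "tgt": "ウィンドウ", "term status": "approved"},
    {"src": "window", "tgt": "窓", "term status": "forbidden"},
]
reversed_entries = reverse_entries(entries)
print(reversed_entries)
```

With these sample entries, only the approved pair survives the reversal, so the reversed dictionary maps ウィンドウ back to window and nothing else.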

If the dictionary has a single contributor, the contributor (who is also the dictionary administrator) may choose to assign the approved status immediately after adding an entry to the dictionary. Alternatively, the contributor may also choose to leave the status blank or assign provisional status, until he or she can confirm that the new entry works fine in the translation project.

The term status non-standard indicates one or more non-standard source terms. Non-standard terms are only permitted to accommodate variations of source terms. Non-standard terms should not be used as target terms. If a UTX dictionary is used as a glossary for authoring of documents (rather than translation), non-standard terms should not be used as terms, because they are entered in the dictionary so that an MT system can translate even if the author of the source document used improper words that are not approved.

The term status forbidden means that an entry includes a target term which should not be used. Such words are explicitly forbidden from linguistic, social, terminological, branding, or other viewpoints. A target term may also need to be suppressed to avoid conflict with different domain-specific dictionaries, when a translation tool does not properly honor the priorities among multiple dictionaries.

A forbidden term is usually accompanied by a term that is not forbidden: an approved term, a provisional term or a term without a term status. For example, in the context of ICT, the English term window is very unlikely to be translated as the Japanese word 窓, which means an opening in a wall. In the context of ICT, the Japanese word ウィンドウ must be used instead to refer to a certain area on a computer screen. The term 窓 may need to be explicitly suppressed if the MT system cannot handle it appropriately.

A forbidden term could have an approved status in a reversed entry. Also, a non-standard term could have forbidden status in a reversed entry (for example, if the concept ID is the same).

It is preferable that a translation tool has a mechanism to suppress the use of forbidden terms systematically. It is possible that a term is forbidden in one dictionary, but if the same term exists in another dictionary, it may not be forbidden. In other words, when using a set of dictionaries in a translation tool, the forbidden status may differ among the dictionaries. A translation tool or a UTX converter tool should preferably have a mechanism to detect such conflict. Forbidden terms can be extracted to be used for terminological checks outside of a translation tool.
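Both ideas in the paragraph above, a terminological check outside a translation tool and the detection of conflicting statuses across dictionaries, can be sketched as follows. The function names, the "term status" key and the sample data are hypothetical. The check uses plain substring matching, which is a simplification: it would also flag a compound word such as 窓口 that merely contains a forbidden term.

```python
def find_forbidden(translation, entries):
    """List forbidden target terms that occur in a translated text."""
    forbidden = {e["tgt"] for e in entries if e["term status"] == "forbidden"}
    # Naive substring check; no word segmentation is attempted.
    return sorted(term for term in forbidden if term in translation)


def status_conflicts(dict_a, dict_b):
    """List (src, tgt) pairs that are forbidden in one dictionary only."""
    status_a = {(e["src"], e["tgt"]): e["term status"] for e in dict_a}
    conflicts = []
    for e in dict_b:
        key = (e["src"], e["tgt"])
        if key not in status_a:
            continue
        statuses = (status_a[key], e["term status"])
        if "forbidden" in statuses and statuses[0] != statuses[1]:
            conflicts.append(key)
    return conflicts


ict = [
    {"src": "window", "tgt": "ウィンドウ", "term status": "approved"},
    {"src": "window", "tgt": "窓", "term status": "forbidden"},
]
architecture = [
    {"src": "window", "tgt": "窓", "term status": "approved"},
]

print(find_forbidden("窓を閉じます。", ict))
print(find_forbidden("ウィンドウを閉じます。", ict))
print(status_conflicts(ict, architecture))
```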

Concept ID and other optional columns

When there are multiple entries with different term statuses, use a concept ID to indicate that they share the same concept. If a source term has only one target, concept ID is not required.

A concept ID consists of serial, unique numbers within a dictionary. A concept ID must be a numeric value of up to ten digits. When multiple dictionaries are merged, entries with the same concept ID can be distinguished by their dictionary IDs.
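The way a dictionary ID keeps concept IDs apart after a merge can be sketched as below. The ID formats follow the rules stated in this article (four alphanumeric characters for a dictionary ID, up to ten digits for a concept ID). The function name merge_dictionaries, the "concept key" field, the "concept ID" column name and the sample dictionary IDs are assumptions made for illustration.

```python
import re

DICT_ID = re.compile(r"[A-Za-z0-9]{4}")   # four alphanumeric characters
CONCEPT_ID = re.compile(r"[0-9]{1,10}")   # numeric, up to ten digits


def merge_dictionaries(dictionaries):
    """Merge entries from several dictionaries, keyed by dictionary ID."""
    merged = []
    for dict_id, entries in dictionaries.items():
        if not DICT_ID.fullmatch(dict_id):
            raise ValueError(f"invalid dictionary ID: {dict_id!r}")
        for entry in entries:
            concept = entry.get("concept ID", "")
            if concept and not CONCEPT_ID.fullmatch(concept):
                raise ValueError(f"invalid concept ID: {concept!r}")
            # Dictionary IDs are case-insensitive, so normalize the case.
            key = (dict_id.upper(), concept) if concept else None
            merged.append({**entry, "concept key": key})
    return merged


merged = merge_dictionaries({
    "ict1": [{"src": "window", "tgt": "ウィンドウ", "concept ID": "1"}],
    "MED1": [{"src": "artery", "tgt": "動脈", "concept ID": "1"}],
})
for entry in merged:
    print(entry["src"], entry["concept key"])
```

Both sample entries carry the concept ID 1, but the combined key keeps them in separate groups.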

In the example in Table 2, term numbers 1, 2 and 3 point to the same concept. So do 6 and 7. In 4 and 5, the concept ID is blank because they follow the “one word, one meaning” principle; therefore, no other entries exist that need to be distinguished. Note that the word アウトレット is forbidden as the target term of outlet, but it is approved as a part of the word アウトレット ストア. Also note that where multiple entries share the same concept ID, there is only one approved term. All others are either forbidden or non-standard.
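The rule that a concept has only one approved term lends itself to an automatic check. The following sketch groups entries by concept ID and reports any concept with more than one approved entry. The entries shown are invented for illustration and are not the contents of Table 2; the column names "term status" and "concept ID" and the function name are likewise assumptions.

```python
from collections import defaultdict


def concepts_with_multiple_approved(entries):
    """Return (concept ID, count) for concepts with more than one approved entry."""
    groups = defaultdict(list)
    for entry in entries:
        if entry["concept ID"]:          # a blank concept ID is not grouped
            groups[entry["concept ID"]].append(entry)
    problems = []
    for concept_id, group in sorted(groups.items()):
        approved = [e for e in group if e["term status"] == "approved"]
        if len(approved) > 1:
            problems.append((concept_id, len(approved)))
    return problems


entries = [
    {"src": "outlet", "tgt": "コンセント", "term status": "approved", "concept ID": "1"},
    {"src": "outlet", "tgt": "アウトレット", "term status": "forbidden", "concept ID": "1"},
    {"src": "plug-in", "tgt": "プラグイン", "term status": "approved", "concept ID": "2"},
    {"src": "plugin", "tgt": "プラグイン", "term status": "approved", "concept ID": "2"},
    {"src": "XML declaration", "tgt": "XML 宣言", "term status": "approved", "concept ID": ""},
]
problems = concepts_with_multiple_approved(entries)
print(problems)
```

In this made-up data, concept 2 is reported because both spelling variants were marked approved; one of them would need the non-standard status instead.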

The fifth column and any other additional columns are optional; a user can define as much information as he or she wants to.

UTX guidelines

UTX works better if MT systems have the following functions: a sound system dictionary for non-technical terms; the capability of using a set of multiple user dictionaries; priority of longer compound words over shorter words; and the capability of suppressing the use of forbidden entries.
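To show what "priority of longer compound words over shorter words" means in practice, here is a small sketch of a longest-match term lookup. It is not how any particular MT system works; the function name find_terms and the sample dictionary are hypothetical, and the lookup relies on word boundaries in the source text, which suits English source text only.

```python
import re


def find_terms(text, dictionary):
    """Find dictionary terms in text, preferring longer terms over shorter ones."""
    if not dictionary:
        return []
    # Listing longer terms first makes the regular expression try them first.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [(m.group(0), dictionary[m.group(0)]) for m in pattern.finditer(text)]


dictionary = {
    "outlet": "コンセント",
    "outlet store": "アウトレット ストア",
}
matches = find_terms("the outlet store near the outlet", dictionary)
print(matches)
```

The compound outlet store is matched as a whole, and the shorter term outlet is used only where the compound does not apply.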

In general, a UTX dictionary should only contain technical terms of a specific domain. In most cases, entries are nouns, especially compound nouns. Translation accuracy can be improved by collecting, sharing and reusing the data of fine-tuned bilingual translations that are not included in translation software out-of-box. Sentences should not usually be included in a UTX dictionary, except when it is appropriate to treat them as “words.” As a rule, UTX should be separated from the TM, which is a bilingual database of sentences, rather than words.

For example, a term like XML declaration can be correctly translated into its Japanese equivalent, XML 宣言, by just registering it in a user dictionary. Basic vocabulary like window should not be included, because such a word is already contained in the system dictionaries of translation software. Add only one translation for each entry, avoid basic words in the system dictionary, and define the specific domain of the dictionary clearly. The basic form of the word should be entered: the singular form for a noun and the root form for a verb, as you would see in a commercial dictionary. Choose only the single most appropriate translation corresponding to a source term. If a source term has multiple distinctly different meanings, they can be treated as separate entries. Do not add words that are dependent on a specific MT system. Alphabetic characters and numbers should be written in single-byte characters, not multi-byte characters. Do not use an ellipsis (…) to indicate a variable within an entry. Do not add any comments directly in a mandatory field; add a comment either by adding a comment column to the dictionary table or by adding a comment line that begins with #.

In English, always begin an entry with lowercase, except in the case of proper nouns. Do not include articles such as a, an and the, except in the special case that they are part of a proper noun.
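Several of these guidelines are mechanical enough to check automatically before a dictionary is shared. The sketch below covers four of them for English source terms. The function name lint_entry and the messages are hypothetical, and one assumption goes beyond the text: a first word written entirely in capitals, such as XML, is treated as an acronym and not reported.

```python
import re

ARTICLES = ("a ", "an ", "the ")
# Full-width (multi-byte) digits and Latin letters.
FULLWIDTH_ALNUM = re.compile(r"[\uFF10-\uFF19\uFF21-\uFF3A\uFF41-\uFF5A]")


def lint_entry(src, pos):
    """Return a list of guideline problems for one English source term."""
    problems = []
    words = src.split()
    if pos != "proper noun" and words:
        first = words[0]
        # Assumption: all-caps first words (acronyms) are not reported.
        if first[0].isupper() and first[1:].islower():
            problems.append("begins with uppercase")
        if src.lower().startswith(ARTICLES):
            problems.append("begins with an article")
    if "…" in src or "..." in src:
        problems.append("contains an ellipsis")
    if FULLWIDTH_ALNUM.search(src):
        problems.append("contains full-width alphanumerics")
    return problems


for src, pos in [
    ("XML declaration", "noun"),
    ("The Window", "noun"),
    ("The Hague", "proper noun"),
    ("copy … to", "verb"),
]:
    print(src, "->", lint_entry(src, pos))
```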

Conclusion

The translation industry already has useful technologies. What it lacks is training and knowledge. UTX will be the key to connecting MT and TM technology more smoothly. I encourage companies with an R&D team to try RBMT with UTX. As an optimal greatest common divisor, UTX provides a versatile platform to reuse, manage, and exchange essential terminological information across different MT/terminological tools. It can be used in many CAT and terminological tools, as well as in spreadsheets or text editors.

UTX is not intended to replace any existing terminology formats such as TBX and TBX-Basic. They would be better for larger, long-term management of terminology. Instead, UTX is designed to supplement TBX and TBX-Basic in the scenarios where these formats are too complicated to implement. UTX can also serve as a “bridge,” filling the gap between various spreadsheet-based glossaries and TBX/TBX-Basic.

A converter has been developed, thanks to Professor Alan K. Melby and his team, and is available at www.ttt.org/tbxg. It can convert among several glossary formats, including TBX-Glossary and UTX.

UTX is an open standard. It is still young, and it welcomes your feedback. The full specification is available from AAMT: www.aamt.info/english/utx.