Technology

Localization engineering for translators

Freeware and open-source tools

Christine Bruckner holds university degrees in translation and computational linguistics. She has worked in the localization industry for over 25 years, initially as a freelance translator, and since 2001 as a language technology specialist and localization engineer for corporate and government language services as well as LSPs. She would like to thank her friend and former localization engineering colleague Maria Finnegan for her input and help in writing this article.

Christine Bruckner

Localization engineering refers to, among other things, the file preparation and post-processing steps typically carried out by specialists either before or after the actual translation is completed. Responsibilities of localization engineers may also include terminology and translation memory optimizations. To perform their tasks, localization engineers are usually expected to be proficient in Java, Visual Basic, Perl, Python or similar programming languages. However, several freeware and open-source localization tools make it possible to complete such tasks without any programming or scripting skills, so tech-savvy translators and translation project managers can use them too.

Localization engineering tasks that translators and translation project managers often face include:

  • Merging or splitting TMX files
  • Cleaning up a translation memory (TM) with multiple target units for the same source unit
  • Removing TMX tags and their content from multiple TMs
  • Generating TMX files from bilingual files for import into a TM
  • Preparing multilingual glossaries for import into a termbase
  • Pre-processing XML files using regular expressions and storing these workflows for subsequent reuse

These tasks can be completed using the desktop-based tools in Figure 1, which are among the most popular freeware and open-source applications available for localization professionals. For download information, see the “Links” box.

Figure 1: Popular tools for translators.

Pre- and post-processing localization files

Many of the file formats used in software localization are text-based, for example .properties, .json, variations of .csv, or different flavors of HTML or XML (.xliff, .resx). Although computer-assisted translation (CAT) tools and translation management systems (TMS) provide built-in support for translating most of these file formats, the corresponding parser settings often need to be customized, or the files have to be pre-processed in order to properly protect nontranslatable content.

In order to facilitate preparation and post-processing of such files, you can use localization tools like those mentioned here for converting specific input formats to localization interchange formats (TMX, XLIFF), performing search and replace operations and XML verification, comparing different file versions and so on.
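For readers who do script, the XML verification mentioned above can be approximated in a few lines of Python. This is only a minimal sketch using the standard library: it checks well-formedness, not validity against a schema. The function name `is_well_formed` and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: in the second one, an edit dropped a closing tag.
good = "<resources><string name='ok'>OK</string></resources>"
bad = "<resources><string name='ok'>OK</resources>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Running such a check before and after every pre-processing step is a cheap way to notice that a search and replace operation has damaged the markup.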

Terms

eXtensible Markup Language (XML): a markup language (not a programming language) pared down from SGML and originally designed especially for web documents.

Translation Memory eXchange (TMX): an open, XML-based standard designed to simplify and automate the process of converting translation memories from one format to another.

Figure 2: Find in Files dialogue and Plugins in Notepad++.

Notepad++ enables find and replace operations in files and folders (Figure 2) using regular expressions and has very helpful extensions like the Compare and XML Tools plugins.
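A scripted counterpart to a regular-expression Find in Files operation might look like the following Python sketch. It is not how Notepad++ works internally. The function name `replace_in_files`, the `.bak` backup naming and the assumption that all files are UTF-8 encoded are choices made for this example only (older .properties files, for instance, are often ISO-8859-1).

```python
import re
from pathlib import Path


def replace_in_files(folder: str, pattern: str, replacement: str,
                     glob: str = "*.properties") -> int:
    """Apply a regex replacement to every matching file below folder.

    Returns the total number of replacements made.
    """
    regex = re.compile(pattern)
    total = 0
    for path in sorted(Path(folder).rglob(glob)):
        # Read and write bytes so that line endings are left untouched.
        text = path.read_bytes().decode("utf-8")
        new_text, count = regex.subn(replacement, text)
        if count:
            # Keep a backup copy next to the original before overwriting it.
            path.with_name(path.name + ".bak").write_bytes(text.encode("utf-8"))
            path.write_bytes(new_text.encode("utf-8"))
            total += count
    return total


# Hypothetical usage: normalize "key = value" to "key=value" in a folder.
# replace_in_files("resources", r"(?m)^(\w+)[ \t]*=[ \t]*", r"\1=")
```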

Okapi Rainbow (Figure 3) supports batch processing of multiple files as well as reusing and combining steps into pipelines: the XML Analysis pipeline, for example, generates a report that identifies XML elements and attributes that actually contain text.
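To give an idea of what such an analysis reports, here is a minimal Python sketch that counts the elements and attributes of an XML document that carry non-whitespace text. It is not Okapi's implementation and its output format differs from Rainbow's report. The function name `text_bearing_items` and the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_bearing_items(xml_text: str) -> Counter:
    """Count elements and attributes whose values contain non-whitespace text.

    Text that follows a child element (mixed content) is ignored in this sketch.
    """
    counts = Counter()
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text and elem.text.strip():
            counts[f"<{elem.tag}>"] += 1
        for name, value in elem.attrib.items():
            if value.strip():
                counts[f"<{elem.tag}> @{name}"] += 1
    return counts


sample = """<dialog id="d1">
  <title>Settings</title>
  <button tooltip="Save changes">Save</button>
  <spacer/>
</dialog>"""

for item, n in sorted(text_bearing_items(sample).items()):
    print(item, n)
```

The list still needs human judgment: an attribute such as `id` contains text but is not translatable, whereas `tooltip` probably is.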

Figure 3: Convert File Format pipeline in Okapi Rainbow.

However, whenever you manipulate files, keep backups and run sanity checks, roundtrip tests, pseudo-translations and so on, as files might become corrupted or be modified inconsistently, especially when large files are processed on less powerful computers.
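As an illustration of a pseudo-translation combined with a sanity check, the following Python sketch accents the vowels in a string while leaving placeholders and tags alone, then verifies that the protected items survived. The placeholder patterns and the names `PROTECTED`, `pseudo_translate` and `protected_spans` are assumptions for this example. Real files may use other placeholder syntaxes.

```python
import re

# Spans that must survive untouched (assumed placeholder and tag patterns).
PROTECTED = re.compile(r"(\{[^{}]*\}|%[sd]|<[^<>]+>)")
# Maps a e i o u A E I O U to accented counterparts (à é î õ ü À É Î Õ Ü).
ACCENTED = str.maketrans(
    "aeiouAEIOU",
    "\u00e0\u00e9\u00ee\u00f5\u00fc\u00c0\u00c9\u00ce\u00d5\u00dc",
)


def pseudo_translate(text: str) -> str:
    """Accent vowels and add brackets, leaving protected spans unchanged."""
    parts = PROTECTED.split(text)
    # With one capturing group, re.split puts the protected spans at odd indexes.
    out = [p if i % 2 else p.translate(ACCENTED) for i, p in enumerate(parts)]
    return "[" + "".join(out) + "]"


def protected_spans(text: str) -> list:
    """Return all placeholders and tags found in text, in order."""
    return PROTECTED.findall(text)


source = "Delete {count} files from <b>%s</b>?"
pseudo = pseudo_translate(source)
print(pseudo)
# Sanity check: the same protected items must appear before and after.
print(protected_spans(source) == protected_spans(pseudo))
```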

Preparing terminology and glossaries

Most commercial terminology tools provide functionality for converting word lists and terminology files (Excel or CSV lists, for example) into the terminology exchange format TBX.

However, freeware and open-source tools can provide complementary features.

For example, OpenOffice Calc can be a good choice for cleaning up and preparing glossaries in .csv and .txt formats for subsequent import into a termbase or upload to machine translation (MT) dictionaries. Its main advantage over Microsoft Excel is that OpenOffice Calc supports the use of regular expressions in its filtering and search and replace options.

Using the Convert File Format pipeline in Okapi Rainbow, you can also convert glossaries from .csv to TMX for use as a (reference) translation memory.
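The idea behind such a conversion can be sketched in Python as follows. This is not what Rainbow does internally. It assumes a two-column, comma-separated glossary without a header row, and the function name `csv_to_tmx` and the header values (such as the creation tool name) are invented for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def csv_to_tmx(csv_text: str, src_lang: str, tgt_lang: str) -> str:
    """Build a minimal TMX 1.4 document from two-column CSV rows."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "csv_to_tmx-sketch", "creationtoolversion": "0.1",
        "segtype": "phrase", "o-tmf": "csv", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue  # skip rows without both a source and a target term
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, row[0]), (tgt_lang, row[1])):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text.strip()
    return ET.tostring(tmx, encoding="unicode")


print(csv_to_tmx("file,Datei\nprinter,Drucker\n", "en", "de"))
```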

Converting and maintaining TMs

Proper maintenance of TMs — both in terms of linguistic content and metadata — is a key factor in ensuring efficient and consistent translation.

When analyzing TMs and making decisions about inconsistent phrasing, terminology harmonization and similar linguistic quality issues, the best practice recommended by the 2018 GILT Leaders Forum is: “A linguist should determine if a change or removal is warranted.… Use tools, scripts & macros to run quality checks on TMs, then fix manually.” According to GILT, the preferred open-source tools for TM conversion and maintenance are Olifant, Okapi Rainbow and Checkmate. These tools not only allow conversions to and from TMX (Figure 4) but also support technical and linguistic maintenance that goes beyond simple search and replace operations:

Using Okapi Rainbow, you can convert bilingual content from table-style formats to TMX, and, using its Table Filter configurations, transform nontranslatable elements into inline codes.

Figure 4: TMX Code Removal options in Okapi Rainbow and in Olifant.

In Heartsome TMX Editor, Olifant or Okapi Rainbow, you have several options for removing TMX tags and/or their content.
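The difference between removing only the tags and removing the tags together with their content can be illustrated with a short Python sketch. It is not the implementation used by any of these tools. It treats the TMX elements `bpt`, `ept`, `ph`, `it` and `ut` as inline codes and keeps the text inside `hi` elements. The function names and the sample fragment are invented for this example.

```python
import xml.etree.ElementTree as ET

# TMX inline elements that carry native formatting codes.
INLINE_CODES = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(elem: ET.Element, keep_code_content: bool = False) -> str:
    """Flatten an element to text, dropping inline code elements."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in INLINE_CODES:
            # For example <hi>: its text is translatable, so keep it.
            parts.append(plain_text(child, keep_code_content))
        elif keep_code_content:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text after the child always stays
    return "".join(parts)


def strip_codes(tmx_text: str, keep_code_content: bool = False) -> str:
    """Return the TMX document with all <seg> elements reduced to plain text."""
    root = ET.fromstring(tmx_text)
    for seg in list(root.iter("seg")):
        text = plain_text(seg, keep_code_content)
        for child in list(seg):
            seg.remove(child)
        seg.text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical TMX fragment (header omitted for brevity).
fragment = (
    '<tmx version="1.4"><body><tu><tuv xml:lang="en"><seg>Click '
    '<bpt i="1">&lt;b&gt;</bpt>Save<ept i="1">&lt;/b&gt;</ept> now.'
    '</seg></tuv></tu></body></tmx>'
)
print(strip_codes(fragment))
print(strip_codes(fragment, keep_code_content=True))
```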

Heartsome TMX Editor allows you to validate, split and merge TMX files.
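What merging and splitting amount to at the file level can be sketched in Python as follows. This is not Heartsome's implementation. It assumes that all files share a compatible header (same source language), performs no validation or deduplication, and the function names `merge_tmx` and `split_tmx` are invented for this example.

```python
import copy
import xml.etree.ElementTree as ET


def merge_tmx(tmx_texts: list) -> str:
    """Append the <tu> elements of all later TMX documents to the first one."""
    roots = [ET.fromstring(t) for t in tmx_texts]
    merged = roots[0]
    body = merged.find("body")
    for other in roots[1:]:
        body.extend(other.findall("./body/tu"))
    return ET.tostring(merged, encoding="unicode")


def split_tmx(tmx_text: str, max_units: int) -> list:
    """Split one TMX document into parts with at most max_units <tu> each."""
    root = ET.fromstring(tmx_text)
    header = root.find("header")
    units = root.findall("./body/tu")
    parts = []
    for start in range(0, len(units), max_units):
        part = ET.Element("tmx", root.attrib)
        if header is not None:
            part.append(copy.deepcopy(header))  # every part gets the header
        ET.SubElement(part, "body").extend(units[start:start + max_units])
        parts.append(ET.tostring(part, encoding="unicode"))
    return parts
```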

Both Heartsome TMX Editor and Olifant allow you to view and edit TMX files and provide options that help to identify and optionally clean up inconsistencies in source and/or target segments.
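The most common inconsistency, one source segment with several different targets, is easy to detect once the segment pairs have been extracted from the TMX file. The following Python sketch assumes the pairs are already available as a list of (source, target) tuples. The function name and the sample data are invented for illustration.

```python
from collections import defaultdict


def find_inconsistent_sources(pairs) -> dict:
    """Map each source segment that has more than one distinct target
    to the sorted list of those targets."""
    targets = defaultdict(set)
    for source, target in pairs:
        targets[source.strip()].add(target.strip())
    return {src: sorted(tgts) for src, tgts in targets.items() if len(tgts) > 1}


pairs = [
    ("Save", "Speichern"),
    ("Save", "Sichern"),
    ("Cancel", "Abbrechen"),
    ("Save ", "Speichern"),  # differs only by trailing whitespace
]
print(find_inconsistent_sources(pairs))
```

Swapping source and target in each tuple gives the reverse check, that is, targets used for more than one source. Which variant to keep remains a linguist's decision.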

Potential linguistic issues such as repeated words, corrupted characters, inline code differences, differences in length between source and target, missing translations and so on can be analyzed in Okapi Checkmate or via the Quality Check step in Okapi Rainbow.
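A few of these checks can be approximated in Python, as in the sketch below. The checks in Checkmate are far more configurable. The patterns, the length ratio threshold of 2.0 and the function name `check_pair` are assumptions for this example, and the repeated-word check will also flag legitimate repetitions.

```python
import re

REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
# Assumed inline code syntax: XML-like tags and numbered placeholders.
INLINE_CODE = re.compile(r"<[^<>]+>|\{\d+\}")


def check_pair(source: str, target: str, max_ratio: float = 2.0) -> list:
    """Return a list of issue labels for one source/target pair."""
    if not target.strip():
        return ["missing translation"]
    issues = []
    if REPEATED_WORD.search(target):
        issues.append("repeated word")
    if "\ufffd" in target:  # Unicode replacement character
        issues.append("corrupted character")
    if sorted(INLINE_CODE.findall(source)) != sorted(INLINE_CODE.findall(target)):
        issues.append("inline code difference")
    ratio = len(target) / max(len(source), 1)
    if ratio > max_ratio or ratio < 1 / max_ratio:
        issues.append("length difference")
    return issues


print(check_pair("Press <b>OK</b> to continue", "Klicken Sie Sie auf OK"))
print(check_pair("Save the file", ""))
```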

Preparing TMs for MT training

TMs and termbases are also your most important assets when training and customizing MT engines. However, clean and consistent training data is required, particularly in the case of neural MT. Localization engineers in MT teams often develop their own scripts and pipelines for data inspection and cleanup of TMs, but they also use open-source and freeware tools for these purposes.

Okapi Rainbow has built-in steps and pipelines for detecting encoding issues, corrupt characters and escape characters; Okapi Checkmate and Olifant help to identify and remove segments that are too long or short. For more examples, see the Fernández Rowda and GILT links.
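A scripted filter of this kind might look like the following Python sketch. It is not taken from any of the tools mentioned. The word limits, the mojibake heuristic (UTF-8 text that was wrongly decoded as Windows-1252) and the function names are assumptions for this example, and counting whitespace-separated words does not work for languages such as Chinese or Japanese.

```python
def looks_like_mojibake(text: str) -> bool:
    """Heuristic: True if text looks like UTF-8 that was decoded as cp1252."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    return repaired != text


def keep_for_training(source: str, target: str,
                      min_words: int = 1, max_words: int = 80) -> bool:
    """Return True if the pair passes the length and encoding checks."""
    for text in (source, target):
        n = len(text.split())
        if n < min_words or n > max_words:
            return False
        if "\ufffd" in text or looks_like_mojibake(text):
            return False
    return True


pairs = [
    ("Press the button", "Dr\u00fccken Sie die Taste"),          # fine
    ("Press the button", "Dr\u00c3\u00bccken Sie die Taste"),    # mojibake
    ("Press the button", ""),                                    # too short
]
clean = [p for p in pairs if keep_for_training(*p)]
print(len(clean), "of", len(pairs), "pairs kept")
```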

Limitations, support and alternatives

As we have seen, freeware and open-source tools can complement and even bridge gaps in the functionality of commercial translation technology tools. However, the following potential drawbacks should be taken into account:

  • Some tools such as Heartsome TMX Editor or Olifant are no longer actively maintained or developed. As a result, they require the installation of older .NET or Java versions that might not comply with corporate IT guidelines or may no longer be free for commercial use.
  • Notepad++ and Okapi Framework are frequently updated, and the Okapi community, for example, is usually quick to address questions and enhancement requests. However, these tools rely on the voluntary commitment of a handful of programmers. So, some features and components such as the terminology check in Checkmate may still be under construction, or, like Olifant, have not yet been moved to a more modern code base.
  • The localization tools discussed in this article support XML and text-based formats as well as the exchange formats (TMX, TBX and XLIFF) used by commercial translation technology tools. However, commercial and open-source tools may not interpret and handle exchange formats exactly in the same way. There may be differences in the way metadata, special characters and inline tags are processed, and these items may even be removed when exported files are processed outside the application in which they were originally created.
  • Most open-source and freeware tools do not really excel in terms of user-friendliness.

So, although freeware and open-source tools are very useful for many localization tasks, it may still be worth considering a commercial translation technology tool or add-on, or contracting the services of an experienced programmer or specialized vendor for certain use cases.

Links

Apache OpenOffice: www.openoffice.org

Fernández Rowda, J. M. (2015). 5 Tools to Build Your Basic Machine Translation Toolkit – Part I. www.linkedin.com/pulse/5-tools-build-your-basic-machine-translation-toolkit-fern%C3%A1ndez-rowda?trk=pulse_spock-articles

GILT Leaders Forum 2018, Best Practices in Translation Memory Management v.2.1. https://github.com/GILT-Forum/TM-Mgmt-Best-Practices/blob/master/best-practices.md

Heartsome TMX Editor 8: https://github.com/heartsome/tmxeditor8

The Pitfalls of Using Standalone TMX Editors – Nimdzi Finger Food. www.nimdzi.com/pitfalls-of-using-standalone-tmx-editors-nimdzi-finger-food/ (last lookup: April 14, 2020).

Notepad++: https://notepad-plus-plus.org

Okapi Framework: http://okapiframework.org

Olifant: http://okapi.sourceforge.net/Release/Olifant/Help