Traversing the Eastern 'block' with translation tools

By Michal Küfhaber & An Stuyven February 28, 2012

In the course of 2011 most Central and Eastern European (CEE) markets began to see the first signs of recovery after the crisis years 2009 and 2010. Even though for 2012 the forecast for the region is one of economic stagnation, some experts predict an upward trend for 2013.

This offers hope for the further growth of most economies in this region, and the markets are seeing an increasing interest in localization projects. The CEE region is a linguistically fragmented area with a relatively small population of around 110 million people; however, at least 13 different main languages can be identified. These small language groups are bound to struggle to keep up with the latest developments in areas such as machine translation (MT). Overcoming their limitations is quite challenging. For instance, the linguistic corpora are small in comparison with those of Western languages, and the scope for MT training is therefore narrower. A similar issue concerns new trends such as auto-suggest features that usually require translation memories (TMs) of about 25,000 translation units. There are only a few large global companies translating such volumes into CEE languages.

A variety of issues should be taken into account to produce high-quality localized products. First and foremost are language and cultural topics, which are the main factors determining final quality. To minimize the amount of time and money required to produce such high-quality localized products and texts, several technical issues have to be considered, including computer-aided translation (CAT) tools, quality assurance (QA) tools and terminology tools. All of these tools must first have their specifics set either manually or automatically for each language. Specifically for CEE languages, including languages using the Cyrillic alphabet, there are and have been issues for which the tools were not prepared. In the past, these issues involved elements such as character encoding, which had to be handled with special care both in CAT tools and in graphic software. For instance, it was not possible to clean up certain Bulgarian characters in the CAT tool Trados 6.5. In the graphic software FrameMaker 7, certain accented Polish or Czech characters could not be handled. In order to overcome these difficulties and to deliver error-free, high-quality localized products in CEE languages, special procedures had to be developed internally by translation agencies. Fortunately, the tools, technology and standards used during the localization process have evolved and improved. Thus, we no longer deal with many of these issues, thanks to the use of the Unicode standard and OpenType fonts, and also thanks to the work of dedicated developers and authors focused on internationalization. Of course, other problems still exist, such as text expansion after translation into languages that need more words or characters. This text expansion requires extensive technical post-processing if it is not taken into account at an earlier stage, and is a significant issue with Baltic or Slavic languages in particular.

On the other hand, there are new issues that have arisen as a result of market and consumer changes. Product quality alone is no longer the crucial sales argument for companies seeking to explore CEE markets. Geographical proximity and concurrent time zone availability have become increasingly important for customers. Consumers demand that service staff or representatives are available locally and speak their language. Companies have been forced to adapt to these new challenges and establish local subsidiaries or set up local representations. As a result, CEE translation companies, which have until now mostly provided services into CEE languages, are being continually challenged to provide services from CEE languages into English. Those companies require, for instance, the translation of documents produced by local subsidiaries into the native language of the parent company. Whereas language resources and technology already exist for most Western languages, there have not been many comparable efforts for CEE languages to develop a standard for corpus encoding and linguistic software.

Stemming

Since the grammar of many CEE languages is based on Old Slavic cases, nouns, pronouns and adjectives undergo declension, which means word endings change significantly. This is a major obstacle when translating from a CEE language, especially when applying any automation tool. Terminology in the non-English source text may not be properly recognized by CAT tools, so terminology lookup will not work efficiently. To overcome this, the terminology system would need to support stemming (morphological parsers). Some translation tools do not support stemming at all, or, if they do, only for a limited number of languages, typically the major ones. Alternatively, a language-independent technique based on an n-gram similarity approach can be used to deal with this particular issue.
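To make the n-gram idea more concrete, here is a minimal sketch of fuzzy term detection. It is not the method of any particular CAT tool: the function names, the use of character trigrams, the Dice coefficient and the 0.5 threshold are all assumptions chosen for illustration.

```python
def char_ngrams(word: str, n: int = 3) -> set:
    """Return the set of character n-grams of a word, padded with '#' boundary markers."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over character n-gram sets (0.0 = disjoint, 1.0 = identical)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def find_term(term: str, tokens: list, threshold: float = 0.5) -> list:
    """Return the tokens whose similarity to the term reaches the (assumed) threshold."""
    return [t for t in tokens if ngram_similarity(term, t) >= threshold]


# The nominative term "kód" is matched against the inflected form in "bez kódu".
print(find_term("kód", ["bez", "kódu"]))
```

Because no dictionary or morphological rules are involved, the same code works for any language; the price is that the threshold has to be tuned, and short words can produce false matches.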

Alternatively, we can talk about a partially language-dependent model if a different threshold for the fuzzy term detection is set for each language, as some languages have more inflections than others. For example, the Czech equivalent of the English term code (kód) won’t be recognized by MultiTerm in the genitive (bez kódu) (Figure 1). Otherwise, it would be necessary to create terminology entries for all inflected forms, which would be quite time-consuming. For this reason, many translators within the CEE region refrain from creating and using terminology in language tools. This problem also appears during QA performed on a translated text. Terminology checking is almost impossible in QA tools, since the ratio of false errors due to non-recognized terms caused by the inflection of words is simply too high. Some tools, such as ErrorSpy, provide the option to define a list of custom prefixes and suffixes to improve the stem recognition of terminology entries. We shouldn’t forget to mention concordance in TMs. Although some language tools support a search for words by stem only, most of them require the use of wild cards in order to find occurrences of inflected words in forms other than their basic form. Czech, for example, has seven cases. The English term torque, which is točivý moment in Czech, can therefore occur with various endings (Table 1).
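A wild card concordance search of the kind described here can be sketched as follows. The `*` syntax, the function names and the three Czech sample segments are assumptions made up for this illustration; real tools each have their own query syntax.

```python
import re


def wildcard_to_regex(query: str) -> re.Pattern:
    """Turn a query such as 'točiv* moment*' into a regex where '*' matches any word ending."""
    parts = [re.escape(part) for part in query.split("*")]
    return re.compile(r"\w*".join(parts), re.IGNORECASE)


def concordance(query: str, segments: list) -> list:
    """Return the segments that contain a match for the wild card query."""
    pattern = wildcard_to_regex(query)
    return [segment for segment in segments if pattern.search(segment)]


# Hypothetical TM segments: nominative/accusative, genitive, and an unrelated sentence.
tm_segments = [
    "Zkontrolujte točivý moment motoru.",
    "Hodnota točivého momentu je příliš vysoká.",
    "Motor je vypnutý.",
]

for hit in concordance("točiv* moment*", tm_segments):
    print(hit)
```

Searching for the literal string točivý moment would find only the first segment, whereas the wild card query also finds the genitive form in the second.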

Declension also has an impact on TM settings (Figure 1), which should be carefully considered before starting a project into CEE languages. The “allow multiple translations” option is recommended for most CEE languages. There are several reasons for this, such as the different plural forms required depending on the preceding numeral. For instance, if you have repeated segments in tables or bullet lists reading 2 years, 5 years and so on, the correct Czech translation of years for the numbers 2 to 4 is roky, while for 5 and above only let is correct. Therefore, both variants should be stored in the TM to avoid incorrect translation. Terminology variations for something as simple as name, which can be translated differently depending on whether name refers to a person or company, are another example.
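The numeral rule for years can be written down in a few lines. This is only a sketch for whole numbers in the nominative case; the function name is hypothetical, and the singular form rok for 1 is added here from general Czech grammar rather than from the example above.

```python
def czech_years(n: int) -> str:
    """Return a count of years with the Czech noun form that the numeral requires."""
    if n == 1:
        return f"{n} rok"   # singular
    if 2 <= n <= 4:
        return f"{n} roky"  # plural used with 2, 3 and 4
    return f"{n} let"       # genitive plural used with 0 and with 5 and above


for count in (1, 2, 4, 5, 10):
    print(czech_years(count))
```

A TM that stores only one target for the source word years cannot reproduce this distinction, which is why both variants need to be kept.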

Segmentation in CAT tools

Another example of an issue influencing translation efficiency in CEE languages using CAT is proper segmentation. This allows for the most efficient reuse of linguistic resources and makes the work of translators more productive by avoiding the need for them to spend time aligning the source segments due to improper segmentation.

CAT tools contain a default set of source language abbreviations, and they further define exceptions that are not to be considered as segment separators when segmenting text. This list, however, is by no means complete; it is rather a collection of the most common cases. In order to reach a good level of proper source text segmentation, the abbreviations list has to be customized and extended to contain the most frequently used examples in all target languages. Again, this is more important if you translate from CEE languages as a source.

Some of the tools allow these adjustments to be made on the server version of the language translation technology being used (such as Across), and these settings are implemented in the project and passed on to linguists with the translation packet. For some other tools, these changes can only be made on personal editions (such as SDL Trados), which requires a lot of communication and close cooperation with all involved linguists to ensure all these settings are agreed upon. We can demonstrate this issue with Russian name abbreviations, such as P. I. Tchaikovsky. Without any adjustments, the segment “Preview, buy and download P. I. Tchaikovsky — Romances” will be split into two segments: “Preview, buy and download P. I.” and “Tchaikovsky — Romances.” This may cause linguists extra work with excess mergers and may result in the inability to pretranslate files, auto-propagate and populate segments properly.
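The effect of an abbreviation exception can be illustrated with a deliberately naive segmenter. It is a sketch, not the rule set of Across or SDL Trados: the break rule (sentence punctuation, whitespace, then an uppercase letter), the `abbreviations` parameter and the `protect_initials` switch are all assumptions. Note that this naive rule splits the sample even more aggressively than the tool behavior described above, into three pieces instead of two.

```python
import re

# Candidate break: whitespace that follows '.', '!' or '?'.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# A single letter followed by a period, such as the initial "P."
INITIAL = re.compile(r"[^\W\d_]\.")


def segment(text: str, abbreviations=frozenset(), protect_initials: bool = False) -> list:
    """Split text into segments, skipping breaks after listed abbreviations or initials."""
    segments = []
    start = 0
    for match in BOUNDARY.finditer(text):
        next_char = text[match.end():match.end() + 1]
        if not next_char.isupper():
            continue  # only break before an uppercase letter
        last_token = text[start:match.start()].split()[-1]
        if last_token in abbreviations:
            continue  # exception from the customized abbreviation list
        if protect_initials and INITIAL.fullmatch(last_token):
            continue  # exception for name initials
        segments.append(text[start:match.start()])
        start = match.end()
    if text[start:]:
        segments.append(text[start:])
    return segments


sample = "Preview, buy and download P. I. Tchaikovsky — Romances"
print(segment(sample))                         # broken up at the initials
print(segment(sample, protect_initials=True))  # kept as a single segment
```

Protecting all single-letter initials is a trade-off: a sentence that really ends in a single letter would no longer be split, which is one reason such rules are best reviewed by in-country linguists.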

If one is not aware of these issues and care is not taken over the proper customization of language technology with the help of local in-country language specialists, the translation process can suffer from productivity losses, and linguists may be constrained in the use of the technology. Customization of segmentation rules can also be helpful on a project basis. For instance, when you translate labels with ingredients for your client, it may be more efficient to set a segmentation rule for this specific project to include commas and brackets as delimiters so that each individual ingredient becomes a segment. This may significantly speed up the translation process and make it more consistent and cost-effective. Having professional technical support for your project manager as well as clients from an experienced CAT tool specialist is a must these days, and the role of such experts is becoming significantly more important and necessary within large translation companies.
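As an illustration of the project-specific rule for ingredient labels, the following sketch splits a label at commas and brackets. The pattern, the function name and the sample label are hypothetical; in a real CAT tool the equivalent rule would be configured in its segmentation settings.

```python
import re

# Commas and round brackets act as delimiters; surrounding whitespace is dropped.
DELIMITERS = re.compile(r"\s*[,()]\s*")


def split_ingredients(label: str) -> list:
    """Split an ingredient label so that each ingredient becomes its own segment."""
    return [part for part in DELIMITERS.split(label.strip()) if part]


label = "Sugar, cocoa butter, emulsifier (soy lecithin), vanilla extract"
for ingredient in split_ingredients(label):
    print(ingredient)
```

With one ingredient per segment, an ingredient translated once can be reused from the TM wherever it appears, regardless of the order in which a given label lists it.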

Locales and formats

CAT tools usually automatically adapt the format of dates, numbers, times and measurements to suit the target language, which normally increases the productivity of linguists. Some tools rely on the regional settings defined in the Windows setup of the target language (SDL Trados), yet in others, the rules and conventions for date, time, number and measurement are defined in system settings (Across) and might require custom configuration for some categories or languages. We always recommend checking these settings before the start of the project to save time and trouble. Improper settings can cause numerous issues in a file or project if they need to be fixed after a project has been completed. The standard regional settings of some languages are not correct from a linguistic point of view and have to be adjusted before the translation process begins. For example, the default date setting in the Czech regional settings in the Windows setup is DD.MM.RRRR; however, the correct format should include spaces: DD. MM. RRRR.
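Where a tool cannot be configured in advance, a check of this kind can also be run on the translated text. The sketch below rewrites Czech dates in the unspaced DD.MM.RRRR pattern to the spaced form; the regular expression and function name are assumptions, and the pattern would also catch other digit groups of the same shape, so the result should be reviewed rather than applied blindly.

```python
import re

# Day and month (one or two digits) and a four-digit year, joined by periods without spaces.
UNSPACED_DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def space_czech_dates(text: str) -> str:
    """Insert a space after each period in dates written as DD.MM.RRRR."""
    return UNSPACED_DATE.sub(r"\1. \2. \3", text)


print(space_czech_dates("Datum: 28.02.2012"))
```

Dates that already contain the spaces are left unchanged, so the function can safely be run more than once on the same file.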

In most translation tools you can, besides the regional settings, opt for using Microsoft Word spellcheckers. However, the standard Microsoft Office installation only contains a spellcheck function for a limited number of languages — approximately four. If you internally deal with more languages and, as an example, run a final in-house check after the translator and reviewer have completed their work as a part of your standard QA procedures, you should always have Microsoft Proofing Tools with spellcheckers for multiple languages.

One final factor that needs particular consideration is that local subsidiaries regularly use English master templates to create their own documents. If these documents need to be translated into other languages, it is necessary to check whether the correct language is set for the text throughout the whole document. Otherwise, the CAT tool may set the language of the translation unit according to the language set in the document, which can lead to later problems with localization. A similar issue can often be encountered in the different layers of InDesign files.

In summary, many small details and settings have to be checked and taken into account when dealing with the technical preparation of translations both from and into CEE languages. Generally speaking, we can say that the tool providers are mostly aware of the accompanying issues and have promised to implement improvements in future versions.