Creating translation-oriented source documents

By Nicole Keller September 28, 2011

Here are some tips for preparing source documents for translation while they are being written. A well-prepared, cleanly formatted source document can save considerable time and money during translation with a translation memory (TM) system, since the system's match recognition only pays off if the segments to be translated are actually identical or similar to segments that are already stored. These rules are derived from practical experience. At first glance they may appear insignificant to the author of a text, but for the translator, some of these things present considerable problems.

PDF files vs. original file formats

Whenever possible, avoid using PDF files as the source document format for translation. Always try to provide the original file from which the PDF was created, since PDF files cannot currently be edited in some programs and instead have to be converted into another format (usually Word) before translation. The converted documents must generally be edited again before translation, since the converted text usually contains too many formatting errors to be translated sensibly with a TM system. This editing always costs additional time and money and delays the start of the translation.

Hard line breaks

Avoid hard line breaks (paragraph marks) within sentences; otherwise, no sensible segments can be offered for translation. Line breaks should only be used where a new paragraph actually starts. TM systems use segment end delimiters to decide where a translation unit (normally a sentence) ends. These characters are generally ., !, ? and ¶. A line break is always detected as a segment end, so a line break within a sentence splits it into two segments that have to be merged manually. Manual adaptation by the translator requires additional time, and the initial analysis will find fewer hits in the TM (matches), which will make the translation unnecessarily expensive. Frequently, hard line breaks in PowerPoint or in desktop publishing programs are put in the wrong place because authors do not have sufficient knowledge about how to work with these programs. Figure 1 shows a sentence that contains an incorrect line break because the text was copied from a PDF file. The sentence is subdivided into two illogical segments as a result, and manual adaptation by the translator is required.
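The effect can be reproduced with a minimal Python sketch. It assumes a very simple segmenter that splits at ., ! and ? followed by whitespace, and at every paragraph mark (represented here as "\n"). The function name and sample sentences are illustrative only; real TM systems use more elaborate, configurable rules.

```python
import re
from typing import List

# Assumed segment end delimiters: sentence punctuation followed by
# whitespace, or a paragraph mark (represented here as "\n").
SEGMENT_END = re.compile(r"(?<=[.!?])\s+|\n")


def segment(text: str) -> List[str]:
    """Split text into segments at sentence punctuation or paragraph marks."""
    return [part.strip() for part in SEGMENT_END.split(text) if part.strip()]


clean = "Press the green button to start the machine. Wait for the signal."
# Same text, but with a hard line break inside the first sentence.
broken = "Press the green button to\nstart the machine. Wait for the signal."

print(segment(clean))   # two segments, one per sentence
print(segment(broken))  # three segments, the first sentence is cut in two
```

Neither half of the split sentence matches a stored translation of the complete sentence, which is why the analysis reports fewer matches.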

Soft line breaks

Soft line breaks (Shift+Enter) should also be avoided. TM systems do not interpret them as segment ends, which is why such units are not detected correctly and have to be reworked manually by the translator. Figure 2 shows a text sample that contains a soft line break at the end of each bullet point, which causes the whole text to be offered as a single unit for translation. Soft line breaks are frequently inserted unintentionally when text is copied from various applications into source documents. This happens quite often if the text to be translated is copied from an e-mail into a Word document, for example.
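A pre-flight check for soft line breaks could look like the following sketch. It assumes that the document has already been exported to plain text and that soft line breaks arrive as a vertical tab (U+000B) or as the Unicode LINE SEPARATOR (U+2028). Which character actually appears depends on the application and export path, so treat both as assumptions. The function name and sample text are hypothetical.

```python
# Assumption: soft line breaks arrive as vertical tab (U+000B) or
# LINE SEPARATOR (U+2028); the actual character depends on the export path.
SOFT_BREAKS = ("\u000b", "\u2028")


def find_soft_breaks(text):
    """Return (paragraph_number, character_offset) for every soft line break."""
    hits = []
    # split("\n") is used on purpose: str.splitlines() would also split at
    # the soft break characters and hide them from the check.
    for number, paragraph in enumerate(text.split("\n"), start=1):
        for offset, char in enumerate(paragraph):
            if char in SOFT_BREAKS:
                hits.append((number, offset))
    return hits


sample = "Scope of delivery:\n- Power cable\u000b- Manual\u000b- Battery\nThank you."
print(find_soft_breaks(sample))
```

Each reported position marks a place where the author should decide between a real paragraph mark and no break at all.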

Manual page break

Manual page breaks are very often inserted for formatting purposes, for example because a headline falls at the bottom of a page. To improve the layout and the readability of the text, the author inserts a manual page break at a specific location. However, during translation, texts usually grow or shrink in length depending on the language combination, so it is very unlikely that the manual page breaks from the source text belong in the same location in the target text. Usually the manual page breaks are not “translated,” but are skipped during translation. They are inserted into the final version of the text after the translation is finished and the text has been converted back into its original document format.

Blank spaces and tabs

Use tabs or indents to indent text rather than a series of blank spaces. After the document is read into a TM system, these characters are all displayed. In 99% of cases, the translation with the same blank spaces will look different from the source document, so the text has to be reworked after the translation in nearly every case. Compare a sample text as it might appear in Word (Figure 3) with similar sample text as it might appear in crossDesk (Figure 4). While working, a translator can hardly tell whether the blank spaces actually have a particular function or whether they only serve formatting purposes. If we assume, for example, that the translator does not delete the blank spaces from the translation and tries to put them in the same place in the target text as in the source document, then after the export, the translation would look like the data in Figure 5.
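Runs of blank spaces can be found before the document goes to translation. The following sketch assumes plain-text input and treats any run of two or more consecutive spaces as suspicious; the threshold, the function name and the sample text are illustrative choices, not rules of any particular TM system.

```python
import re

# Assumption: two or more consecutive spaces indicate space-based
# indentation or alignment; tabs are left alone.
SPACE_RUN = re.compile(r" {2,}")


def find_space_runs(text):
    """Return (line_number, column, run_length) for each run of 2+ spaces."""
    hits = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        for match in SPACE_RUN.finditer(line):
            hits.append((line_number, match.start(), len(match.group())))
    return hits


sample = "Voltage:     230 V\nWeight:\t12 kg\n    Indented with spaces"
for line_number, column, length in find_space_runs(sample):
    print(f"line {line_number}, column {column}: {length} spaces")
```

The line that uses a tab is not reported, while both the aligned value and the indented line are flagged for the author to replace with tabs or indents.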

Date, time and number formats

TM systems detect date, time and number formats according to specified rules. In the Across system settings, for example, it is specified whether date formats in German have the format DD.MM.YYYY or DD.MM.YY. If there is a blank space between the numbers, the date is no longer recognized as a coherent number group, and it cannot be checked for correct usage in the translation. This happens frequently with dates and numbers in the thousands. See, as an example, 08. 10. 2011 vs. 08.10.2011 (Figure 6) and 5.000 vs. 5 000 (Figure 7). The red lines indicate to the translator that there is a number and which number areas have been detected as coherent units. If the red line is interrupted, several units were detected. In this case, no sensible check can be made that the number formats were carried over correctly. For texts that contain many date, time and number formats, it is therefore recommended that you specify a uniform format and use it consistently.
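The following sketch illustrates why a blank space breaks the detection. The two regular expressions are assumptions made for this example (digits joined by periods form one unit; a German date is DD.MM.YYYY or DD.MM.YY) and do not reproduce the actual rules configured in Across or any other TM system.

```python
import re

# Assumed rule: digits joined by periods, without blank spaces, form one unit.
NUMBER_GROUP = re.compile(r"\d+(?:\.\d+)*")
# Assumed German date formats: DD.MM.YYYY or DD.MM.YY.
DATE_DE = re.compile(r"\d{2}\.\d{2}\.(?:\d{4}|\d{2})")


def coherent_units(text):
    """Return the number groups that would be detected as single units."""
    return NUMBER_GROUP.findall(text)


def is_german_date(unit):
    """Check whether a detected unit as a whole has an assumed date format."""
    return DATE_DE.fullmatch(unit) is not None


for sample in ("08.10.2011", "08. 10. 2011", "5.000", "5 000"):
    units = coherent_units(sample)
    dates = [unit for unit in units if is_german_date(unit)]
    print(f"{sample!r}: units={units}, dates={dates}")
```

With the blank spaces, the date falls apart into three separate units, none of which is a date, and 5 000 is seen as two unrelated numbers.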

Uniform spelling of specialized terminology

The uniform spelling of specialized terminology is essential for correct terminology detection. In German, for example, writing a term as one word (Diabetesbehandlung), as two words (Diabetes Behandlung) or connecting two words with a hyphen (Diabetes-Behandlung) is a frequent cause of inconsistent translation since the automatic terminology detection is not activated in these cases.

Figure 8 shows an example of correct spelling. The red marking indicates that the term is present in the crossTerm terminology database. The stored translation is suggested to the translator, who can then insert this suggestion directly into the translation. Figure 9, however, shows two examples of alternative or incorrect spelling. The system does not recognize the specialized term, and the terminology window remains empty. Thus, the translator does not know that there is existing terminology information and may translate the term inconsistently or incorrectly.
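Spelling variants of this kind can be caught by an author-side check before translation. The sketch below assumes a tiny, hypothetical term list with a single entry and compares candidates after removing hyphens, spaces and case differences. It is not how crossTerm works internally; it only shows how variants of a known term could be flagged.

```python
import re

# Hypothetical termbase with a single entry, for illustration only.
TERMBASE = {"Diabetesbehandlung": "diabetes treatment"}


def normalize(term):
    """Drop hyphens and spaces and ignore case so variants compare equal."""
    return re.sub(r"[-\s]", "", term).casefold()


NORMALIZED = {normalize(term): term for term in TERMBASE}


def check_spelling(candidate):
    """Return ("match" | "variant" | "unknown", stored term or None)."""
    if candidate in TERMBASE:
        return ("match", candidate)
    stored = NORMALIZED.get(normalize(candidate))
    if stored is not None:
        return ("variant", stored)
    return ("unknown", None)


for word in ("Diabetesbehandlung", "Diabetes Behandlung", "Diabetes-Behandlung"):
    print(word, "->", check_spelling(word))
```

An exact-match lookup only finds the first spelling; the other two are reported as variants so that the author can unify them.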

Usage of correct abbreviations

Always use the correct spelling of abbreviations. If you use different abbreviations for one and the same word, the segmentation of the sentences may be incorrect. For example, if you look up the word approximately, you will find app., approx. and apx. as abbreviations. In Across, app. and approx. are defined as abbreviations for approximately, but if you use apx., the period is treated as a sentence end, the sentence is subdivided into two segments, and manual adaptation by the translator is required.
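The role of the abbreviation list can be shown with a small sketch. It assumes a segmenter that ends a segment at every token ending in ., ! or ? unless the token is in a list of known abbreviations. The list mirrors the example above; the function name and sample sentences are made up for illustration.

```python
# Assumed abbreviation list, mirroring the example in the text.
KNOWN_ABBREVIATIONS = {"app.", "approx."}


def segment_with_abbreviations(text, abbreviations=KNOWN_ABBREVIATIONS):
    """Split text into segments, ignoring periods of known abbreviations."""
    segments = []
    current = []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        if ends_sentence and token.lower() not in abbreviations:
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments


print(segment_with_abbreviations("The cable is approx. 3 m long."))
print(segment_with_abbreviations("The cable is apx. 3 m long."))
```

The first sentence stays in one piece, while the unknown abbreviation in the second sentence produces two segments.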

Superfluous formatting

If you frequently use colored marking to visually emphasize text passages, make sure that it has been removed completely before translation, including from line breaks and blank spaces, where leftover formatting is easy to overlook. Otherwise, this “invisible” formatting is offered to the translator as possible formatting and may result in the translator unintentionally writing in the wrong font or color.

Hyphenation

If you want to use hyphenation in your text, make sure that you either use the “automatic hyphenation” function or insert an optional hyphen manually. Many people just insert a normal hyphen instead. In this case the translator and the TM system face the following problems: standard hyphens are recognized as normal characters and add an additional character to the word in which they are placed. This means that the translation unit stored in the TM will not be a 100% match even if exactly the same sentence appears again without hyphenation. Additionally, terminology recognition for this specific term will not work since the term is separated by an incorrect character. Compare the first sample sentence, with correct hyphenation in Word (Figure 10), with the second sample sentence, with incorrect hyphenation in Word (Figure 11).
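The difference between the two kinds of hyphen can be sketched as follows. The example assumes that an optional hyphen reaches the comparison as the Unicode soft hyphen (U+00AD) and that it can be stripped before matching; how a given TM system handles this internally is not specified here. The sample sentences are invented, and the similarity figure comes from Python's difflib, not from any TM system's match algorithm.

```python
from difflib import SequenceMatcher

# Assumption: optional hyphens are represented as U+00AD (soft hyphen).
SOFT_HYPHEN = "\u00ad"


def normalize(segment):
    """Remove optional (soft) hyphens, which carry no content."""
    return segment.replace(SOFT_HYPHEN, "")


stored = "The diabetes treatment was successful."
with_soft = "The diabetes treat\u00adment was successful."
with_hard = "The diabetes treat-ment was successful."

for label, candidate in (("optional hyphen", with_soft), ("normal hyphen", with_hard)):
    cleaned = normalize(candidate)
    similarity = SequenceMatcher(None, stored, cleaned).ratio()
    print(f"{label}: exact match={cleaned == stored}, similarity={similarity:.3f}")
```

The sentence with the optional hyphen is identical to the stored one after normalization, whereas the normal hyphen remains in the text as an extra character and prevents an exact match.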