Translating from multiple source languages using fuzzy matching

The basic advantages and disadvantages of a database filled with previously translated sentences (in other words, a translation memory, or TM) become obvious quickly when working with any translation tool. It can, for example, be a great help with repetitive texts, but the quality of its entries is crucial. Many papers and essays have been written about the merits and problems of TM systems, so I will not address those in this article.

TMs have been around for quite some time, and even though machine translation seems to be the more popular research topic, I came across an interesting use for the good old TM while thinking about possible topics for my master's thesis. The idea is based on the fact that every TM system basically consists of a database containing the segment pairs and an algorithm capable of comparing the contents of that database with the current source segment. This comparison is crucial; its implementation varies between translation tools and is not openly documented for commercial ones. One open example is the free translation tool OmegaT, whose matching is essentially based on letter-by-letter comparison and uses a few additional features (such as a built-in tokenizer) to improve the results.
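To make the letter-by-letter idea concrete, here is a minimal sketch of such a comparison, using Python's difflib as a stand-in for a tool's proprietary algorithm; the TM pairs and the threshold handling are made up for illustration.

```python
# Minimal sketch of character-level fuzzy matching, with difflib as a
# stand-in for a tool's proprietary algorithm; the TM pairs are made up.
from difflib import SequenceMatcher

def match_rate(a: str, b: str) -> int:
    """Rough percentage similarity between two segments."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# A tiny hypothetical TM: source segments mapped to their translations.
tm = {
    "Save the file as a PDF.": "Lagre filen som PDF.",
    "Restart the computer.": "Start datamaskinen på nytt.",
}

source = "Save the file as an HTML page."
# Keep only suggestions at or above a 50% minimal match rate.
suggestions = {
    entry: target
    for entry, target in tm.items()
    if match_rate(source, entry) >= 50
}
```

Real tools refine this basic scheme with tokenization, tag handling and weighting, which is why their match percentages differ from one another for the same segment pair.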

All of those more or less sophisticated algorithms do, however, have one feature in common: they cannot recognize the languages in the TM database by any means other than looking up the source and target languages declared in the header and body of the respective TM file. This led me to test what would happen if the source language in the translation memory differed from that of the current source text. The most obvious test languages for me were Danish, Norwegian and Swedish, since I understand all three quite well, mostly because they are very similar to each other. Of the three, I only formally learned Norwegian, but a bit of practice enabled me to get along in the other two as well, especially in their written form.

The experiment became part of my master's thesis, and I had the chance to compare different translation tools while testing it. That comparison will not be part of this article; I will concentrate on the general idea of using different languages as resources. As could be expected, different tools produce different results, but the main outcome was essentially the same. In this article, I will only use SDL Trados Studio 2014, mainly because it presents the differences between a translation memory match and the current source segment clearly enough for screenshots.

So, what may this comparison algorithm be used for, other than the usual lookup of similar segments? The actual use is not very different, but in my experiments I changed the input translation memory files so that their source language differed from the one that was to be translated. The target languages matched, however. Figure 1 shows an example of such a combination.

This may work in cases where the two source languages are quite similar, so that the matching algorithm is “tricked” into cancelling out existing differences between the languages, just as it would when comparing slightly reworded segments or segments that differ by only a few words or grammatical forms.

In order to illustrate this, I set up a translation project with a little sample text from a website that provides teaching materials for inter-Scandinavian communication exercises at Scandinavian schools. The site is available in Danish, Swedish, Norwegian (Bokmål and Nynorsk), Finnish and Icelandic and thus provides some good parallel texts. After translating the Danish version into English, I ended up with a TM that contained all the Danish and English segments (Figure 2). In order to test whether I could use this TM for the same Norwegian source text (still for translation into English), I had to “mask” the translation memory to make it fit a Norwegian/English translation project. This basically involves the following steps:

1. Exporting the translation memory from the Danish/English project into a TMX file

2. Searching and replacing the language codes in the TMX file with the proper language code for the new project. In this instance, I changed:

In the TMX header: srclang="da-DK" → srclang="nb-NO"

In the TMX body: <tuv xml:lang="da-DK"> → <tuv xml:lang="nb-NO">

3. Saving the new TM file with a meaningful name (to avoid confusion)

Exporting as TMX is the easiest and most versatile way to edit a TM file, as different translation tools use many different file formats. There certainly are other ways, but I found this method to work with many different tools, so I chose to stick with it.
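The search-and-replace in step 2 can also be scripted. The sketch below rewrites the srclang and xml:lang attributes in a small inline TMX fragment (a stand-in for the real exported file); any text editor's search-and-replace achieves the same result.

```python
# Sketch of step 2: rewriting the language codes of an exported TMX file.
# A plain search-and-replace in a text editor works just as well.
import re

def mask_tmx(tmx_text: str, old_code: str, new_code: str) -> str:
    """Swap old_code for new_code in srclang and xml:lang attributes."""
    pattern = re.compile(r'((?:srclang|xml:lang)=")' + re.escape(old_code) + r'"')
    return pattern.sub(r'\g<1>' + new_code + '"', tmx_text)

# A minimal TMX fragment standing in for the real exported file.
tmx = (
    '<header srclang="da-DK" adminlang="en-US"/>'
    '<tu><tuv xml:lang="da-DK"><seg>Hvorfor?</seg></tuv>'
    '<tuv xml:lang="en-US"><seg>Why?</seg></tuv></tu>'
)
masked = mask_tmx(tmx, "da-DK", "nb-NO")
```

Restricting the pattern to the two attribute names keeps the target-language codes and the segment text untouched.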

After creating this second TMX file, I just needed to import it into my newly created Norwegian/English translation project (which, in this ideal example, contains the identical text in Norwegian). To the translation tool, it now looks like a matching Norwegian/English TM, although the source segments it contains are still Danish.

Now for the actual test: how much of the Norwegian text can be translated using the Danish/English translation memory? In order to see more matches, I set the minimal match rate for TM suggestions to 50%. Only one segment in this test scenario did not yield a TM suggestion, as Figure 3 shows.

Cross-referencing a TM file from another project with different language pairs worked quite well in the case of Danish and Norwegian. In this ideal test scenario, the source texts were close translations of each other and the plain text did not pose any formatting challenges, which would lead to lower match values.

There are a few limitations that need to be considered and which may be the reason why such cross-language applications do not seem to have been discussed prior to this:

It only makes sense to use differing source (not target) languages, because using the same source language and different target languages would mean that every target segment with a match, even a 100% match, would need to be edited. This would easily lead to mistakes; the translator would not only have a lot of editing work, but the more similar the two target languages are, the more confusing it becomes to make the correct changes and produce a natural-sounding target text.

The differing source languages will lead to lower match rates and may sometimes conceal textual differences. Lowering the minimal match rate too much would, on the other hand, produce a lot of unhelpful matches, while allowing only very high match rates may yield almost no suggestions at all, which would not be helpful either. A minimal match rate of 50% seems like a good compromise.

Naturally, only languages with very similar written forms are worth testing. The language that is “masked” to fit the current project will be treated as if it actually were the other source language, so phonetically similar languages only produce results if they are similar in writing as well.

As a result of these limitations, possible applications (outside the university context) are probably few. According to my research, there don’t seem to be any current publications that consider using translation memories with differing source languages.

It is, however, a very interesting way to understand TM systems better and to analyze languages and translation tools from a different angle. One instance in which this kind of cross-referencing may actually prove quite useful is the creation of terminology databases from previously translated texts. Provided the terminology of a specialized field is quite similar across two or more languages, a list could be translated in one language pair and then transferred to another, similar source language by using the TM from the first translation as a terminology database. Similar source terms could be recognized by the fuzzy matching algorithm, and the corresponding translation could be applied across multiple source languages without a matching terminology database being available.

This is the only application of the method that seems practical to me. The drawback of possible false matches is minimized with smaller segments, as they are generally quicker and easier to check. But there is one problem when trying to use a TM as a terminology database: the translation tools would not handle its contents as terminology entries but as normal TM segments. The following small example shows the difference:

Norwegian: Hvorfor er nabospråk viktig?

Danish: Hvorfor er nabosprog vigtigt?

English: Why are neighboring languages important?

In the case of a TM, each of the above sentences (the Norwegian or the Danish one, paired with the English sentence) would be treated as one entity in the matching process. Norwegian and Danish are so close to each other that it would in this case be possible to cross-reference the whole sentence. But what would happen if a translation memory contained only the term “neighboring languages” (marked in yellow) and one corresponding translation? This would result in a much worse match, as the whole segment would be considered, and the chances of finding a match would be even worse with longer segments in the source document. It is therefore unrealistic to use TM files directly as terminology databases, even without cross-referencing another language.
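The gap can be sketched with the same kind of character-level comparison, again using difflib as a stand-in for a tool's actual matching algorithm:

```python
# Sketch of why a lone term in a TM matches a full segment poorly,
# with difflib standing in for a tool's matching algorithm.
from difflib import SequenceMatcher

def match_rate(a: str, b: str) -> int:
    """Rough percentage similarity between two segments."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

norwegian = "Hvorfor er nabospråk viktig?"
danish_sentence = "Hvorfor er nabosprog vigtigt?"
danish_term = "nabosprog"

# The whole Danish sentence cross-references well against the Norwegian
# one, while the bare term falls below a 50% threshold, because the
# comparison always considers the full segment.
sentence_rate = match_rate(norwegian, danish_sentence)
term_rate = match_rate(norwegian, danish_term)
```

The longer the source segments, the further the term-only rate drops, which is why TM matching cannot substitute for genuine term recognition.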

Converting TM files from a translation tool's native format (usually some flavor of XML) into a format that can be imported into a terminology database is not an easy procedure. The exchange formats TMX (for translation memory files) and TBX (for terminology files) are, unsurprisingly, not interchangeable.

It would certainly be possible to convert the contents of a TM into a table or plain text format by manually editing or copying relevant contents into a table or other format which could be used as basis for terminology imports. However, none of the ways I tested this was very practical. It would therefore be more promising to use terminology files from the start and to edit them analogously to TMX files, changing the language designations as necessary and importing “masked” TBX files. I have not yet tested this in the same way as TMX files, but as TBX files also contain the attribute xml:lang [3], it would be surprising if translation tools were more difficult to “deceive” with terminology files than with translation memory files. The result should be a terminology database that contains terms in another source language but the same target language. If the source languages are similar enough or the entries are internationalisms or technical terms, this “masked” terminology database might still produce some helpful results.
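A TBX masking step might look like the sketch below, which parses the XML and rewrites every xml:lang attribute (here da-DK to nb-NO). The TBX fragment is made up and heavily simplified, and, as noted above, I have not verified this approach against real tools.

```python
# Hypothetical sketch of "masking" a TBX file by rewriting its xml:lang
# attributes; the TBX fragment below is made up and simplified.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # i.e. xml:lang

def mask_langs(root: ET.Element, old_code: str, new_code: str) -> None:
    """Replace every xml:lang attribute equal to old_code."""
    for element in root.iter():
        if element.get(XML_LANG) == old_code:
            element.set(XML_LANG, new_code)

sample = (
    '<martif type="TBX"><text><body><termEntry>'
    '<langSet xml:lang="da-DK"><tig><term>nabosprog</term></tig></langSet>'
    '<langSet xml:lang="en-US"><tig><term>neighboring languages</term></tig>'
    '</langSet></termEntry></body></text></martif>'
)
root = ET.fromstring(sample)
mask_langs(root, "da-DK", "nb-NO")
masked = ET.tostring(root, encoding="unicode")
```

Parsing the file instead of doing raw text replacement guarantees the result stays well-formed XML, which matters more for TBX since language codes can appear on several element types.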

The most important outcome of conducting these tests and dealing with the different exchange formats has, however, not been an improved translation workflow through cross-referencing languages. It has rather been a great way to become familiar with those file formats and with three translation tools that I work with on a daily basis. The chance to get creative with those tools during my studies has made me more comfortable working with different tools and more aware of some of the basic functionalities a freelance translator uses every day. It may not be necessary to know exactly how a translation tool works as long as it does so immaculately, especially if all tasks are prepared by project managers and sent as translation packages. A few months of real-life experience have, however, shown that this would be a very optimistic expectation. Knowing the tools of the trade and being able to solve technical problems are key to using translation tools effectively, and testing their limits has been a very interesting and instructive way for me to learn more about the technical aspects of translation.