Corpora and life sciences translation

Since each language has its own scope of meanings for lexical units, grammatical categories and syntactic structures, translators face many discrepancies at different linguistic and extra-linguistic levels. To resolve them, translators have traditionally used dictionaries, reference books, newspaper articles and, more recently, the internet.

If you obsess over improving your translations, question the correct way to translate certain parts of texts, don’t trust suggestions provided by dictionaries, and search for arguments and evidence to support your final decisions, then you are probably an advanced translator. Among many other technologies, you may be interested in a linguistic corpus. If you need your translations to be highly accurate and to follow certain established parameters, as in pharmaceutical translation, then a corpus may be a good place to start. This is particularly true for translators who do not have access to other tools such as established translation memories (TMs), either for professional reasons or because the language pair is rare and not supported.

What is a corpus?

Language is a focus of interest for many scientists. To save time and effort when investigating particular features of a language, in both its written and spoken forms, linguists started using corpora. In other words, they created data-rich banks of texts. Those banks include linguistic and extra-linguistic information. Special programs allow linguists to find the data they need for further analysis within minutes.

Let’s have a look at the pluses and minuses of using a corpus.

+ Every user can create their own corpus or corpora depending on the project. For example, if I work in the legal field, I can create my own legal corpus. If I sometimes do pharmaceutical translation, then I can create a corpus for that.

+ A corpus can function as a virtual “language speaker.” It can include original texts of different spheres and styles produced by native speakers. It will make you “feel” the language as it is actually used.

+ A corpus can include as many peculiarities of the language as you wish. It depends on which linguistic information you add to it.

+ You can determine just how large (or small) you want your corpus to be.

+ It can provide you with statistics for usage of certain words, word combinations and phrases in context. Therefore, you can obtain objective evidence, which can help you with your final translation decision.

+ Using corpora allows you to check your intuition.

+ You can use corpora in addition to conventional tools such as Trados. They serve as models of real language use and can dispel doubts you may have about different aspects of a translation.

Along with these benefits, there are some challenges:

- You cannot include all existing texts. This would take a huge amount of time and effort, if it were even possible.

- It may take a fair amount of time to create a corpus.

- You have to be attentive when adding information about texts, such as annotation.


How to create your corpus

A linguistic corpus is a set of electronic texts brought together for a common purpose. It is important to select and arrange these texts according to the following criteria:

1. A corpus must be representative. It has to represent the subject matter, including style and genre where applicable. 

2. A corpus must be authentic. It has to include written or oral texts produced by native speakers who know the subject matter — and who work with it in a professional capacity for subjects such as pharmaceuticals.

3. A corpus must be sampled. You cannot include all existing texts in the corpus. Depending on the chosen strategy, the type of corpus and its purpose, you have to decide which texts to include.

4. A corpus must be balanced. The different kinds of textual resources have to be represented proportionally. It is necessary to differentiate between literary and nonliterary texts; between books, magazines and newspapers; and between standard and non-standard language; and to control for the age, sex and origin of the authors.

5. A corpus must be machine-readable. The texts have to be stored electronically so that software can search them.

There are many types of corpora in linguistics. The types most relevant to translators are parallel and comparable corpora. A parallel corpus is a set of texts in the original language together with their translations, aligned with each other. A comparable corpus is a set of monolingual texts that can serve as a reference for how a specific language unit is used in communication. To build a parallel corpus, you can use the Trados Studio WinAlign tool or ABBYY Aligner for alignment. These applications are quite user-friendly. Once the automatic alignment is finished, it is better to check the results yourself, sentence by sentence.

To carry out a linguistic analysis of particular texts, you can add further information in the form of linguistic annotation, such as:

• Morphological annotation or part-of-speech tagging (POS tagging), which records grammatical categories. One of the best-known and most reliable taggers for English is CLAWS (Constituent Likelihood Automatic Word-tagging System), which is reported to achieve about 97% accuracy.

• Syntactic annotation, which describes syntactic connections between lexical units and different syntactic structures (such as subordinate clauses and verb collocations).

• Semantic annotation, which records the semantic categories a word or word combination belongs to, along with narrower subcategories that specify its meaning.

• Prosodic annotation, which describes emphasis (prosody) and intonation. If you build a corpus of oral texts, this type of annotation also includes discourse annotation, which can be used to mark pauses, repetitions, warnings and so on.

To annotate texts, you can use a program such as Grammarscope, which can be easily downloaded to your PC.
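To give an idea of what annotated text can look like once it is saved as plain text, here is a minimal Python sketch. It assumes that each token is written as word_TAG, a common plain-text convention; the sample sentence, the tag names and the helper functions are made up for illustration and are not the output of any particular tagger.

```python
# Hypothetical tagged sentence in word_TAG form; the tags are illustrative only.
TAGGED = "The_AT0 virus_NN1 causes_VVZ severe_AJ0 illness_NN1 ._PUN"


def parse_tagged(line):
    """Split 'word_TAG' items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last underscore so words containing "_" survive.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


def words_with_tag(pairs, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


pairs = parse_tagged(TAGGED)
nouns = words_with_tag(pairs, "NN")  # "NN" is assumed here to mark nouns
print(nouns)
```

Once the tags are separated from the words in this way, a search such as "all nouns in the corpus" becomes a simple filter.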

Moreover, a corpus can include extra-linguistic data such as the language, author, translator, year of publication, year of translation, title of the text and so on. You will need to place each text in a separate .txt file.
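One simple way to keep such extra-linguistic data is a small record stored next to each text file. The Python sketch below is only an illustration: the field names, the TextMetadata class and the sample values are assumptions chosen for this example, not a format required by any corpus tool.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TextMetadata:
    """Extra-linguistic data for one text (hypothetical field names)."""
    title: str
    language: str
    author: str
    edition_year: int
    translator: Optional[str] = None        # empty for original texts
    translation_year: Optional[int] = None  # empty for original texts


def to_json(meta):
    """Serialize the record; ensure_ascii=False keeps non-Latin text readable."""
    return json.dumps(asdict(meta), ensure_ascii=False, indent=2)


# Placeholder values for illustration only.
meta = TextMetadata(
    title="Sample text",
    language="en",
    author="Unknown",
    edition_year=2020,
)
record = to_json(meta)
print(record)
```

The resulting text could be saved, for instance, in a file with the same name as the .txt file and a different extension, so that the text and its description stay together.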

To create a corpus, you can also use special programs such as UAM Corpus Tool or Sketch Engine. The first one can be downloaded and used for free. Sketch Engine can be used free of charge for one month; after that, there are paid plans to choose from depending on your needs and purposes.

To use corpora effectively, we need corpus managers such as AntConc, GraphColl, WordSmith, MonoConc, XAIRA or others, many of which are free. These tools provide ways to collect the necessary information. Corpus managers let you search for specific word forms, for all forms of a lemma, for groups of word forms in the form of phrases, and for word forms with certain morphological features; they also let you check punctuation, parts of words, spelling variants and so on.

Using corpora lets you not only examine lexical units in context but also obtain frequency data for word forms, lexemes, grammatical categories, co-occurring lexical units, collocations and so on. Frequency helps in determining the semantic differences between synonyms, identifying the contexts typical of each synonym, distinguishing between styles and genres, and finding collocations characteristic of certain social, gender and age groups. These statistics make your search results more objective than those obtained by other methods and make it more likely that you get the facts right.
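The idea of comparing near-synonyms by frequency can be sketched in a few lines of Python. This is a simplified illustration, not how any particular corpus manager works: the sample text, the tokenize and per_thousand helpers and the choice of a per-1,000-words rate are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny made-up sample; a real corpus would be far larger.
SAMPLE = (
    "Influenza is a viral illness. The illness spreads easily. "
    "Seasonal influenza can cause severe disease."
)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def per_thousand(word, tokens):
    """Relative frequency of a word per 1,000 tokens."""
    return 1000 * Counter(tokens)[word] / len(tokens)


tokens = tokenize(SAMPLE)
counts = Counter(tokens)

# Compare two near-synonyms by raw and relative frequency.
for word in ("illness", "disease"):
    print(word, counts[word], round(per_thousand(word, tokens), 1))
```

Relative frequencies are useful when you compare corpora of different sizes, because raw counts alone would favor the larger corpus.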

Corpora in practice

A corpus is a rather universal tool that can be used by any translator working independently. What you need is time to create the corpus and patience to check that it is reliable enough for further use. Newcomers who do not yet have built-up translation memories or established terminology databases might consider using a corpus. Corpora may be used independently or in combination with other translation tools.

Say you are planning to take part in projects that require translating pharmaceutical documents containing descriptions of diseases, and you have never done such translations before. In this case, you can go to trusted sources such as the World Health Organization website and search for texts. You can copy what you need (for example, texts in English and Russian for a parallel corpus), and insert them into a .txt file called “Diseases” as in Figure 1.

Then you will need to attentively read through your .txt file again for spelling and punctuation. This will take some time but you will get a reliable corpus. Save the file with the encoding UTF-8 or Unicode (File-Save as) as in Figure 2.

This is needed for the corpus manager to be able to read the contents of such files. The file can be small at first, and you can always expand it by adding more texts. For instance, you could add descriptions of more diseases. The file could make up your pharma corpus on its own, or it could be part of a larger corpus consisting of many files named according to their contents (pharmaceutical standards, patient information leaflets and so on). Please note that the sample corpus is not annotated, meaning it does not include any additional linguistic information. You can use it for your own needs or share it with your colleagues. Later, after collecting some good translations, it will be useful to add them to the corpus as files containing texts in both the source and target language. They will have to be converted into .txt files with the same encoding and named accordingly.
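If you want to check that every file in such a corpus can really be read as UTF-8, a short script can do it for you. The following Python sketch is a minimal example under stated assumptions: the load_corpus and summarize helpers are hypothetical names, and a temporary folder with one made-up sample file stands in for your own corpus folder.

```python
import tempfile
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in the folder as UTF-8; return {file name: text}."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # "utf-8-sig" also accepts files that start with a byte order mark.
        corpus[path.name] = path.read_text(encoding="utf-8-sig")
    return corpus


def summarize(corpus):
    """Return {file name: number of whitespace-separated words}."""
    return {name: len(text.split()) for name, text in corpus.items()}


# Demonstration with a temporary folder standing in for a real corpus folder.
with tempfile.TemporaryDirectory() as tmp:
    sample = "Influenza is a viral infection.\nГрипп это вирусная инфекция."
    Path(tmp, "Diseases.txt").write_text(sample, encoding="utf-8")
    corpus = load_corpus(tmp)
    sizes = summarize(corpus)

print(sizes)
```

A file saved in another encoding would typically raise a UnicodeDecodeError here, which tells you which file needs to be saved again.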

All corpus users will need a corpus manager such as AntConc or GraphColl, both of which are free and user-friendly and do not require any complicated installation.

A corpus is not a translation memory. It is a valuable resource offering reliable ways of using the language. Something that may be of use for translators in certain locales is that once texts and tools are downloaded, the use of corpora does not need any internet connection. It also doesn’t need expensive or demanding software.

Here are some examples of using a corpus in the AntConc environment. Assume that a corpus called “Diseases” has been loaded. If you want to know which words are used with the word “influenza,” sorted by frequency, you can search for collocates by entering this word in the search field, setting “Sort by Freq(R)” and clicking Collocates in the top panel, as in Figure 3.
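For readers curious about what a collocate search does in principle, here is a simplified Python sketch. It is not AntConc's own method: it only counts the words that appear within a fixed window around the search word, and the sample text, the collocates function and its left and right parameters are assumptions made for this illustration.

```python
import re
from collections import Counter

# Made-up sample text for illustration.
SAMPLE = (
    "The influenza virus spreads quickly. "
    "Seasonal influenza vaccine is updated yearly. "
    "An influenza virus can mutate."
)


def collocates(text, node, left=0, right=1):
    """Count words found within `left`/`right` tokens of the node word."""
    tokens = re.findall(r"\w+", text.lower())
    found = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - left)
            window = tokens[start:i] + tokens[i + 1:i + 1 + right]
            found.update(window)
    return found


# Words immediately to the right of "influenza", most frequent first.
right_neighbours = collocates(SAMPLE, "influenza", left=0, right=1)
print(right_neighbours.most_common())
```

Note that this sketch ignores sentence boundaries and applies no statistical measure; it only shows the counting idea behind a frequency-sorted list.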

If you click on the word “virus” in the resulting list, you will get the concordance, as seen in Figure 4.

And by clicking a specific entry from the concordance, you will see the larger context as in Figure 5.
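A concordance of this kind is often called a keyword-in-context (KWIC) view. The Python sketch below shows the basic idea under simple assumptions: the sample text, the kwic function and the context width of three words are chosen for illustration and do not reproduce any tool's exact behavior.

```python
import re

# Made-up sample text for illustration.
SAMPLE = (
    "The influenza virus spreads quickly between people. "
    "Vaccination protects against the influenza virus each season."
)


def kwic(text, node, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


hits = kwic(SAMPLE, "virus", width=3)
for left, key, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>30}  {key}  {right}")
```

Increasing the width parameter gives a larger context for each hit, in the same spirit as opening a concordance line to read the surrounding text.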

The clusters section provides good results as well, as may be seen in Figure 6.
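Clusters can be thought of as recurring word sequences that contain the search word. The Python sketch below counts three-word sequences that begin with the search word; the sample text, the clusters function and the choice of putting the search word first are assumptions made for this illustration.

```python
import re
from collections import Counter

# Made-up sample text for illustration.
SAMPLE = (
    "Seasonal influenza virus infection is common. "
    "The influenza virus infection rate rises in winter. "
    "Avian influenza virus strains differ."
)


def clusters(text, node, size=3):
    """Count word sequences of the given size that start with the node word."""
    tokens = re.findall(r"\w+", text.lower())
    found = Counter()
    for i in range(len(tokens) - size + 1):
        gram = tokens[i:i + size]
        if gram[0] == node:
            found[" ".join(gram)] += 1
    return found


result = clusters(SAMPLE, "influenza", size=3)
for phrase, count in result.most_common():
    print(count, phrase)
```

As with the other sketches, sentence boundaries are ignored here, so a real tool may report slightly different sequences.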

The program is user-friendly and can be installed on a PC or used from your memory stick. The corpus, along with the multi-functional corpus manager, provides the opportunity to see words in context.

Ideas to think about

In every beginning, think of the end. Think how much time and effort you spend making your translation smooth and perfect — why not make your own personal smart tool to give you the answers to your topical requests in seconds? Moreover, if you are curious about the language itself, your corpus is always ready to assist in your linguistic investigations and analysis.