Improving Spanish MT output with pre-editing guidelines

By Martín Ariano Gahn & Celia Rico Pérez, May 17, 2013

Although machine translation (MT) is still viewed unfavorably by many professional translators worldwide, many companies have come to realize that such software is the only viable solution for translating extremely large projects that need to be completed within a tight timeframe. Likewise, continuing development in natural language processing technology has improved output quality for both rule-based (RBMT) and statistical (SMT) engines.

As much as this holds true for the current state of MT, the fact remains that human intervention is mandatory if the translation is to be of publishable quality, that is, a text that conveys the same meaning as the original and is free of any syntactical, grammatical or lexical mistakes. To that end, pre-editing and post-editing have become the methods of choice for improving the output quality of MT software.

This article focuses on the development and application of pre-editing guidelines to be used in an RBMT environment working from Spanish into English. The 19 proposed pre-editing guidelines are based on a practical experiment in which English controlled authoring rules were adapted to pre-edit a Spanish source text that was then translated with Lucy Software’s free MT engine. The relative effectiveness of those guidelines was put to the test by comparing the MT output for the pre-edited and unedited versions and measuring the output quality in terms of post-editing effort.

For the purposes of this article, we will focus on presenting and briefly explaining these guidelines, some of which are considered to be applicable to several languages while others apply exclusively to Spanish, in the hope that translation agencies using MT to translate Spanish content into English can improve the input quality and therefore make MT use more profitable and less frustrating and troublesome for post-editors.

Pre-editing rules

Rule 1: avoid long sentences. This is hailed as the cornerstone pre-editing rule, which according to Sharon O’Brien in “Controlled English: An Analysis of Several Controlled Language Rule Sets” is the only rule shared by virtually all controlled languages developed for English. Logically enough, longer sentences tend to contain more ambiguities and syntactical complexities, whereas shorter sentences tend to be more easily parsed by the MT system. Even though there is no recommended set limit of words per sentence, we should always strive for simplicity and brevity. Long sentences can be shortened by repeating subjects or using adverbs or other connecting devices instead of using conjunctions.
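As a rough illustration of how Rule 1 could be checked automatically, the following Python sketch flags sentences that exceed a word limit. The function name, the naive sentence splitter and the default limit of 25 words are assumptions made for this example only; as noted above, there is no recommended set limit, so the threshold should be tuned to the text type and the MT engine.

```python
import re

def flag_long_sentences(text, max_words=25):
    """Return (word_count, sentence) pairs for sentences longer than max_words.

    max_words is a hypothetical threshold, not a value taken from the guidelines.
    """
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    flagged = []
    for sentence in sentences:
        word_count = len(sentence.split())
        if word_count > max_words:
            flagged.append((word_count, sentence))
    return flagged
```

A pre-editor could run this over the source text and review only the flagged sentences, shortening them by repeating subjects or replacing conjunctions with other connecting devices.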

Rule 2: use language logically, literally and precisely. Admittedly, this is an extremely general rule. How do we go about using language in a logical, literal and precise way? As a rule of thumb, we should make a conscious effort to adhere to the literal meaning of words, mainly as defined by dictionaries, as well as to the logical construction of phrases. Let us illustrate this point with an example: Este informe compara los salarios de diferentes departamentos cuyos empleados tienen el mismo nivel educativo. This sentence, although grammatically and syntactically correct, is illogical because it is not the departamentos that earn a salary, but rather the employees working in those departments. Therefore, a more logical, literal and precise way to write this sentence would be Este informe compara los salarios de los empleados que trabajan en diferentes departamentos pero que tienen el mismo nivel educativo.

Rule 3: avoid idioms, figurative language and cultural references. The use of idioms, figurative language and cultural references is one of the best-known negative translatability indicators and poses a particular difficulty for MT systems, not only because of the mismatch between languages concerning the use of idioms and other multiword units such as collocations, but also because of the cultural aspect. Likewise, idioms can usually be interpreted in either a literal or a figurative way (think of the case kick the bucket/estirar la pata).

Given that RBMT engines mostly offer a word-for-word rendition, any culturally-oriented or creative use of the language will get lost in translation with MT and will elicit some laughter from the post-editor who will be amused by the absurdity of the output. Thus, instead of using phrases like Mi madre solía decir que el que mucho abarca, poco aprieta or La Roja volverá a ganar, try to strip those expressions to the very basics by removing the cultural contents and converting them into more straightforward and pared-down sentences (such as Mi madre solía decir que uno no debe hacer muchas cosas a la vez and La selección española volverá a ganar).

Rule 4: don’t omit words. Context, pragmatic knowledge (knowing what expressions mean in particular situations) and real-world knowledge (also termed common-sense knowledge) allow a writer to omit words that would be obvious to a human reader. However, an MT system lacks this kind of knowledge and is unable to understand a text beyond the actual words appearing in a sentence. For example, a phrase such as El conjunto de objetos transportados por el viajero durante su viaje, bien sean de mano o facturados poses no interpretation issues for a human being; however, the MT system will not understand that de mano and facturados refer to objetos, which in this particular case means the traveler’s luggage. To avoid possible misinterpretations, we recommend inserting the missing noun and saying El conjunto de objetos transportados por el viajero durante su viaje, bien sea equipaje de mano o equipaje facturado.

Rule 5: use a verb-centered style. In The Global English Style Guide, John Kohl recommends using verbs to convey the most significant actions and thus avoiding unnecessary verb-noun collocations. In a sentence such as Sirve también para hacer llamadas y enviar SMS, avoid the more complex hacer llamadas construction and replace it with the verb llamar. The meaning remains unaltered and the phrase is less complicated.

Rule 6: don’t use non-pronominal verbs as pronominal verbs, or vice versa. Depending on the variety of Spanish, there is some variation in the usage of pronominal verb forms versus non-pronominal verb forms. Some usages are typically dialectal and deviate from the academic norm. It is always best to check an authoritative source, such as the Diccionario de la Real Academia Española (DRAE), in order to make sure we are conforming to the standard usage. This is important because a dialectal usage will most likely not be recognized by the MT system. For instance, if we say Me regresé a casa or Estoy recuperando de una lesión muscular, the online version of Lucy Software translates these sentences as I went myself back and I am recovering of a muscular injury, whereas if we abide by the standard usage, that is, Regresé a casa and Me estoy recuperando de una lesión muscular, we get I went back home and I am recovering from a muscular injury.

Rule 7: simplify verbal periphrases. The Spanish language has a wealth of verb periphrases, which express nuances that cannot be conveyed by either the simple or the compound verb forms. Texts that are meant to be processed through an MT system should use verbal periphrases sparingly because not all languages have equivalent verb forms. Therefore, it is recommended that periphrastic forms be replaced by simple forms. For instance, instead of saying La Carta de Servicios de la Agencia Tributaria que ahora se actualiza viene a mantener y profundizar las líneas de actuación puestas en marcha desde su creación, try using simple verb forms, such as La Carta de Servicios de la Agencia Tributaria que ahora se actualiza mantiene y profundiza las líneas de actuación puestas en marcha desde su creación.

Rule 8: consider using similar structures for coordinated clauses. It is always best to use the same structure and order the elements of the sentence in the same way when clauses are coordinated. This will result in a clearer text. For example, replace the infinitive (retirar) in El horno deberá limpiarse con regularidad y retirar cualquier resto de comida with an impersonal structure similar to the one at the beginning of the sentence, such as El horno deberá limpiarse con regularidad y deberá retirarse cualquier resto de comida.

Rule 9: make sure gender and number agreement is correct for all parts of speech. This common-sense rule needs no further explanation. Controlled languages follow the quality at the source principle, which means that all potential errors contained in the original text will be replicated in the translation, or will somehow affect or impede understanding in the target language. The best practice is always to check the original for correct agreement of all parts of speech.

Rule 10: make sure your pronouns have a clear antecedent. Anaphora resolution is one of the most common pitfalls of MT. Most MT systems don’t satisfactorily deal with anaphoric reference and, when they do, it is only at the sentence level. Anaphora resolution is a much studied and still unsolved problem in MT research. Until an answer is found, just try to avoid anaphoric expressions or, if needed, explicitly mention all antecedents without any ambiguity.

Rule 11: make the subject explicit without overdoing it. This rule deals with a naturally occurring and frequent omission in Spanish, that is, the subject personal pronouns (yo, tú, él/ella, nosotros and so forth). Unlike English or French, the Spanish language does not need an explicit subject for the sentence to be grammatically correct, since that information is expressed by the inflected verb. A particular problem arises for MT when we are translating from a language that does not require a subject into a language that does require it, such as the Spanish into English combination. Studies have shown that SMT engines have less difficulty in inserting the missing subject pronouns than RBMT engines. That notwithstanding, the more pronouns that appear in the text, the less ambiguous the text will be, and the MT system will not have to guess what the correct pronoun is. For instance, how is the machine to know what the correct subject is in a sentence like Menos mal que sabía que venía? As Spanish-speaking people know, these verbs (sabía and venía) could correspond to several subjects (yo, usted, él/ella) and the machine has no way of knowing what the real subject is in that particular context. A caveat is in order regarding this rule, though. We cannot add a subject pronoun for each verb because it would sound awkward. Let’s take imperative sentences, for instance. We could say Tú ven aquí or Usted venga aquí, but only if we want those sentences to be emphatic. Otherwise, it wouldn’t make sense to include the subject pronouns.

Rule 12: avoid placing the subject after the verb. Spanish has traditionally been considered an SVO language, meaning that the most common order of the elements in a sentence is subject + verb + object. However, there is some flexibility as far as word order goes, unlike English, which has a more or less fixed word order. We have seen that when we place the subject after the verb, which is possible in Spanish, the MT engine sometimes finds it difficult to recognize a postponed noun as the subject of a sentence, which is why it ends up inserting a pronoun in order to make the English sentence grammatically correct. Therefore, instead of writing El domingo continuará la inestabilidad y no se descartan algunos chaparrones, write El domingo la inestabilidad continuará y no se descartan algunos chaparrones. Considerations similar to those in Rule 11 apply, since this strategy cannot be used in all cases. For instance, in a phrase like Es la señal que recibe el dispositivo que estás utilizando para comunicarte, we cannot place the verb recibe at the end of the sentence, because it would be awkward to have a verb after such a long subject. However, we could have recourse to other methods, such as using a participle: Es la señal recibida por el dispositivo que estás utilizando para comunicarte.

Rule 13: always end a sentence with a period and consider using a comma to separate prepositional phrases from the rest of the sentence. Generally speaking, punctuation rules must be strictly complied with in all texts that are to be processed through an MT engine. An accidentally misplaced punctuation mark can change the meaning of a whole sentence as well as the parsing done by the MT engine. First, a full stop is always required as the only end-of-sentence marker. Second, it is best to separate adverbial clauses from the rest of the sentence by using a comma, as in the following sentence: Cuando se produzcan estas interferencias, se pueden eliminar o reducir siguiendo las siguientes indicaciones.

Rule 14: avoid excessive use of prepositional phrases beginning with de. In Spanish, nouns are not clustered together. The main word comes first and the modifiers are linked to it with prepositional phrases. The preferred preposition is de, which is particularly polysemic (in fact, the DRAE lists 27 different meanings and uses). When there are more than three prepositional phrases together it is difficult to understand which prepositional phrase modifies which. Therefore, it is best to use other structures, such as other prepositions or relative clauses, to disambiguate. For instance, a sentence like El sistema establecido de seguimiento y control del cumplimiento del Plan de Objetivos asegura la calidad de los servicios y procedimientos de la Agencia Tributaria could be changed into El sistema establecido para seguir y controlar el cumplimiento del Plan de Objetivos asegura la calidad de los servicios y procedimientos que la Agencia Tributaria ofrece.
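A crude way to spot candidates for Rule 14 is to count how often de and its contraction del occur in a sentence. The Python sketch below does only that; the function names are hypothetical, and the default limit of three mirrors the "more than three" remark above. Note the simplification: it counts every occurrence in the sentence rather than only phrases that are chained together, so it can over-flag and its results need human review.

```python
import re

# Matches the preposition "de" and its contraction "del" as whole words.
DE_PATTERN = re.compile(r"\b(?:de|del)\b", re.IGNORECASE)

def count_de_phrases(sentence):
    """Count occurrences of 'de' and 'del' in a sentence."""
    return len(DE_PATTERN.findall(sentence))

def has_too_many_de(sentence, limit=3):
    """Return True when the sentence contains more than `limit` de/del prepositions."""
    return count_de_phrases(sentence) > limit
```

Applied to the two versions of the Agencia Tributaria example, the unedited sentence contains six such prepositions and is flagged, while the rewritten one contains three and passes.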

Rule 15: avoid using impersonal structures, such as the hay que + infinitive structure and the impersonal se particle. Impersonal structures tend to cause problems for MT software, mainly because there is no noun or pronoun that the system can interpret as the subject of the sentence. There are several impersonal constructions in Spanish, but the RBMT system we worked with had particular difficulty translating the hay que + infinitive structure and the impersonal se construction. In the first case, Lucy Software would present us with several alternatives for the third-person subject (it/he/she) or drop the subject altogether, which is not grammatically correct in English. We recommend replacing this construction with a personal construction, as in the following example:

Unedited input: Ya no hay que ir a grandes restaurantes minimalistas porque las tapas presentadas a concurso te sorprenderán por su calidad y buen precio.

Pre-edited input: Ya no tienes que ir a grandes restaurantes minimalistas porque las tapas presentadas a concurso te sorprenderán por su calidad y buen precio.

The impersonal se construction is also problematic for MT. Seemingly, Lucy Software has no difficulty in translating an impersonal se construction as long as there is a noun that could be decoded as the subject of the verb, such as in the sentence Se instalarán progresivamente en las oficinas de la Agencia Tributaria sistemas automáticos de gestión de tiempos de espera, where sistemas automáticos de gestión de tiempos de espera is interpreted as the subject of the impersonal verb se instalarán. Rather, it is those sentences where there is no passive subject that cause problems for MT. We recommend inserting a subject in such cases, as in the following examples:

Unedited input: Se estudiaba menos antes.

Pre-edited input: La gente estudiaba menos antes.

Unedited input: Se vivirá mejor en el futuro.

Pre-edited input: Viviremos mejor en el futuro.
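The two constructions discussed under Rule 15 can be located with simple pattern matching before a human decides how to rewrite them. The following Python sketch is an illustration only: the function name and both patterns are assumptions, and the second pattern merely looks for a sentence that starts with se, so it will also flag reflexive verbs and se constructions that do have a passive subject.

```python
import re

# "hay que" followed by a word ending in -ar, -er or -ir (a likely infinitive).
HAY_QUE = re.compile(r"\bhay que\s+\w*(?:ar|er|ir)\b", re.IGNORECASE)
# A sentence that begins with "se" followed by another word.
SE_INITIAL = re.compile(r"^\s*se\s+\w+", re.IGNORECASE)

def find_impersonal(sentence):
    """Return (label, matched_text) pairs for candidate impersonal structures."""
    hits = []
    match = HAY_QUE.search(sentence)
    if match:
        hits.append(("hay que + infinitive", match.group(0)))
    match = SE_INITIAL.search(sentence)
    if match:
        hits.append(("sentence-initial se", match.group(0).strip()))
    return hits
```

The function only reports candidates. Choosing the right personal subject (tienes que, la gente, a first-person plural verb and so on) still depends on context that the pre-editor has to supply.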

Rule 16: eliminate colloquialisms. Colloquialisms and slang expressions change very quickly and are not always included in dictionaries; therefore, MT systems are unlikely to recognize this sort of language. Likewise, informal usages, such as diminutive and augmentative suffixes, which add an emotional layer to communication, should be avoided since the MT engine generally does not recognize them. In accordance with this rule, then, instead of saying Ha hecho un mogollón de trabajo or Ha hecho un montonazo de trabajo, it is best to say Ha hecho un montón de trabajo.

Rule 17: use standard spelling. Although Spanish spelling is very much unified across its varieties, it is always best to check an authoritative source if in doubt about the spelling of a word. Misspelled words will not be recognized by the MT system, so make sure to run the spellchecker on the text before processing it for translation. Take into account that any minor variations, such as the use of a hyphen, may affect the way the MT software interprets a word. For instance, if we write socio-lingüística, Lucy Software incorrectly translates it as “member|partner-linguistics,” but if we write the word without a hyphen, then the software presents us with the correct translation.
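Hyphenated forms such as the one above are easy to list automatically so that each can be checked against an authoritative source. The Python sketch below is a minimal illustration with a hypothetical function name; it only reports hyphenated words and deliberately does not remove the hyphens, since some compounds are legitimately hyphenated.

```python
import re

# One or more word characters, followed by at least one "-word" group.
HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")

def find_hyphenated_words(text):
    """Return every hyphenated word in the text, in order of appearance."""
    return HYPHENATED.findall(text)
```

For the example in Rule 17, the function would report socio-lingüística, which the pre-editor can then rewrite as a single word.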

Rule 18: eliminate localisms. According to a report published by the Instituto Cervantes in 2010, Spanish is spoken by more than 450 million people worldwide. It is spoken on the five continents and is the official language of 19 Latin American countries, Spain and Equatorial Guinea. Also, it is the second most common language in the United States. This geographical diversity gives rise to wide dialectal differences on the lexical, syntactical, morphological and phonological levels as can be seen in Table 1.

An international audience calls for standard usage of the language. Researchers agree that differences among Spanish varieties diminish greatly in the formal register, which is the language as used by the media and academia. Scholars also agree that most of the differences among Spanish varieties are found in the lexical component. In order to address this hurdle to global communication, the concept of neutral Spanish, that is, a form of Spanish that is understood by Spanish speakers from all over the world, has been developed. Although writing in this so-called neutral Spanish is easier said than done, it has been extensively used in the localization and audio-visual translation industries with an acceptable measure of success, and it stands to reason that the same principles apply to controlled authoring for MT.

Rule 19: eliminate acronyms and abbreviations. Unless they are very well-known, acronyms will most likely be an unknown term for any MT system. Refer to Figure 1, where we can see that some widely used and well-known acronyms have been correctly translated into English. However, the acronym ONG, which should be translated as NGO, has been left in Spanish (shown in red).
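One possible way to apply Rule 19 is to expand acronyms from a glossary before the text goes to the MT engine and to report any that the glossary does not cover. The Python sketch below assumes a glossary supplied by the pre-editor; the function name, the pattern (two or more consecutive capital letters) and the glossary entry used in the usage note are illustrative assumptions, not part of the original experiment.

```python
import re

# Two or more consecutive uppercase letters, including Spanish accented capitals and Ñ.
ACRONYM = re.compile(r"\b[A-ZÁÉÍÓÚÑ]{2,}\b")

def expand_acronyms(text, glossary):
    """Expand acronyms found in `glossary`; return (new_text, unknown_acronyms)."""
    unknown = []

    def replace(match):
        acronym = match.group(0)
        if acronym in glossary:
            return glossary[acronym]
        # Leave unknown acronyms untouched and report them for manual review.
        unknown.append(acronym)
        return acronym

    return ACRONYM.sub(replace, text), unknown
```

With a glossary such as {"ONG": "organización no gubernamental"}, the acronym ONG is written out in full, while any acronym missing from the glossary is returned in the second value so that the pre-editor can decide whether to expand it by hand.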

Final comments

It has been demonstrated through business-oriented experiments that MT output quality improves, and consequently post-editor productivity increases, when the text is pre-edited before it is processed through an MT engine. Admittedly, the relative effectiveness of the above-mentioned rules will largely depend on the type of MT system: a guideline that improves the translated output for one MT system might have a negligible effect for a different MT system. Likewise, we cannot introduce changes to a text that would sound unnatural to a native speaker in order to make it amenable to MT, so due care is needed when applying some of the preceding rules. Last but not least, some text types impose certain stylistic conventions that cannot be overlooked and that make a text particularly intractable to MT. All of these considerations should be taken into account in order to decide whether the pre-editing effort would be worthwhile.