From real world to localization classroom — and back again

Like other teachers, localization instructors continually look for material that will enhance the meaning and effectiveness of their courses and furnish students with tangible experience that grounds theory in the reality of the workplace. Students new to the world of localization quickly become aware that while some localization procedures are clear-cut, others require them to constantly reanalyze, reassess, rethink and repeat steps and phases in order to fulfill their objectives. Due to limited classroom time on a project within an academic semester, they must learn to think quickly and be critical of the basics while they are learning them. Some class experiences, however, surpass

The Concordia University (Montréal, Canada) translation program offers a diverse range of courses, from advanced translation theories to practical specialized subject translation and basic website localization. This diversity is reflected in the students’ interests and backgrounds. One strong shared trait, however, is a visible commitment to current issues and challenges in social and economic justice, the environment, and health care, globally as well as locally. Not surprisingly, then, students in the 2005 localization class agreed to embark on a class project that would seek to directly assist a global nongovernmental organization in social and economic initiatives.

Off to Ecuador

During the winter of 2004, student Glenn Clavier worked as a volunteer for several months with the YMCA-ACJ Ecuador, helping the association develop local sources of revenue to fund social and environmental projects aimed at empowering the country’s poor, particularly young people and women. The YMCA-ACJ Ecuador focuses on sharing the country’s amazing diversity with the world while fostering socially and ecologically responsible tourism. The revenue project with the greatest potential proved to be YMCA Tours Ecuador, which offers packages that showcase and involve participants in the association’s work as well as tours provided in cooperation with local operators. When Clavier left Ecuador, he made a commitment to continue to assist them. This evolved into a three-phase project: to improve content and launch the task of localizing into other languages (French in particular); to identify and work with a partner or partners to redesign and enhance the look, feel and usability of the site; and to identify a new partner to host the site at no cost on a high-speed server.

Clavier raised the possibility of localizing the YMCA Tours Ecuador website as a class project to Concordia University professor Debbie Folaron, who felt that partnering with the YMCA and YMCA Tours Ecuador presented a real-world project that would hone students’ practical skills while responding to their interests and commitments. Moreover, the University and the YMCA already had a strong historical connection. Concordia was the product of a merger in the mid-1970s between two private schools: Sir George Williams University, founded by the YMCA in 1926, and Loyola College, founded by the Jesuits in 1896.

When the proposal was put on the table the first day of class, the consensus among the ten students was to take on the project, even if it meant that they would need to devote extra time and effort individually and collectively in order to compensate for a lack of immediate resources, infrastructure and adequate meeting time (a maximum of 2.5 hours class time together per week for a period of 13 weeks, which is not even one 40-hour work week).

After a general presentation and overview of the project and a brief assessment of the website online, the group decided to organize and divide up the tasks so time could be optimized during the next meeting. Readings to initiate (catapult) the students into the world of localization were provided through Bert Esselink’s book A Practical Guide to Localization; John Yunker’s Byte Level Research site and his book Beyond Borders: Web Globalization Strategies; articles and information posted on the Localisation Research Centre website; and articles from MultiLingual Computing & Technology.

Already familiar with the processes of translation and terminology management, the students quickly learned about and began to appreciate the usefulness of procedures such as internationalization and globalization in the context of localization. They also acquired first-hand knowledge about project management and the advantages and shortcomings of different types of computer-assisted translation (CAT) tools and technologies, especially when applied to a case that did not follow a textbook model — which is often what constitutes the stuff of a real-life project.

Initial student core on class project: Christina Anderes, Khalida Benamar, Glenn Clavier, Geneviève Doucet, Cindy Joyal, Sophie Lellouche, Alexia Papadopoulos, Nargisse Rafik, Helena Sergakis, Lindsay Schonfelder
Faculty member assisting with GIMP: Fabien Olivry
Students joining the project later: Slimane Boumghar, Ryan Ebro

Time to organize

By the third and fourth sessions, the group had already begun to get a strong sense of the complexities involved behind the scenes when tackling a project that had seemed simple and straightforward. As might be expected from work that had been carried out over a period of time by a series of volunteers, specific batches of web pages had been written in accordance with the diverse linguistic and technical (HTML) writing skills of each individual person. With changing volunteers and with all the evolving web technologies, the website no longer had a clearly mapped-out architecture or direction in design.

The group decided first to assess the project needs and then to distribute tasks according to individual capabilities and preferences. Once the existing English and Spanish versions had been compared and discrepancies eliminated, the anglophone half of the group organized to revise and harmonize the English-language text, while the francophone half of the group organized to translate the text into French. Each language group was subsequently divided into pairs to review each other’s work, under the coordination of a language lead. During class time, the group extracted the terminology in the English text that was deemed fundamental, both for harmonizing the English and for translating into French, then discussed it and compiled it collectively into a format everybody could use.

While waiting for instructions on how to access and download the source files, the group captured files from the site with a webspider and performed word counts. Students assessed the style and presentation of information on the site and determined whether all of the content was culturally appropriate and meaningful. Tangentially, they learned what information would necessarily be furnished to a potential client when breaking down project tasks and calculating time and rates to produce a project estimate or proposal. They learned how to discern file types and discussed which types of tools would be needed to work with the diverse formats presented, in particular should they wish to create a translation memory (TM) for future localizations and updates of the site.
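The word-count step can be illustrated with a minimal sketch. It assumes the spidered pages are already available as strings; the helper name `count_words` and the tokenizing rule are illustrative assumptions, and the result will only approximate the counts a commercial CAT tool would report.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(html_text):
    """Rough word count of the translatable text in one HTML page."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.chunks)
    # A "word" here is a run of letters/digits, optionally joined by
    # an apostrophe or hyphen (an assumption, not a tool's actual rule).
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))
```

Summing `count_words` over every captured page gives a first figure on which to base a time and cost estimate.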

Let the technical fun begin!

Obstacles emerged almost immediately and encumbered any efficient use of time and resources. First, given the patchwork design of the existing site, it proved time-consuming to keep track of the website file structure in a logical, coherent fashion. Then, because more than one level of HTML writing, code tags and character sets had been used to create the web pages, it proved problematic to provide students with a single set of instructions on how to prepare and process the web content in editing and CAT tools.

Steps to equip the public university lab with a full set of standard professional CAT tools were still underway, and HTML editing applications had not yet been installed either, so the students needed to experiment with a wide array of free, open-source or low-cost tools they could download onto their home computers for use. This provoked an equally wide array of problems and issues to troubleshoot case by case. Even once a single tool was identified for use on the project, differing results on different operating systems continued to hinder its use.

The site graphics containing text were no longer available in their source file format. The students had to recreate them so that the text could be translated and inserted into the text layers. No commercial professional graphics editing program was available to them in the lab, but a faculty member was well-versed in open-source applications, so they decided to learn GIMP to create or recreate layered source files for graphics localization into French.

Because of semester time constraints and the diverse configurations of students’ home computers, it was decided that terminology and simple TM files would be created and saved in .csv format so that they would be immediately available and useful for the project translation. Once the HTML files were revised in English and translated into French by the students for their coursework, they would be aligned and processed later in professional CAT tools to provide YMCA Tours Ecuador with validated HTML files and a bona fide TM for future updates and other localizations.

Everything is not always as it seems

For the last day of class the students prepared to show and discuss their operative terminology list, TM, revised English HTML files, translated French HTML files and recreated image files. In a few hours together over a period of 13 weeks, they had analyzed the site, assessed the needs of the website localization project, allocated their resources, distributed tasks, managed the project on several levels, reviewed a wide array of free or low-cost CAT tools and coordinated efforts to review and carry out basic quality control for language and culture. They had learned about and participated in local initiatives to assist the YMCA social programs in Ecuador and set as their goal to launch the newly revised website for the upcoming international YMCA conference in autumn 2005. All that was left to do was to restructure the website files to accommodate the new language version, validate the HTML, create a TM from aligned source and target content files in professional CAT tools, compare and review the image files, and launch the site — a perfect little summer project.

But the files were not as clean as they first appeared to be. After initial attempts to process them proved futile, the best way to set a course of action seemed to be to consult with an experienced professional. Students continued to express keen interest in the progress of the project, and some who were in and out of the city throughout the summer volunteered their time for updates.

Segments and matches and tags, oh my!

But the trajectory was destined to be anything but smooth. A preliminary quality assurance check in HTML QA revealed more than 1,300 problems. Closer analysis revealed that the free and open-source programs used by the students to edit and translate the content had significantly modified the code, as well as many links, throughout the HTML web pages. Would the professional CAT tools on the market be able to effectively deal with the problems encountered and provide a relatively quick solution? No project manager anywhere would ever have the luxury of time to fix 1,000-plus problems if files were returned to him or her in this state for a TM, so the question was indeed pertinent.

HTML QA reported 1315 errors.

The different tools had modified the HTML codes in the source version during revision of the English and in the target version during translation into French. Certain source language files no longer displayed correctly, and numerous problems were visible in the target language.

Such an elevated number of problems was too significant for correction one by one. The first solution envisioned for a case such as this (low volume to translate) was simply to start the job over with the appropriate tools, but given the time allocated for the project it was out of the question.

Analyzing the options offered by diverse professional translation tools, participants considered another solution: to correct the problems (relatively few) in the English version; to carry out a quick alignment in order to create a TM; and finally to pre-translate all the source files with the TM to obtain a translated version of the source files that had identical HTML codes.

Many links had been broken during the translation process.

The goal of this solution was not to create an optimal TM where every segment is matched with its translation, but to use the tool to carry out a fast alignment that would allow automatic retranslation of the original files without altering the codes. Furthermore, the procedure was easy to validate. It would suffice to align the files and accept the alignment as is, anticipating that subsequent analysis would yield 100% of 100% matches. A memory created this way might not be perfect (unless the tool was capable of carrying out a perfect alignment), but if it was only to be used to retranslate the same YMCA Tours files, the result would be correct.
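The idea of retranslating the original files without touching their codes can be sketched in a few lines. This is only an illustration under stated assumptions: it matches whole text blocks between tags rather than sentences, the function name `pretranslate` is hypothetical, and it does not reproduce how TRADOS or SDLX work internally.

```python
import re

# Splits an HTML string into alternating text and tag tokens.
TOKEN_RE = re.compile(r"(<[^>]+>)")


def pretranslate(source_html, tm):
    """Replace each text block found in the memory; copy tags verbatim.

    Returns the regenerated HTML plus (matched, total) block counts so
    the run can be validated: matched == total corresponds to the
    "100% of 100% matches" outcome.
    """
    out, matched, total = [], 0, 0
    for token in TOKEN_RE.split(source_html):
        text = token.strip()
        if TOKEN_RE.fullmatch(token) or not text:
            out.append(token)  # tags and pure whitespace are untouched
            continue
        total += 1
        if text in tm:
            matched += 1
            # Keep the original leading/trailing whitespace of the block.
            out.append(token.replace(text, tm[text], 1))
        else:
            out.append(token)  # no exact match: leave the source text
    return "".join(out), matched, total
```

Because the tags are copied straight from the source, the regenerated target file has the same code skeleton as the original, whatever the editing tools had done to the translated copy.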

HTML file alignment

In theory, HTML files are good candidates for a fast and reliable alignment. The text to align is often separated by tags that allow the alignment tools to resynchronize easily in case of difficulties. In broad terms, the basic algorithm of an alignment tool is to perform a sentence-by-sentence alignment (by segmenting source and target files); but it can get “lost” during the process if the number of sentences in the source and translated version are very different. So, when comparing the source and target text, the alignment tool relies on a few cues to resynchronize if and when necessary. For example, to determine if two sentences are candidates to be aligned, it can check if they both contain the same dates or numbers, if they both contain or are surrounded by tags, and so on. These cues are distinctive precisely because not every sentence contains a date or a number.
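The resynchronization cues just described can be illustrated with a small sketch. The function names and regular expressions are assumptions made for this example; a real alignment tool would also have to cope with locale-specific number formats (for instance “1,300” versus “1 300”).

```python
import re

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)")


def cues(sentence):
    """Return the numbers and tag names that appear in a sentence."""
    numbers = sorted(NUMBER_RE.findall(sentence))
    tags = [name.lower() for name in TAG_RE.findall(sentence)]
    return numbers, tags


def looks_like_anchor_pair(source, target):
    """True if both sentences share the same non-empty set of cues."""
    src_numbers, src_tags = cues(source)
    tgt_numbers, tgt_tags = cues(target)
    if not src_numbers and not src_tags:
        return False  # nothing distinctive to resynchronize on
    return src_numbers == tgt_numbers and src_tags == tgt_tags
```

A pair such as “The tour leaves on 14 October 2005.” and “Le circuit part le 14 octobre 2005.” passes the check, whereas two sentences with no numbers or tags give the tool nothing to hold on to.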

Since HTML files contain many tags and since the sentence content between these tags is mostly limited in quantity, alignment tools can optimize procedures by first aligning the tags in order to isolate the groups of sentences and then only later try to align the groups of sentences. This simplifies the process and renders it more reliable. This theory is, however, only valid as long as the HTML files to be aligned are translated versions of each other and have not had their tags modified, or very little, so that the same tags surround the same groups of sentences. In the case of the YMCA Tours files, however, the tools used to translate had modified the existing tags or had inserted additional ones.
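A minimal sketch of this tag-first strategy follows. It assumes, as the theory does, that both files share the same tag skeleton; `split_blocks` and `align_by_tags` are hypothetical names, and the sketch stops at pairing text blocks rather than going on to align the sentences inside them.

```python
import re

TOKEN_RE = re.compile(r"(<[^>]+>)")
NAME_RE = re.compile(r"<(/?)\s*([a-zA-Z0-9!]+)")


def split_blocks(html_text):
    """Return (tag_skeleton, text_blocks) for an HTML string."""
    tags, blocks = [], []
    for token in TOKEN_RE.split(html_text):
        if TOKEN_RE.fullmatch(token):
            m = NAME_RE.match(token)
            # Compare tag names only, so attribute changes do not matter.
            tags.append(m.group(1) + m.group(2).lower() if m else token)
        elif token.strip():
            blocks.append(token.strip())
    return tags, blocks


def align_by_tags(source_html, target_html):
    """Pair text blocks positionally when the tag skeletons agree."""
    src_tags, src_blocks = split_blocks(source_html)
    tgt_tags, tgt_blocks = split_blocks(target_html)
    if src_tags != tgt_tags or len(src_blocks) != len(tgt_blocks):
        return None  # skeletons differ: the tags cannot serve as anchors
    return list(zip(src_blocks, tgt_blocks))
```

As soon as a tool inserts even one extra `<font>` pair in the translated file, the skeletons no longer agree and the sketch returns `None`, which is the situation the YMCA Tours files were in.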

Alignment, beginning with TRADOS

It was imperative, then, to determine if the proposed solution was valid by first conducting a trial run on a few of the files. We started to work with TRADOS WinAlign and discovered quickly that the alignment did not yield good results.

Taking the time to fix the alignment was not an option. At any rate, it was not important because we knew that if the pre-translation gave 100% of 100% matches, the translated files would be right. We therefore decided to validate the automatic alignment as it was and created the TM. But an analysis performed to count the segments that could be translated with the memory unfortunately gave extremely poor results. This was quite surprising.

If we translate a document and create a memory during the translation process, and then a few minutes later we carry out an analysis of the same document against the same memory, we expect to have 100% of 100% matches. So, why was it not the case in this situation? We had done an alignment, which was supposed to simulate the steps of a translation. We had built a memory. We had run an analysis, expecting to have 100% of 100% matches (source-text comparison). Why were the results not as anticipated?

In order to understand the problem, it is necessary to analyze the three types of links created by alignment tools.

1) A source sentence (or source segment) is linked to one and only one target sentence. This is the ideal, perfect alignment.

2) A source sentence is linked to several target sentences. This is a good alignment because while the TM is being used, this source sentence will be equally identified as a segment and will yield a 100% match.

3) Several source sentences are linked to a single target sentence. This is a problematic alignment because while the TM is being used, segmentation will identify one single sentence at a time and thus will not find the combination of source sentences stored in the memory.

In this case, if the alignment had only generated links of type 1 or 2, we would have obtained 100% of 100% matches during the analysis. Unfortunately, WinAlign had created many “type-3” links.
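The effect of the three link types can be reproduced with a toy memory. The sketch below assumes a naive segmentation rule (break after a period, exclamation mark or question mark) and exact-match lookup only; the function names are illustrative and do not describe WinAlign's internals.

```python
import re


def segment(text):
    """Naive segmentation: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_tm(links):
    """Each link is (source_sentences, target_sentences), both lists."""
    return {" ".join(src): " ".join(tgt) for src, tgt in links}


def exact_matches(text, tm):
    """Return (hits, total): how many segments find a 100% match."""
    segments = segment(text)
    hits = sum(1 for s in segments if s in tm)
    return hits, len(segments)


links = [
    # Type 1: one source sentence, one target sentence.
    (["Book early."], ["Réservez tôt."]),
    # Type 2: one source sentence, several target sentences.
    (["Groups are small."], ["Les groupes sont petits.", "C'est voulu."]),
    # Type 3: several source sentences, one target sentence.
    (["Guides are local.", "They speak English."],
     ["Les guides sont locaux et parlent anglais."]),
]
tm = build_tm(links)
source = "Book early. Groups are small. Guides are local. They speak English."
print(exact_matches(source, tm))  # (2, 4)
```

The two sentences behind the type-3 link are stored in the memory as a single combined unit, so neither is found when the analysis segments the source one sentence at a time.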

WinAlign created many multiple-to-one links.

On to SDLX

We did not have enough time to fix all the problems (and wanted to test another tool), so we decided to try SDLX. At the time, we had only an older version available: SDLX 3.5. We started the alignment process and were delighted to see that SDLX Align apparently did a much better job than TRADOS WinAlign. We aligned a few files, created a memory and then ran an analysis. The results were excellent: 98% of 100% matches.

We began enthusiastically aligning the other files, but soon found ourselves blocked when SDLX simply refused to align HTML files where some tags were missing. For example, if a file had a <font> beginning tag, the tool expected it to have a </font> ending tag somewhere; if a file had a <b> tag somewhere, it did not tolerate finding two </b> ending tags. The YMCA Tours files had many of these unbalanced or added tags since the files had been created by different people using different tools or by manual coding. Nonetheless, the behavior displayed by SDLX was surprising. HTML browsers tolerate these kinds of things perfectly well.

Knowing alignment algorithms, this was even more surprising. An alignment tool should use the tags to resynchronize itself and should not be the least concerned about the value of the tags. In this example, if there are two </b> tags in the source and two </b> tags in the target file, it should still resynchronize perfectly even if one of the tags is technically not useful. Again, the value of the tags should be important for a browser but not for an alignment tool. But SDLX Align simply refused to align the files containing unbalanced tags.
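Locating such unbalanced tags before running a tool is straightforward with a small checker. The following is a sketch built on Python's standard `html.parser` module, not a description of SDLX's filter; the names `TagBalanceChecker` and `find_unbalanced_tags` are made up for the example. Like the behavior described above, it is stricter than a browser: it also flags end tags that HTML allows authors to omit, such as `</p>`.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line number where opened)
        self.problems = []   # list of (line number, message)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> are balanced by definition

    def handle_endtag(self, tag):
        line = self.getpos()[0]
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                # Anything opened after the matching tag was never closed.
                for unclosed, opened_at in self.open_tags[i + 1:]:
                    self.problems.append((opened_at, f"<{unclosed}> never closed"))
                del self.open_tags[i:]
                return
        self.problems.append((line, f"</{tag}> has no opening tag"))


def find_unbalanced_tags(html_text):
    """Return a sorted list of (line, message) for unbalanced tags."""
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    for tag, opened_at in checker.open_tags:
        checker.problems.append((opened_at, f"<{tag}> never closed"))
    return sorted(checker.problems)
```

Unlike the error message we received, a report of this kind says on which line each problem occurs, which makes repairing a working copy of the files far quicker.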

Thinking we could fix the problem easily for a few files, we repaired the tags in a copy of the source and target versions of the website and aligned those copies to build the memory. We worked on a copy because it is essential during the translation process not to add tags in the source, and certainly not in the translated version, of a website. (Remember, the golden rule of localization is to return to the customer exactly the same files as the originals but with translated text.) Since we had many files, it took a few hours to fix the tags, but the result was a good memory. It was finally time to pre-translate the original source version in order to regenerate the translated version. At this point, we had to switch back to the original source because using our fixed source would have meant returning files with a changed structure (added or removed tags) to YMCA Tours. But the end of this long process seemed to be in sight.

The new surprise, however, was that SDLX would simply crash when trying to pre-translate the source files. As soon as it found a problem with the tags (as it had done during the alignment process), it would stop, tell us it found unbalanced tags (without telling where the problem was in the file) and stay blocked. It was impossible to exit the program without interrupting it with the task manager. After all this work and a good memory, this was extremely frustrating. Thinking that perhaps the problems had been fixed in the latest version of SDLX, we tried the process on SDLX 2005 and reran the analysis.

Two surprises emerged. First, the problem of the tags had been partially fixed, as the program did not crash any more, but it still refused to pre-translate files that contained unbalanced tags. Second, the results of the analysis were extremely poor, much lower than when using SDLX 3.5. In fact, we lost 13% of our 100% matches (see table “Losses”).

                                   SDLX 2005            SDLX 3.5
Batch statistics                Words   Share        Words   Share
Words translated (confirmed)        0      0%            0      0%
Words 100% matched              18208     85%        21174     98%
Words 95% to 99% matched         1032      5%           88      0%
Words 85% to 94% matched          922      4%           73      0%
Words 75% to 84% matched          315      1%           23      0%
Words untranslated               1034      5%          157      1%
Total                           21511                21515

Losses: 13% of the 100% matches were lost when moving the memory from SDLX 3.5 to 2005.
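The percentages in the table, and the 13-point loss, can be rechecked from the raw word counts. The short calculation below uses only the figures reported by the two analyses; the variable and function names are ours.

```python
# Word counts per match band, as reported by each version's analysis.
counts = {
    "SDLX 2005": {"100%": 18208, "95-99%": 1032, "85-94%": 922,
                  "75-84%": 315, "untranslated": 1034},
    "SDLX 3.5": {"100%": 21174, "95-99%": 88, "85-94%": 73,
                 "75-84%": 23, "untranslated": 157},
}


def shares(bands):
    """Return the total word count and each band's rounded percentage."""
    total = sum(bands.values())
    return total, {band: round(100 * words / total)
                   for band, words in bands.items()}


total_2005, pct_2005 = shares(counts["SDLX 2005"])
total_35, pct_35 = shares(counts["SDLX 3.5"])

# Drop in the share of 100% matches, in percentage points.
loss_points = pct_35["100%"] - pct_2005["100%"]
# The same drop expressed in words.
loss_words = counts["SDLX 3.5"]["100%"] - counts["SDLX 2005"]["100%"]
print(total_2005, total_35, loss_points, loss_words)
```

The 13% figure is therefore a difference of percentage points (98% against 85%), corresponding to 2,966 words that no longer matched at 100%.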

What conclusions could we draw from this? The segmentation algorithm must differ significantly between the older version 3.5 of SDLX and the new version 2005. This could be acceptable, given that there are a few years between these two versions. Nonetheless, we think that a legacy segmentation option should have been retained (we have only seen default, paragraph and TRADOS segmentation rules in SDLX options).

Would a customer who has been building memories with millions of translation units be very happy to discover that the rate of 100% matches dropped dramatically upon upgrading translation tools because the developers had changed the segmentation rules of the product? Furthermore, when talking to SDLX support, we were told that they were aware of the unbalanced tags problem with the HTML filter.

Exchange of memories using TMX — some misgivings

To finalize the exercise, we tried exporting the memory built with SDLX (and standard SDLX segmentation rules) to TRADOS through the TMX option. An analysis in TRADOS yielded less than encouraging results as to any kind of reliable portability of memories, giving us 52% of 100% matches.
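One plausible reason for such a drop is that a TMX file carries the segments exactly as the exporting tool cut them, while the importing tool segments the source text by its own rules. The sketch below illustrates this with a tiny hand-written TMX sample and two hypothetical segmentation rules (one of which also breaks after a colon). Neither rule is claimed to be the actual SDLX or TRADOS rule, and the header attribute values are placeholders.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="html"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Prices include: meals and transport.</seg></tuv>
      <tuv xml:lang="fr"><seg>Les prix comprennent : repas et transport.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Book early.</seg></tuv>
      <tuv xml:lang="fr"><seg>Réservez tôt.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def load_tmx(tmx_text, src="en", tgt="fr"):
    """Read translation units into a {source: target} dictionary."""
    memory = {}
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            memory[segs[src]] = segs[tgt]
    return memory


def segment(text, break_after):
    """Break after any character in break_after followed by whitespace."""
    pattern = r"(?<=[" + re.escape(break_after) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]


def exact_matches(text, memory, break_after):
    segments = segment(text, break_after)
    return sum(1 for s in segments if s in memory), len(segments)


memory = load_tmx(SAMPLE_TMX)
text = "Prices include: meals and transport. Book early."
print(exact_matches(text, memory, ".!?"))   # same rule as the exporter
print(exact_matches(text, memory, ".!?:"))  # importer also breaks on ":"
```

With the exporter's rule both segments are found; with the second rule the first sentence is cut in two and only one of three segments still matches, even though the memory itself is unchanged.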

As the technical exercises carried out to repair the project files bear witness, certain misgivings arose with regard to use of professional translation tools. Of course, special care should always be taken when choosing tools for translation. This helps avoid unpleasant surprises. But we also encountered other surprising issues. Although some SDLX algorithms appeared to perform more successfully (such as for the alignment), TRADOS seemed to be more stable than SDLX.

An SDLX memory imported into TRADOS gave 52% matches.

Now that SDL has bought TRADOS, we have been told by SDL that they will maintain both products for a while but that in the future the good and bad of both products will be taken into consideration to build the “next generation” tool. What are the implications of this? We have seen that SDL changes the segmentation rules between different versions of its products and in so doing does not preserve the capital of TMs created by its customers. What will happen to memories in the future if they cannot be easily transferred from one product to another? When tomorrow the “next generation” tool is built, which segmentation rules will apply, and what percentage of our 100% matches will be lost?

The project continues

Since the summer technical repairs, some students have returned, and new faces have appeared to help review the files and perfect the images. When hand-corrected printouts were destroyed in a fire that incinerated the professor’s home, the students were quick to assist, retracing their steps, repeating their work and bringing new life to the project. Even though the website could not be launched at the desired time of the international YMCA conference, a special event was organized on October 14, 2005, to welcome guest of honor Patricia Sarzosa, executive director of the YMCA Ecuador, who attended a meeting in Montreal of YMCA executive directors from across North and Latin America.

Clavier and other students continue to verify the content and carry out quality control. Like many professionals who have worked on localization projects, the students experienced highs and lows, and the opportunity to collaborate and participate in a meaningful project enriched their academic, professional and personal lives.

Debbie Folaron is an assistant professor at Concordia University, where she teaches translation and translation technologies. Philippe Mercier is a partner in Locordia SA, a Brussels-based localization company. Questions or comments? E-mail