Using Let'sMT! to teach about SMT

By Hanne Fersøe, Dorte Haltrup Hansen, Lene Offersgaard, Sussi Olsen & Claus Povlsen
December 17, 2013

From March 2010 to August 2012, a European consortium developed the Let’sMT! translation platform with support from the European Commission’s information and communications technology policy support program. The platform offers cloud-based, user-tailored statistical machine translation (SMT) and online sharing of training data.

The platform is aimed at professional use in localization, where it can be accessed directly from an online translation web interface. It can also be integrated with the translation memory systems SDL Trados and MemoQ via plug-ins.

The user interface integrates the open source Moses SMT system and thus frees the user from the technical tasks of downloading and installing the toolkit. This is an attractive feature for all users, but especially from the perspective of a research and education unit such as the Centre for Language Technology at the University of Copenhagen. Our goal was to avoid introducing technical difficulties to students and still be able to teach them the basic principles of SMT with a real hands-on component. Additionally, the fact that the system is cloud-based was in its favor because the students could access the system during class from their own laptops without involvement from the university IT department. For these reasons, we decided, a bit late in the semester and as an experiment, to introduce a small Let’sMT! module into already ongoing machine translation (MT) teaching plans, for both bachelor-level and master’s-level students. The purpose was to gain experience with the platform and to assess its adequacy as a way to improve the students’ learning.

The main focus of Let’sMT! is to offer a translation platform on which user-provided content can be used to build MT systems from scratch with state-of-the-art SMT technology. Users may sign up and use the platform as-is, with already existing data and systems, or they may build their own data repositories and translation systems. From a technical point of view, it turned out to be easy for the students to build translation systems and to use them.

The training data available to users in the Let’sMT! repository consists of large, well-known, publicly available corpora such as Europarl, DGT-TM Acquis Communautaire and the OPUS corpora, all of which are often used for SMT systems. In addition to these resources, the platform also offers domain-specific data for several under-resourced languages. Currently the data covers nine different subject domains: biotechnology and health, education (Figure 1), electronics, environment, finance, IT, law, tourism, and national and international organizations and affairs. The focus was on the ten languages within the Let’sMT! consortium: Croatian, Czech, Danish, Dutch, Estonian, Latvian, Lithuanian, Polish, Slovak and Swedish. Parallel data pairing the Let’sMT! languages with widely spoken languages such as English is also available.

Users can also upload their own domain-specific parallel and monolingual data to the Let’sMT! repository, where the data is converted, stored and prepared for training standard SMT systems. User data can be classified as private or public. Private data can be seen and used only by the persons or organizations that the user has selected. Public data can be seen and used by all other users as well. The uploaded data and the trained SMT systems cannot be downloaded from the platform for reuse in another system or context; they can only be used within the Let’sMT! platform.
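The private/public rule described above can be illustrated with a minimal Python sketch. The names Corpus, shared_with and can_use are hypothetical and are not part of the Let’sMT! platform; the sketch also assumes that the uploader can always use their own data, which the text implies but does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Corpus:
    # Hypothetical record for an uploaded corpus (not the platform's data model).
    name: str
    owner: str
    public: bool = False
    # Persons or organizations the owner has selected for a private corpus.
    shared_with: set = field(default_factory=set)


def can_use(corpus, user):
    """Public corpora are usable by everyone; private ones only by the
    owner (an assumption) and the users the owner has selected."""
    return corpus.public or user == corpus.owner or user in corpus.shared_with


private_corpus = Corpus("course-data", owner="teacher", shared_with={"student_a"})
public_corpus = Corpus("shared-data", owner="teacher", public=True)
```

With these toy records, can_use(private_corpus, "student_b") is False, while can_use(public_corpus, "student_b") is True.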

From a web interface, registered users can perform cloud-based system training of domain-specific SMT systems based on data uploaded to the Let’sMT! resource repository. The data can be any combination of resources uploaded for public use on the platform or resources uploaded by the user for private use only. The data selected for system training can be parallel as well as monolingual, and it can be in-domain or out-of-domain. The system training is carried out based on the Moses SMT software using in-domain and out-of-domain data weighting. Finally, the trained systems (Figure 2) are evaluated on either a user-defined evaluation corpus or on a system-generated evaluation corpus, using widely accepted evaluation metrics such as BLEU, NIST, TER and METEOR.

Cloud-based teaching

We regard our students as future advanced MT system users, and in addition to learning about the theoretical aspects of MT, we also want them to try out concrete systems. The purpose of this is not to teach them which buttons to press, but for them to learn what MT is, which different types of MT approaches are available, how these systems differ from one another in use and output, and what level of knowledge and skills they require from their users. So we want them to study MT in both theory and practice in order for them to become competent users with attractive skill profiles after their graduation, whether they aim at jobs as translators, translation planners, technical support staff, system developers or something else.

We discovered that Let’sMT! is well-suited for active student participation. SMT hands-on exercises usually require scripting and programming skills, which are not the main focus for BA students from language studies. Let’sMT! allows the students to get hands-on experience, both with the training process and with the use of aligned data, without time-consuming technical tasks.

Let’sMT! also offers free online user registration, which makes it possible for the students to sign up during the course. As users of the platform, they can translate texts with the available public systems, they can train systems using fewer than two million parallel sentences as training material, and they can also translate using their own systems. However, a free user account does not allow training data to be uploaded; this requires a license-fee account. The licensed teacher of the SMT course therefore has to upload extra training data for the students when relevant.

It is important to teach the students the role of data, because the linguistic knowledge in an SMT system is built on the quality of the training data. Often large, well-known corpora such as Europarl are used as training material, but the use of these corpora does not give a student a good understanding of the balance between data quality and data quantity. The selection of proper training resources is especially important for under-resourced languages and for translation of texts from specific domains.

Uploading parallel corpora as training material is easy for a large number of formats. The platform offers automatic format conversion and alignment of the data, and the user can inspect potential warning messages from the upload process and see the sizes of the resulting aligned corpora.

In the web interface, the workflow for training an SMT system is broken down into a number of steps (Figure 3). The progress of an ongoing training process can be followed by inspecting a training chart that gives information about both the complexity of the training process and the flow of processes. In addition, it shows the current state of the training process in light blue. This modular way of specifying the needed information for the training tasks leads the students through the process in a structured manner. Help texts are available, and the user gets hints and feedback if chosen options or corpora deviate from recommended usage.

The platform automatically calculates scores for the most common evaluation metrics. Students can investigate the metrics and the translation quality by downloading the evaluation corpus and the resulting translation.
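To give students a sense of what such a score measures, BLEU can be sketched in a few lines of Python. This is a simplified illustration, not the platform's implementation: it assumes whitespace tokenization, a single reference per sentence and no smoothing, and the function names are our own.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of clipped n-gram precisions
    (orders 1..max_n) multiplied by a brevity penalty."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # n-grams in the system output per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # An n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Output shorter than the reference is penalized.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


score = corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
```

Applied to a downloaded evaluation corpus and its translation, a function like this lets students see how word overlap with the reference drives the score. For scores that are to be reported or compared, an established implementation is preferable, since tokenization and smoothing choices change the numbers.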

The options to integrate Let’sMT! into translation memory systems can be used to show the students an example of a professional translation workflow.

Outcomes of the experiment

The BA students used a small SMT system trained with about 20,000 bilingual sentences (English and Danish) and 20,000 monolingual sentences, all within the subject domain of education. The output from this SMT system served as part of a comparative analysis, the aim of which was to get the students to understand the interdependency between types of training data, types of input text and the final translation quality. The students saw that the in-domain system performed better at finding the correct domain-specific terms, especially technical terms, although its limited coverage sometimes led to unexpected translations. This experiment made it clear to the students that translation quality depends on how well the training data matches the input text.
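The coverage problem mentioned above can be made concrete by counting how many words of an input text never occur in the training data. The following Python sketch is our own illustration with invented toy sentences, not data or code from the course; oov_rate is a hypothetical helper that assumes whitespace tokenization and lowercasing.

```python
def oov_rate(training_sentences, input_text):
    """Fraction of input tokens that never occur in the training data
    (the out-of-vocabulary rate)."""
    vocabulary = {tok.lower() for sent in training_sentences for tok in sent.split()}
    tokens = [tok.lower() for tok in input_text.split()]
    if not tokens:
        return 0.0
    unseen = [tok for tok in tokens if tok not in vocabulary]
    return len(unseen) / len(tokens)


# Invented toy training data from the education domain.
training = ["the student passed the exam", "the teacher graded the exam"]
rate = oov_rate(training, "the student failed the exam")
```

A system cannot translate a word it has never seen in training, so a small in-domain system with a high out-of-vocabulary rate on a given text is likely to produce gaps or unexpected output, even where its terminology is otherwise better.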

In a questionnaire completed at the end of the module, the BA students were asked about the types of training data required, as well as the correlation between training data and input text. Their assessment was given in the form of scores on a scale from one to five, and the average score was close to four for all the questions asked. We are very satisfied with this result.

The more advanced MA students were able to spend less time with the system, and for them we wanted to find out whether they would find Let’sMT! appropriate for creating their own experiments. They trained systems with different data sets and languages, and we wanted to see whether they would find the platform suitable for evaluation tasks with different evaluation metrics. They were also asked to fill in a questionnaire, not with scores but with written text responses.

The MA students reported that Let’sMT! worked well as an initial introduction to SMT and would have worked well at the beginning of their course. They saw it as an easy way to follow the different steps of how an MT system is created and works, and they found that the training chart in particular gave a good pedagogical overview. They also found it easy to train a system with the guidance offered on the platform. They found it very useful to be able to try out the training and use of systems with different types of data and language pairs to see which ones to use in a project. In addition, they found the easy access to a range of evaluation scores very useful, and some said that they would definitely use this function in future work. Thus, we concluded that this platform, developed for professional use in localization scenarios, is also a well-suited teaching and learning tool, not least because of its easy cloud-based access.