Do-it-yourself MT

As demand has grown rapidly for customized machine translation (MT) solutions, so has demand for more user control. Once financially restricted to global enterprises with serious budgets, customized MT has become an accessible tool for the translation community, though there is still resistance among some parts of the community, which presents an obstacle to mass adoption. Four businessmen tackle this topic and share their knowledge about the different levels of user control that can be achieved in a do-it-yourself environment.

Early innovators built the case for rule-based systems, which were quickly adopted by organizations that could see the potential and were happy to spend time and money investing in knowledge to gain the early benefits. The early majority then sat up and took real notice as statistical machine translation (SMT) systems were developed. Moses opened up the field, though it is still considered by some to be the domain of those with know-how and a budget in their favor. In recent years, however, there has been a surge of demand for systems that are customized and accessible, and it is user-friendly, customized MT that has driven the early stages of the do-it-yourself MT boom. “Yes, there are still those who want to know the full workings behind their customized MT solution, but the mass adopters don’t want or need to know everything; they simply expect a solution that can be implemented quickly and produces great results,” said Gavin Wheeldon, CEO of Applied Language Solutions.

Wheeldon understands that while a one-size-fits-all approach to MT doesn’t cater to everyone’s needs, making MT accessible to the masses is critical for the evolution of the translation market. The cost models vary considerably, but there is one common objective that all vendors are focused on: achieving quality that generic engines such as Google and Bing can’t realistically compete with. Also, whether through a fully customized solution, with the support of experienced experts, or through a slightly less sophisticated do-it-yourself personalized solution (Table 1), MT should be available to everyone rather than only those buyers perceived to have bottomless budgets. “Customizing anything is invariably going to be better, but it is not possible for most organizations, and more generic solutions still deliver fantastic results,” said Wheeldon.

While many are still not in the position to build and deploy their own customized engines using Moses, the desire to contribute thoughts and demand innovations regardless of the level of involvement in the design and build is already evident. In June 2011, Applied Language Solutions surveyed translators, language service providers (LSPs) and enterprise companies to establish what they expected from MT. The results indicated that even those not utilizing MT within their translation workflow at that time still expected any proposed solution to enable user control.

Manuel Herranz, CEO of Pangeanic, had a similar experience when consulting potential users: “Some of our big LSP accounts were just desperate to embrace a translation automation strategy that accelerated and reduced costs for their multilingual content production life cycles. Increasing translation productivity through translation automation and post-editing MT output was to be the answer. But what MT system could really be of use? We and our customers were in need of customized, fully-tailored solutions. We could not afford the time or the cost to add hundreds of syntax or lexical rules to existing systems. After evaluating commercially available MT systems and learning about MT, moving forward to stats-driven MT development and consulting simply had to happen.”

Tom Hoar, managing director of Precision Translation Tools (PTTools), echoed the opinion that MT needs to be accessible to a wider user group. “We had one simple goal around MT,” he explained, “and that was to create a new MT community. This community would then benefit from academic innovations in MT while enjoying the freedom to explore their own imaginations. Over time, the community members would then be in the position to contribute their own innovations.” The strategy behind PTTools allows more user control through the development of Do Moses Yourself (DoMY). The Community Edition of DoMY promised users a complete, fully functional, free Moses SMT platform in just a few hours, with the user manual including basic information about SMT to enable the user to create translation engines using their own data. The tools themselves, of course, aren’t enough for any user to create translations, any more than installing Microsoft Office creates an accountant or installing PowerPoint creates a public speaker. Removing the barrier to Moses, however, could help create an entirely new category of MT users, by enabling them, educating them, enhancing the tools at their disposal and enrolling more users to support the community.

Herranz agreed with the tool-empowerment approach. “The do-it-yourself breakthrough of last year and our decision to launch our own PangeaMT platform is precisely the result of constantly listening to our customers and heterogeneous user base.”

“Nobody is suggesting for a minute that empowering users will turn them into SMT experts overnight — far from it,” added Wheeldon. “What it does allow, however, is for the user to build an engine using existing translation assets in a short period of time, producing good quality output at a fraction of the cost.”

Hoar went into more detail about user education for those who do intend to invest time in learning about SMT to build their own systems. “As technology transitions from ‘bleeding edge’ to ‘leading edge,’ it often advances faster than users’ skills. Such is the case with SMT. Therefore, beyond having the tools, users must learn the concepts and develop the skills that make SMT possible. When working with DoMY, users quickly learn that although reusing translation memories (TMs) can create good MT engines, TMs alone may not be enough to generate great translations.”
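For readers wondering what reusing a TM looks like in practice, the following is a minimal sketch, not the DoMY workflow itself. It assumes the TM has been exported as TMX and pulls out the source and target segments as sentence pairs, the one-sentence-per-line parallel form that Moses training normally takes as input. The function name tmx_to_pairs and the sample data are hypothetical, and inline formatting tags are simply dropped.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) sentence pairs from a TMX document string.

    Hypothetical helper for illustration: translation units that lack
    either language are skipped, and inline tags are discarded.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Newer TMX files use xml:lang; older ones use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() keeps the text around inline tags and drops the tags.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# Invented sample data, used only to show the shape of the input.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Press the <bpt i="1"/>power<ept i="1"/> button.</seg></tuv>
      <tuv xml:lang="es"><seg>Pulse el botón de encendido.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Unmatched segment.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = tmx_to_pairs(SAMPLE_TMX, "en", "es")
```

Writing the first element of each pair to one file and the second to another, line by line, gives the two aligned files a training run would start from.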

As late as 2010, experienced IT engineers often invested days or weeks to resolve undocumented problems when installing Moses components, and this is something that DoMY addresses. But not all users would opt for this amount of intervention, and some simply don’t have the desire to learn how to build upon Moses themselves. Many are happy using a platform that enables them to simply build MT engines, based on the existing expertise of the vendor. There is, however, one area that remains the responsibility of the user, and this is the selection of their own translation assets used to customize the engines.


Data in a brave new world

Regardless of how sophisticated the MT systems become, the same basic rule will always apply: bad quality in will equal bad quality out. However, even if the quality going in isn’t great, poor quality training data can still be cleaned considerably using automation, meaning that SMT is not restricted to the data-rich. Wheeldon was adamant that there are no better potential mass adopters of do-it-yourself MT than translators themselves, based on how well they know their training data and therefore the output that can be achieved with minimal intervention from the experts. “The vast majority take huge pride in keeping their TMs and glossaries up-to-date and clean, both of which are critical to the level of quality that can be achieved when building an engine. Quality over quantity most definitely applies when dealing with SMT and blindly adding more data is not a guaranteed recipe for success. On one occasion recently we threw away 1.2 million sentence pairs of a client’s data prior to building a customized engine for them.” Data cleaning must be applied regardless of how the training data was generated, and in SmartMATE, for example, this includes the application of a large number of pre- and post-processing rules.
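To give a flavor of what automated cleaning can mean, here is a minimal sketch of a few common filters for sentence pairs. It is not a description of SmartMATE’s rules, which the vendors do not detail here; the function name clean_pairs, the thresholds and the sample data are all assumptions made for illustration.

```python
def clean_pairs(pairs, max_ratio=3.0, max_tokens=80):
    """Return the sentence pairs that pass a few simple sanity checks.

    Illustrative only: max_ratio and max_tokens are assumed values,
    not settings taken from any particular MT platform.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        # Collapse runs of whitespace and trim both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue  # one side is empty
        if source == target:
            continue  # likely an untranslated segment (may also drop product names)
        src_len = len(source.split())
        tgt_len = len(target.split())
        if src_len > max_tokens or tgt_len > max_tokens:
            continue  # overly long segments are often misaligned
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # lengths differ too much to be a plausible translation
        key = (source.lower(), target.lower())
        if key in seen:
            continue  # duplicate pair
        seen.add(key)
        kept.append((source, target))
    return kept


# Invented sample data: only the first pair should survive.
sample = [
    ("The printer is offline.", "La impresora está desconectada."),
    ("The printer is offline.", "La impresora está desconectada."),
    ("Click OK.", ""),
    ("Model X200", "Model X200"),
    ("Yes", "Sí , por supuesto , lo haremos mañana"),
]
cleaned = clean_pairs(sample)
```

Filters like these are deliberately blunt. They explain how a large share of a data set can be discarded before training without necessarily hurting the resulting engine.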

Enhanced MT performance can also be achieved by choosing the more hands-on approach of DoMY. Users address and resolve advanced topics such as appropriate hardware selection and repurposing training data. As a result, several DoMY users have invested less than $2,000 in new hardware that accelerates the training and tuning of MT engines to hours instead of days. Other users learned that operating an MT system does not match their core competencies and switched to online self-serve or full-service companies in the new MT community. This is where the do-it-yourself solutions that don’t require detailed user knowledge come into play.

There are clearly different levels of user intervention being demanded within the market and it isn’t possible to categorize this by size or value of the user company. It is possible to provide fully customized solutions for large global organizations that have no desire to understand any part of the MT process other than the end result and quality scores.

“The web is populated with myriads of specialized software services, giving enterprises and application developers the freedom to mix and match them into applications that fit their needs. Do-it-yourself MT is part of this new paradigm,” said Tilde’s CEO, Andrejs Vasiļjevs. Tilde’s involvement in the European Union ICT-PSP Programme project, LetsMT! allows users to access an open platform to build customized MT systems using shared data resources. These resources include publicly available parallel texts, pre-trained engines and tools to use the MT engines in various production scenarios.

There is much debate in the public domain regarding data quality, data security and data sharing, all of which do need to be openly addressed in order to eliminate the uncertainty for potential users. Vasiļjevs understands the fears: “MT continues to make many industry practitioners wary. Many are concerned with practical questions such as privacy, confidentiality and cost. On the one hand, they are burning to try MT, and on the other, they are unsure how to start.” LetsMT! encourages users to try MT in a secure, cloud-based environment with a friendly user interface that gives them everything they need to create their own customized MT engine and put it to work.

Privacy is something that Applied Language Solutions is acutely aware of. “We learned very quickly from our initial testing group that data privacy needed to be clearly addressed, whether in the beta version of our SmartMATE terms and conditions, or in any e-mail communication where we were asking people to test the system,” said Wheeldon. “Covering it in our FAQs simply wasn’t enough given the level of protection that people want around their own carefully guarded data.”

Any do-it-yourself engine builder should be able to assume that no unauthorized individual or company can access or use its training data, but for many the data isn’t there or isn’t good enough, and this is where data sharing needs to be encouraged. Both Pangeanic and Applied Language Solutions have long been supporters of and contributors to TAUS Data Association (TDA) and join other data uploaders including Dell, Oracle, Microsoft, Intel, Adobe and PTC in sharing hundreds of millions of words.

Herranz is a firm believer in safe data sharing and has been a public advocate of TDA for several years. “In 2009, PangeMatic, our internally used MT system, was born. This was possible to a large extent due to the extensive data pooling from TDA. That same year our first translation engines were commissioned by our strongest corporate clients, with some of them keen to share their own translation data and become TDA donors themselves through Pangeanic.”

Similarly, the LetsMT! consortium encourages users such as researchers and industry practitioners to donate their parallel data for public use. It provides vast preseeded publicly available data for MT customization that users can then complement with their own data. “An important concept to encourage the donation of data, however, is to restrict access to this data,” said Vasiļjevs. “Although the public data may be selected to train systems within the LetsMT! platform, it cannot be accessed for reading or downloading and can only be used within the training system. In this way, the consortium seeks to engage users in providing resources that can result in better MT engines, particularly for hard to resource languages.” Enterprises can also opt for private space on the LetsMT! platform, where they can safely and securely upload data and use it to customize their engine while keeping it private. It is this choice that many believe needs to be available to the users.

For those that are resource-starved, systems such as SmartMATE, and LetsMT! in the near future, allow TDA members to access data through an API from the TDA repository in order to train their engines. Herranz doesn’t believe data is everything, however: “We are keen followers of the data-driven MT approach, but it is essential to have the right MT core components and peripheral modules in place. In our own SMT approach we work hard to research and test new techniques related to Moses. We keep a flexible and open mind. We do not disregard some hybridization efforts for distinct languages, such as Japanese — a language we have made significant inroads with on our own and in cooperation with Toshiba Corp.”

The clearing road to mass adoption

Integration capabilities are critical to the success of customized MT services, whether managed by a vendor or hosted as a do-it-yourself solution. “Integration with a number of popular translation management systems and working together with tool developers makes the adoption of MT much less painful for the end user,” said Wheeldon.

Tilde is also keen for LetsMT! to be accessible in a similar way. Vasiļjevs commented, “Once you have plunged in to train an engine, you will also grasp the power and interconnectivity that cloud computing puts at your fingertips. The engines that you train become available to use through CAT plug-ins, widgets and APIs to use on demand. The system you have trained resides in the cloud and the results are hosted there too, meaning you can access the results anywhere, anytime. Integrating with CAT solutions such as SDL Trados means the user can realize productivity gains combining translation memories with MT suggestions. You can build it into web pages for cost-effective solutions when you need on-demand multilanguage support.”

And what of the minefield of pricing models that are now being promoted within the community? As with any technology evolution, the performance is increasing and the prices are falling, which is opening up the market significantly. Technology is all about automation and the ultimate in automation is the automation of automation — in this case a do-it-yourself, click-and-build MT solution. “One of the pain points and sad inheritances from the traditional translation industry is the willingness of many to keep on offering MT output-only solutions under word-based pricing tiers. This may look like a comfortable way of putting a value against MT, but it does pose a problem to those organizations that have large volumes of content to machine translate and, most importantly, are interested in knowing what happens behind the scenes with their engines and the data used to create then retrain their solutions,” said Herranz.

So what of the future? Vasiļjevs believes more tools and user control will become ever more important: “The next wave of developments will augment the existing mix and match list of tools with exciting new web services such as quality MT for smaller languages by integrated linguistic knowledge, terminology identification and pretranslation, automated extraction of language resources from the web.” For those who are passionate about MT and about total user influence and control, the new MT community has become a reality. New users who are clear of obstructions and armed with tools and knowledge are liberated to innovate on their own.

The new providers are enthusiastic about encouraging the sharing of data, knowledge and experience. The choice is in the hands of the buyer, whether they simply pay for hosting and accept that their do-it-yourself engine output may sometimes require some post-editing, opt for the ability to build and tweak engines using Moses knowledge or outsource the whole thing to one provider. Choice brings competitiveness and we can expect to see dramatic changes, where license fees are no longer the only option for those wishing to utilize the fantastic resource that is do-it-yourself MT. Every user from a freelance translator to a multinational organization can now go to market and expect to be offered extremely attractive solutions.