System functionality in language technology

By Andrzej Zydroń December 18, 2012

The last ten years have seen two real, practical advances in translation technology: statistical machine translation (SMT) and effective collaboration using server- and cloud-based translation management systems.

Both increase productivity and reduce the time it takes to complete a given localization project. The same period has also seen great advances in IT: the near-universal adoption of XML as the basis for all aspects of interchange and data definition, and the rise of big data, meaning the ability to store and process vast quantities of information and to mine it in order to benefit from this immense resource. Hand in hand with these advances, we have seen a constant and almost remorseless rise in internet connection speeds and geographical penetration.

XML is now ubiquitous: from Apache OpenOffice, Adobe FrameMaker and Microsoft Office through to all aspects of software engineering, including web services. The vast majority of translatable data is now in XML, from Adobe InDesign through to Word 2007 and above, and where it is not, it is very easy to convert it into XML and back. XML and Unicode have liberated the localization industry from the proprietary file formats and character encoding hell that existed before. The only remaining issue is handling XML as a native format, something that continues to perplex many computer-assisted translation tools.

We have also seen the publication of the standards that localization requires, and not surprisingly virtually all of them are defined as XML vocabularies, from TMX (Translation Memory eXchange) to GMX-V (Global information management Metrics eXchange Volume) to XLIFF (XML Localization Interchange File Format). All of these standards are encompassed in the OASIS OAXAL (Open Architecture for XML Authoring and Localization) reference architecture.
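To make the idea of an XML vocabulary for localization concrete, here is a minimal sketch of reading translation units from a TMX-style document with Python's standard library. The sample document and the function name read_translation_units are illustrative assumptions, not taken from any particular tool; the sample follows the general tu/tuv/seg shape of TMX and ignores inline markup and most metadata.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative sample only: a tiny TMX-style document with two translation units.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Close the window.</seg></tuv>
      <tuv xml:lang="fr"><seg>Fermez la fenêtre.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_translation_units(tmx_text):
    """Return one dict per <tu>, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() flattens any inline elements inside <seg> to plain text.
            text = "".join(seg.itertext()) if seg is not None else ""
            unit[tuv.get(XML_LANG)] = text
        units.append(unit)
    return units


if __name__ == "__main__":
    for unit in read_translation_units(SAMPLE_TMX):
        print(unit)
```

Because the format is plain XML, any general-purpose parser can read it; the harder part for a translation tool is preserving inline markup and metadata, which this sketch deliberately skips.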

So where do we go from here? We seem to be reaching the limits of what can be achieved with SMT. The European Commission-funded Moses project has spawned diverse commercial implementations. The issue with the many commercial SMT offerings is twofold: how to accumulate the required volume of aligned text, and how to develop the best delivery mechanism. The only practical vehicle for delivery is integration with computer-assisted translation tools.

SMT must be viewed within the context of its value to translators. At best it is an aid that can be used for monolingual translation, which is a euphemism for SMT post-editing by translators or editors with only a cursory knowledge of the source language. In essence, SMT offers more fuzzy matching, and there is a real ceiling on quality past which any further increase in the store of aligned text offers virtually no incremental benefit.
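For readers unfamiliar with the term, fuzzy matching means retrieving a stored translation whose source text is similar, but not identical, to the sentence being translated. The following is a minimal sketch that assumes a character-level similarity ratio from Python's difflib and a hypothetical in-memory translation memory; commercial tools use their own scoring methods, and the 0.75 threshold is an arbitrary illustration.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source sentence -> stored translation.
MEMORY = {
    "Save the file.": "Enregistrez le fichier.",
    "Close the window.": "Fermez la fenêtre.",
}


def best_fuzzy_match(source, memory, threshold=0.75):
    """Return (score, stored_source, stored_target) for the closest entry,
    or None when no entry reaches the threshold."""
    best = None
    for stored_source, stored_target in memory.items():
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the two strings share fewer characters in sequence.
        score = SequenceMatcher(None, source, stored_source).ratio()
        if best is None or score > best[0]:
            best = (score, stored_source, stored_target)
    if best is not None and best[0] >= threshold:
        return best
    return None


if __name__ == "__main__":
    print(best_fuzzy_match("Save the files.", MEMORY))
    print(best_fuzzy_match("Quantum chromodynamics", MEMORY))
```

A match below 100% still needs a human to reconcile the difference between the stored source and the new sentence, which is the same post-editing burden the paragraph above attributes to SMT output.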

Any further advances will require a significant change of direction: a move back to syntactical analysis and a hybrid approach incorporating rule-based translation and an abstract notation to denote meaning, thus allowing for multiple language combinations. There is much to be done in this field, and XML should be at the forefront of establishing the required formats. To be practical, this will require significant input for each language, along the lines of the Princeton WordNet and the associated Global WordNet. For this to prosper, an open licensing model will be required, as well as a great deal of work. This is an ideal task for the academic community, sponsored by industry and governments.

The architecture required to tackle this must be based on open standards, using XML vocabularies right down to the syntactical level. A distributed server architecture built on web service components, together with a big data approach of the type used by companies such as Google, Yahoo! and Facebook, is paramount.

The immediate challenge facing the localization industry is not primarily that of increasing the productivity of the translator. Translation typically accounts for less than 30% of the total cost of localization, and the take-up of translation tools within the industry has in any case been very slow, due mainly to high cost and rather poor functionality. The challenge is rather to eliminate manual processes and replace them with a high degree of automation. With over 80% of the localization industry made up of small and medium-sized enterprises, the provision of automation will bring significant benefits. To date, such facilities have been available only to the very large localization companies. However, the advent of cloud systems now provides a low-cost way for any localization company to benefit from advanced automation features and technologies such as portals and web service integration with customer systems. To be truly effective, such cloud offerings need to be based on open standards to ensure that companies are not locked in to their supplier.