Translation in the blogosphere

How can translation automation be tapped for the online world of blogs to ensure that information and ideas circulate freely in dark times? Tim Oren has been pushing the debate on this issue of ‘cultural bridges’, which brings together some of the threads in my own blog. The dilemma is clear enough and not new: are people prepared to work happily with machine translation output since it is better than nothing, or will the quality of that output and the attendant noise damage the actual conversation they are trying to have?

As far as I understand it, the background to this debate lies in the emerging power of blogging as social software and a personal publishing platform. There is something of that early web excitement about information trying to be free, of cyberspace being extraterritoriality incarnate. See reports on the Global Voices event, and here for the broader concept of blogging as the new communication back-channel.

Now that millions of individuals around the world are being empowered, if that’s the right expression, by blog software to file reports from sites that the news media cannot always reach, or expose and share their personal lives in interesting new ways, or engage in intense discussion about issues of moment, far from the exclusive institutions of the nation state, the language barrier suddenly looms up like a wall of ice from the sea of words. How can we free the meanings encoded in those alien tongues?

There are of course dozens of free automatic translation services (usually restricted to very short texts) on the web, as well as a growing shelf-full of low-price (but higher-volume) solutions. But Oren seems to think that the blog community should have its own (open-source) translation solution, one that avoids the awkward interfacing and quality constraints of existing web automatic translation solutions. His proof of concept, interestingly, comes from the perhaps little-known experiment on CompuServe in the mid-90s where users in the World Community Forum would converse both through and around the act of automatic translation, suggesting that where there is a will to communicate there’s usually a way round the problems.

In other words, is there an opportunity here to grow a grass-roots translation automation project? Precedents would include the EU’s infamous Eurotra project, which failed in its attempt to implement a total many-to-many translation infrastructure but nevertheless seeded the EU with a useful set of translation engineering centers for some 11 languages. Eurotra was largely predicated on a pre-web, highly bureaucratic conception of project management. Perhaps a new automatic translation project for the blogosphere could be set up and managed as a Linux-type bazaar initiative, rather than a cathedral-like Eurotra. Especially since time is of the essence.

Closer to the spirit of freedom is the ongoing UNL project, which still seems to be known only to the cognoscenti, yet, if successful, is set to become a genuine linguistic infrastructure for a networked world. UNL is the brainchild of Japanese experts and is being developed by a multitude of local-language computational linguistics labs around the world. The UNL system is basically rule-driven (statements are encoded into a UNL lingua franca semantics for later decoding into the local language) rather than data-driven (i.e. translations derived from statistical patterns in existing databases of bilingual texts), and its language spread is heavily Asian-oriented. UNL patented its technology earlier this year, so perhaps it would not suit the blogosphere’s mindset anyway. But it might be worth checking out whether the UNL protocols could slip neatly into the new blog space.
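The interlingua idea at the heart of UNL can be illustrated with a toy sketch: analyse a sentence once into a language-neutral record, then generate it into any target language with a per-language decoder. To be clear, this is not the real UNL encoding (which uses a rich graph of "universal words" and semantic relations); the mini-interlingua of (agent, relation, object) triples below, and the trivial "X likes Y" analyser, are invented purely for illustration.

```python
# Toy interlingua pipeline: encode once, decode into many languages.
# The encoding scheme here is a made-up stand-in for UNL's far richer one.

def encode(sentence_en):
    """Analyse a trivially simple 'X likes Y.' sentence into a
    language-neutral triple (a stand-in for a UNL graph)."""
    agent, verb, obj = sentence_en.rstrip(".").split()
    assert verb == "likes", "toy analyser only covers 'likes'"
    return {"rel": "like", "agt": agent.lower(), "obj": obj.lower()}

# One decoder per target language, all reading the same interlingua record.
DECODERS = {
    "en": lambda u: f"{u['agt'].capitalize()} likes {u['obj']}.",
    "fr": lambda u: f"{u['agt'].capitalize()} aime {u['obj']}.",
    "es": lambda u: f"A {u['agt'].capitalize()} le gusta {u['obj']}.",
}

def translate(sentence_en, lang):
    """Encode the source sentence, then decode into the target language."""
    return DECODERS[lang](encode(sentence_en))

print(translate("Maria likes blogs.", "fr"))  # → Maria aime blogs.
```

The point of the architecture is that adding an (n+1)th language means writing one encoder and one decoder, not n new language-pair systems.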

One interesting aspect of the UNL system is that the writing process itself can be adapted to enhance the translation process (known as dialog-based machine translation). This has the advantage of using human ingenuity to tweak text into something the universal semantic interlingua understands. Your blog would, as it were, be automatically translated into and stored as a hidden Esperanto as you write, awaiting the call to be decoded into any available language. However, the work of encoding the spontaneous, fast-changing idiom of blogs and emails might defeat the very aims of the exercise.
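The dialog-based idea can be sketched as an authoring loop: the tool accepts only sentences its encoder can handle, stores those as interlingua records, and flags everything else back to the writer to rephrase. The tiny controlled grammar below (three verbs, three-word sentences) is an invented assumption, not anything UNL specifies.

```python
# Sketch of dialog-based MT at authoring time: encode what we can,
# bounce the rest back to the human. The "grammar" is a toy assumption.

KNOWN_VERBS = {"likes", "reads", "writes"}  # hypothetical encoder coverage

def try_encode(sentence):
    """Return an interlingua record, or None when the sentence falls
    outside the controlled language the encoder understands."""
    words = sentence.rstrip(".").split()
    if len(words) == 3 and words[1] in KNOWN_VERBS:
        return {"agt": words[0], "rel": words[1], "obj": words[2]}
    return None  # the tool would prompt: "please rephrase this"

def author(draft_sentences):
    """Split a draft into stored interlingua and sentences to reword."""
    stored, needs_rephrasing = [], []
    for s in draft_sentences:
        rec = try_encode(s)
        if rec is not None:
            stored.append(rec)          # kept as the "hidden Esperanto"
        else:
            needs_rephrasing.append(s)  # writer is asked to reword
    return stored, needs_rephrasing
```

The friction this loop imposes on a writer is exactly the worry raised above: spontaneous blog idiom resists being squeezed into a controlled language.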

Other projects that currently deal with parts of a potential translation infrastructure include the multilingual Wikipedia, whose article set could provide a multilingual resource for some sort of translation engine to work on, and the Free Dictionaries project. This embryonic effort aims to produce a huge range of dictionaries through grass-roots lexicography, again driven by the availability of willing helpers working in a community.

Ideally, the development of such piecemeal dictionaries over many languages might be a useful asset to an online translation system, provided the databases are easily accessible as an online resource for different applications. Coverage will always be a problem (languages keep growing), as is lexical customization (a word is used in different contexts with different meanings – hence the essential need to disambiguate). But the plan to span very many languages might forge a handy learning environment in best practices for future generations of dictionary makers.
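The disambiguation problem such a community dictionary would face can be shown in miniature: one headword, several senses, and some mechanism for picking the right one from context. The entries and the naive cue-word-overlap heuristic below are invented for illustration; real disambiguation uses far richer evidence.

```python
# Minimal sketch of sense disambiguation against a (hypothetical)
# grass-roots dictionary: pick the sense whose cue words best match
# the surrounding context.

SENSES = {  # invented sample entries, not from any real dictionary
    "bank": [
        {"gloss": "financial institution",
         "cues": {"money", "account", "loan", "deposit"}},
        {"gloss": "river edge",
         "cues": {"river", "water", "fishing", "shore"}},
    ],
}

def disambiguate(word, context_words):
    """Return the gloss of the sense with the largest cue/context overlap."""
    best = max(SENSES[word],
               key=lambda sense: len(sense["cues"] & set(context_words)))
    return best["gloss"]

print(disambiguate("bank", ["we", "opened", "an", "account"]))
# → financial institution
```

Even this toy makes the coverage point above concrete: every new sense, cue word, and language multiplies the lexicographic work, which is why an army of community volunteers matters.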

The standard criticism of this sort of amateur language work is that the data will lack proper quality assurance, that the information may not be correct, and, more fundamentally, that open sourcing is not much good at bringing very complex projects to fruition. On the first two counts, I imagine the best response is that the very openness of the approach means that in the end it acts as a self-correcting mechanism, and that enough eyes looking at linguistic data will eventually be able to iron out most errors and absurdities. As for organization, time will tell. If collaborative tools can be developed that help solve some of the organizational problems of very large horizontal projects, then some of these handicaps will disappear. The trick will be to persuade enough bilinguals to kick-start a project that is in a real sense the very paradigm of open sharing: after all, language itself is the site of our deepest identity, as well as the prime vehicle through which we access the identities of others.

Andrew Joscelyne
European, a language technology industry watcher since Electric Word was first published, sometime journalist, consultant, market analyst and animateur of projects. Interested in technologies for augmenting human intellectual endeavour, multilingual messaging, the history of language machines, the future of translation, and the life of the digital mindset.

