Inside the brackets

People delivering products and services in translation, multilingual content management and related fields need to appreciate the real importance of the debate raging about the semantic web, which increasingly looks like another nerds’ battleground. One problem is how to build the metadata and/or taxonomies – the bracketed tags – needed to help documents and services bind automatically and seamlessly into useful resources. Should this be done top-down, through small technical committees that set standards, or bottom-up, through the vast distributed contributions of millions of users aided by simple aggregation techniques?

The solution chosen – perhaps a fudge of solutions – matters because language issues emerge at every step of the way. Will multilingual taxonomies need to be built, or will they emerge mushroom-like from mega-scale group practices, prompting in turn the offer of further taxonomy aggregation services as a niche market? Should public money be thrown at large-scale semantic web R&D projects that slide into obsolescence before they are completed, or should it be channeled into supporting focused language technologies for minority communities that tend to be crushed by the invisible hand of the market?

The following comment is from Peter Norvig, Director of Search Quality at Google, speaking at SDForum’s Semantic Technologies Seminar. He highlights the granularity of the language problems confronting web searching – it’s the data and whether it’s been spell-checked, not the metadata, stupid. Interestingly, he is mainly a bottom-up guy when it comes to solving the who-does-what issue of the semantic web:

Semantic technologies are good for essentially breaking up information into chunks. But essentially you get down to the part that’s in between the angle brackets. And one of our founders, Sergey Brin, was quoted as saying, “Putting angle brackets around things is not a technology by itself.” The problem is what goes into the angle brackets. You can say, “Well, my database has a person name field, and your database has a first name field and a last name field, and we’ll have a concatenation between them to match them up.” But it doesn’t always work that smoothly.

Here’s an example of a couple days’ worth of queries at Google for which we’ve spelling-corrected all to one canonical form. It’s one of our more popular queries, and there were something like 4,000 different spelling variations over the course of a week. Somebody’s got to do that kind of canonicalization. So the problem of understanding content hasn’t gone away; it’s just been forced down to smaller pieces between angle brackets. So there’s a problem of spelling correction; there’s a problem of transliteration from another alphabet such as Arabic into a Roman alphabet; there’s a problem of abbreviations, HP versus Hewlett Packard versus Hewlett-Packard, and so on. And there’s a problem with identical names: Michael Jordan the basketball player, the CEO, and the Berkeley professor.
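
To make the canonicalization point concrete, here is a minimal sketch in Python of what mapping variants to one form might look like. It is not Google's method: the alias table, the list of canonical forms, the function names and the similarity cutoff are all assumptions made up for illustration, and the fuzzy matching simply uses the standard library's difflib.

```python
import difflib
import re

# Hypothetical alias table: known abbreviations mapped to a canonical form.
ALIASES = {"hp": "hewlett packard"}

# Hypothetical list of canonical forms we want variants to collapse into.
CANONICAL = ["hewlett packard"]


def normalize(text):
    """Lowercase, turn hyphens and other punctuation into spaces, trim."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def canonicalize(query, cutoff=0.8):
    """Return a canonical form for the query, or its normalized text if none fits.

    The cutoff of 0.8 is an arbitrary assumption, not a tuned value.
    """
    key = normalize(query)
    key = ALIASES.get(key, key)  # expand abbreviations first
    if key in CANONICAL:
        return key
    # Fall back to approximate string matching for spelling variants.
    matches = difflib.get_close_matches(key, CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else key


for variant in ["HP", "Hewlett-Packard", "Hewlet Packard", "Michael Jordan"]:
    print(variant, "->", canonicalize(variant))
```

A sketch like this handles abbreviations, hyphenation and simple misspellings, but it does nothing for the last problem Norvig mentions: three different people called Michael Jordan produce the same string, so telling them apart needs context from outside the brackets.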

Andrew Joscelyne
European, a language technology industry watcher since Electric Word was first published, sometime journalist, consultant, market analyst and animateur of projects. Interested in technologies for augmenting human intellectual endeavour, multilingual méssage, the history of language machines, the future of translation, and the life of the digital mindset.



MultiLingual Media LLC