Language on the web

By Christian Lieske September 26, 2012

Web-induced developments such as free online machine translation (MT) have changed the language industry. The change continues, and may even be accelerating, as several technologies and initiatives now focus on language on the web.

One way to look at language industry issues is to see the industry as comprising three dimensions: services offered by a supplier, processes and technologies behind a service, and language itself. The web has already influenced each of these dimensions. The ways suppliers and requesters do business with each other have, for example, changed due to e-commerce. Core translation processes use web protocols such as HTTP to update centralized translation memories. Neologisms related to the web are omnipresent, even if not everyone e-mails, googles or twitters.

The next wave of web-induced change for the language industry appears to be driven with some help from its own constituencies. Players involved in natural language processing and web internationalization are investigating and advancing the web’s capabilities to support high-quality, large-volume, effective and efficient processing in multilanguage contexts such as translation. These possible forthcoming changes seem to converge on what is currently termed the multilingual semantic web.

One way to understand the multilingual semantic web is to look at it from two complementary angles: the multilingual web and the semantic web. The multilingual web relates to all aspects of creating, localizing and deploying the web multilingually. Over the past two years, a network (multilingualweb.eu) of approximately 20 partners funded by the European Commission and coordinated by the World Wide Web Consortium (W3C) organized a series of four events to look at best practices, standards and possible gaps in this area. The success of the network led to the formation of the W3C Working Group MultilingualWeb-LT, which will develop standardized metadata that allows web content to interact seamlessly with language technologies. The semantic web relates to content that is easier for automated processes to interpret. In a nutshell, it is a set of representation, coupling and reasoning techniques that involve, for example, unique identifiers for language-neutral concepts and simple statements that describe resources (“P is a property of R”). One particular strength of the semantic web is its built-in capability for pulling together related facts. The Multilingual Semantic Web event in September 2012 intensified the contact between the multilingual and semantic web communities, examining the intersection of natural language processing, MT, and multilingual information and knowledge access.
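The idea of simple statements about resources, and of pulling related facts together, can be illustrated with a minimal Python sketch. It is not taken from any of the initiatives mentioned above: the example.org identifiers, the two tiny data sets and the function name `facts_about` are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical identifiers; example.org is a placeholder, not a real data set.
EX = "http://example.org/"
R = EX + "resource/R"
S = EX + "resource/S"
COLOUR = EX + "property/colour"
MADE_BY = EX + "property/madeBy"

# Each statement is a (resource, property, value) triple,
# i.e. "P is a property of R" with a value attached.
DATASET_A = [(R, COLOUR, "blue")]
DATASET_B = [(R, MADE_BY, S), (S, COLOUR, "red")]


def facts_about(resource, *datasets):
    """Collect every property/value pair stated about `resource`."""
    facts = defaultdict(list)
    for dataset in datasets:
        for subject, prop, value in dataset:
            if subject == resource:
                facts[prop].append(value)
    return dict(facts)


if __name__ == "__main__":
    # Facts about R are merged although they live in separate data sets.
    for prop, values in facts_about(R, DATASET_A, DATASET_B).items():
        print(prop, values)
```

Because both data sets use the same identifier for R, facts from separate sources can be merged without any string matching on names.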

One intersection of the multilingual web and the semantic web is the area of linked open data. With linked open data, you enter the world of huge, freely available collections of statements and facts. Most of these come straight from relational databases; a smaller share is derived through natural language processing from wiki pages and other structured or semi-structured data. The collections are highly interlinked and thus form a cloud of linked open data sets. One of these data sets includes billions of statements generated from the English Wikipedia, for example. With linked open data, you also enter the territory of connecting data sets in different languages, and of languages for which sufficient linguistic resources for statistical natural language processing do not yet exist. Overall, you end up with a scenario in which there is a need for cross-lingual processing similar to automated web-based translation as it is known to the public today. The new dimension originating from the overlap of the multilingual web with the semantic web (and linked open data) is that concepts, as opposed to strings, take center stage. Translation is no longer the single step of “What is the string in target language T?” but rather the step sequence “What is the concept identified by source language label L?” and then “What is the target language label for the concept?”
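The two-step sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept identifiers are hypothetical example.org URIs, the label table is a tiny hand-made stand-in for a linked open data set, and the function names are invented for this sketch.

```python
from typing import Optional

# Hypothetical concept identifiers, each with labels keyed by language tag.
LABELS = {
    "http://example.org/concept/dog": {"en": "dog", "de": "Hund", "fr": "chien"},
    "http://example.org/concept/house": {"en": "house", "de": "Haus", "fr": "maison"},
}


def find_concept(label: str, lang: str) -> Optional[str]:
    """Step 1: which concept does the source language label identify?"""
    wanted = label.casefold()
    for concept, labels in LABELS.items():
        if labels.get(lang, "").casefold() == wanted:
            return concept  # first match wins; real data may be ambiguous
    return None


def label_for(concept: str, lang: str) -> Optional[str]:
    """Step 2: what is the target language label for the concept?"""
    return LABELS.get(concept, {}).get(lang)


def translate_via_concept(label: str, source: str, target: str) -> Optional[str]:
    """Chain both steps; return None if either lookup fails."""
    concept = find_concept(label, source)
    if concept is None:
        return None
    return label_for(concept, target)


if __name__ == "__main__":
    print(translate_via_concept("dog", "en", "de"))
```

The point of the design is that no direct English-to-German string pair is stored anywhere. Adding a language means adding one label per concept, and a label that is missing in the target language shows up as an explicit gap rather than a wrong string.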

The language industry, academia and others are all involved in investigating and advancing the state of affairs related to language on the web. Whoever wants to win the game may want to support ongoing initiatives that address multilingual linked open data for enterprises or multilingual web language technologies.