Every translator knows that languages tend to surpass the merely complex and wander into the wilder terrain of the complicated. If languages conformed to the rules of statement logic, then translation would be a snap. Grammar, spelling and meaning would not present the difficulties that all language professionals must struggle to master. If such logic exists, it is very coy about revealing itself. Instead, we all must grapple with the task of rendering source languages accurately in target languages within cultural context. Add the power of intelligent computing along with the multilingual mix of today’s globalized culture, and the challenges multiply. At the heart of this rich mix of linguistics, technology and applied science, PanLex is rising to these challenges with the startling ambition, in its own words, of harnessing the power of “extreme localization.”
PanLex is undertaking an ambitious project with a vision to enable “panlingual globalization” by providing a database of “expressions” in the form of word groups for all the world’s languages. The concurrent availability of an equivalent word or expression in all languages subtly shifts the focus of translation from process to a state of parallel existence. In the words of PanLex’s vision statement, “any language is translatable into any other language.” According to its theory, this is achievable across the world’s 7,000 languages using “pairs of expressions,” of which its database already contains more than one billion, and the number is growing.
This monumental task clearly requires scrupulous organization and control. First, existing data, lexemes — words and phrases — are acquired for each language from a variety of sources such as dictionaries, glossaries and thesauri. The origins of this data are recorded and acknowledged, and the resulting content is gathered into the database. Essentially these are collections of vocabulary without syntax. So where does extreme localization come in? What is distinctive about PanLex is its coverage. For translators of well-documented languages such as German and Spanish, PanLex is yet another resource documenting the semantic equivalence of expressions, though it does not document in much detail the conditions qualifying such equivalence. The “extreme localization” that PanLex supports is extreme in how many languages it supports.
If we look at the translation process through the eyes of a database designer, it could be seen as reducing a one-to-many relationship to a one-to-one relationship. When translators work on a text, they exercise highly educated choices in rendering the source into the target. One source lexeme; one target lexeme. What PanLex offers is the more dynamic relationship of any-to-any, allowing lexemes in different languages to form a symmetrical relationship, which cleverly accommodates variety. In itself, this is a powerful translating apparatus. However, it can be taken a step further by allowing inferences to be made for missing translations, a necessary facility when organizing vast amounts of data. As PanLex calculates it, 7,000 source languages × 100,000 words in each × 7,000 target languages = 5 trillion translations.
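The inference step can be pictured as a walk over a graph of attested translation pairs: if an English lemma links to a German one, and that German lemma links to a Spanish one, a Spanish candidate can be inferred even when no direct English–Spanish pair exists. The following is a minimal sketch of that idea, not PanLex’s actual algorithm or API; the sample pairs and the `translate` function are invented for illustration.

```python
# Minimal sketch of any-to-any lookup with hop-based inference.
# The data and function names are illustrative, not PanLex's API.
from collections import defaultdict

# Attested translation pairs: (language, lemma) <-> (language, lemma)
pairs = [
    (("eng", "house"), ("deu", "Haus")),
    (("deu", "Haus"), ("spa", "casa")),
]

graph = defaultdict(set)
for a, b in pairs:
    graph[a].add(b)
    graph[b].add(a)  # the relationship is symmetrical

def translate(expr, target_lang, max_hops=2):
    """Return candidate lemmas in target_lang within max_hops links."""
    frontier, seen, results = {expr}, {expr}, set()
    for _ in range(max_hops):
        nxt = set()
        for node in frontier:
            for lang, lemma in graph[node]:
                if (lang, lemma) in seen:
                    continue
                seen.add((lang, lemma))
                if lang == target_lang:
                    results.add(lemma)
                nxt.add((lang, lemma))
        frontier = nxt
    return results

# Direct, attested pair:
print(translate(("eng", "house"), "deu"))  # {'Haus'}
# Inferred via the German pivot; no direct eng-spa pair exists:
print(translate(("eng", "house"), "spa"))  # {'casa'}
```

In a real system, multi-hop inference must also guard against sense drift (a pivot lemma with two meanings can yield wrong candidates), which is why projects such as PanDictionary used probabilistic inference rather than a plain graph walk.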
At the sharp end of the translation stick, a translator works first and foremost by processing data. But what data process should be used? PanLex uses lemmatization. Thanks to computational linguistics and the advent of search engines such as Google, lemmatization, as opposed to stemming, has become the object of serious and productive work. Whereas stemming uses heuristics, rules of thumb, to reduce the number of inflectional forms of words with similar meanings (democracy, democratic, democratization), lemmatization uses the base, dictionary or reference form of a word. Lemmatization is a more rigorous approach to morphological analysis, and hence PanLex manages to achieve broader language coverage than other related database projects: lemmatic data tends to be available for more languages than data on inflectional paradigms, enabling PanLex’s database of lemmatic expressions grouped by meaning to cover the largest possible set of languages. The PanLex project focuses on procuring and maintaining data, a monumental task in itself. Researchers and developers have live access through an API and through monthly generated snapshots.
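The contrast between the two approaches can be shown with toy implementations. Both functions below are deliberately crude illustrations of the idea, not the tools PanLex or any production system uses; the suffix list and lemma table are invented for the example.

```python
# Contrast stemming (heuristic suffix stripping) with lemmatization
# (lookup against dictionary citation forms). Both are toy versions.

def stem(word):
    """Crude heuristic stemmer: strip the first matching suffix."""
    for suffix in ("ization", "ation", "atic", "ing", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A lemmatizer instead maps each inflected form to its reference form.
LEMMAS = {
    "running": "run",
    "ran": "run",
    "democracies": "democracy",
}

def lemmatize(word):
    return LEMMAS.get(word, word)

print(stem("running"))       # 'runn' -- a truncated stem, not a word
print(lemmatize("running"))  # 'run'  -- the dictionary citation form
```

The practical consequence for PanLex is the one noted above: a bilingual word list already records citation forms, so lemmatic data can be harvested even for languages whose inflectional morphology has never been formally described.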
The project started in 2004 and over the years passed through different homes and objectives before arriving at its current status with The Long Now Foundation. Through all the changes there was always one constant: the man who started it all, Jonathan Pool. A Harvard alumnus with a PhD in political science from the University of Chicago, Pool is a successful entrepreneur who used his own money to support his vision of developing an infrastructure for interaction among all the languages of the world in order to break communication barriers.
In 2004 Pool established Utilika Foundation and financed it by donating the assets of his then business Centerplex in the form of land and buildings. Centerplex, established in 1990, was an innovative company offering its building space for rental to small local businesses and providing them with all the latest technology and tenant-friendly terms. When the real estate industry changed drastically, making Centerplex’s model difficult to sustain, Pool decided to move on and put all his energy into realizing an even more ambitious vision, “universal interactivity.” Pool was joined in the Foundation by several like-minded people, namely Emily Bender, professor of linguistics at the University of Washington, as well as Christine Evans and William D. Lewis, also affiliated with the University of Washington. Together they sought to advance communication and collaboration among diverse human and artificial agents by means of applied research on languages. Pool sought various partners before entering into a support agreement with the Turing Center, a multidisciplinary center at the University of Washington established in May 2005 with a multimillion-dollar gift from the Utilika Foundation, augmented by federal research grants and contracts as well as support from corporations and other private foundations. The Turing Center, under the directorship of artificial-intelligence pioneer Oren Etzioni, has undertaken many research endeavors, all of which can be reviewed in its numerous publications. The center’s research produced a lexical database, TransGraph, designed to support panlingual translation, and a more powerful extension, PanDictionary, based on intelligent automated inference. Another successful project was PanImages, now defunct, which allowed users to input online search arguments for images in their native language.
However, instead of restricting searches to images labeled in that language, PanImages translated the search terms and returned results for images labeled in other languages as well. The power such projects give users to enjoy the riches of our vast web life speaks for itself.
By 2009, Utilika Foundation and the Turing Center had determined that although the research had demonstrated the value of the project, its future depended on enlarging the database to cover as many languages as possible. This new objective diverged from the Turing Center’s primary research goal, and Pool decided to search for a new partner.
Pool soon discovered The Rosetta Project, which has a similar goal: a global collaboration of language specialists and native speakers working to build a publicly accessible digital library of human languages. Utilika Foundation and The Long Now Foundation agreed to join forces, and The Long Now Foundation is currently the fiscal sponsor of the PanLex Project. This has relieved the project of some administrative and compliance obligations and provided more public exposure. Utilika Foundation transferred its assets to The Long Now Foundation to support the project and has since dissolved.
The PanLex Project pursues its mission with a small team of programmer-linguists in Berkeley, California, headed by linguist David Kamholz and supported by select volunteers around the world. These volunteers discover manuscripts and small-run publications containing vocabularies of little-known languages. When no such documents exist, volunteers elicit lists of words from native speakers and make them available to the project. The project also employs summer interns studying computational linguistics, who receive training in its techniques while helping turn the PanLex vision into reality.
In principle, the task of putting new data into the database is simple, but in practice it is beset with complications. Sourcing accurate lexical translations among thousands of languages and dialects is a big enough challenge. Designing computational methods to make the sources accessible and efficient is another. For example, a source entry for the Tamil word பிரிந்துவிடு translated as “be estranged, divided, at a distance, isolated, cut off” makes sense to a human reader but is incomplete for an automated system, which will mishandle the entry because of the ambiguous translation. The translation should instead be the repetitive but accurate “be estranged, be divided, be at a distance, be isolated, be cut off.” The PanLex team of editors deals with these complexities by assigning a quality rating to every content source and ensuring that sources meet a number of criteria, such as comprehensiveness, tractability, availability and high quality. These criteria can conflict, so editors must exercise their own judgment; for example, online sources may be tractable but of low quality. PanLex has experimented during the last few years with finding the most efficient ways of acquiring and adding new data. Despite the project’s effort to prioritize low-density languages, there is a large difference in language coverage within the database.
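The kind of normalization the Tamil example calls for can be sketched mechanically: split the elliptical gloss and restore the shared head verb to each alternative. The rule below is a simplification of the editorial judgment actually involved, and the function name is invented for illustration.

```python
# Sketch of expanding an elliptical gloss so each alternative is a
# self-contained expression an automated system can match on.

def expand_gloss(gloss, shared_head="be "):
    """Split a comma-separated gloss and restore the shared head verb."""
    parts = [p.strip() for p in gloss.split(",")]
    return [p if p.startswith(shared_head) else shared_head + p
            for p in parts]

gloss = "be estranged, divided, at a distance, isolated, cut off"
print(expand_gloss(gloss))
# ['be estranged', 'be divided', 'be at a distance',
#  'be isolated', 'be cut off']
```

Real glosses are far messier (the shared head may be a particle, an object, or absent entirely), which is why PanLex relies on human editors rather than a fixed rule.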
Even with all these complexities, the PanLex Project and its future applications remain a vision worth pursuing. Pool today believes that “although the available resources may not permit the full realization of the project’s objective, a more pragmatic version of the goal is to develop the size and quality of the database far enough to give it undeniable value for machine translation, search engines and language revitalization.”
From localization to internationalization to globalization to universal translation, the development path that the multilingual community is following is dizzying. The enabling technologies, methodologies and business applications that manage them form a multidisciplinary cluster of practices that demand cross-fertilization in order to be productive. As we rush pell-mell into a new, networked web of intelligent technologies, the challenges facing us demand enterprise, zeal and colossal resolve. PanLex is a stellar example of the initiative and ambition we need to survive in a world of seamless, instant communication.