Focus

Wikidata gets wordier

Christian Lieske

Christian Lieske is involved in SAP language technologies. He has worked with the World Wide Web Consortium (W3C) and has contributed to the European Commission’s MultilingualWeb initiative and standards such as XLIFF. He has a formal education in computer science, natural language processing and AI.

Felix Sasaki

Felix Sasaki’s field of interest is the application of web technologies for the representation and processing of multilingual information. He has worked for the W3C and DFKI on internationalization and AI. He recently joined the German publisher Cornelsen Verlag as a content architect.

Wikidata is a nonprofit knowledge base that anyone can edit and use. Because of this, the AI systems that draw on it can, to a certain degree, be shaped by anyone.

Backed by the Wikimedia Foundation, a vibrant ecosystem helps Wikidata make a mark on modern content processes. Its coverage (56 million items in April 2019), intuitive tools for end users and powerful interfaces for programmers make it a versatile tool for a large variety of usage scenarios — such as knowledge discovery, content enrichment, terminology work and translation. In autumn 2018 Wikidata enhanced its capabilities to capture information related to words, phrases and sentences in many of the world’s languages.

The galaxy

A look at the Wikipedia start page (www.wikipedia.org) and at Figure 1 shows Wikidata in the context of the Wikimedia galaxy. Often, discussions of the Wikimedia galaxy include entities that are not part of the galaxy in a strict sense. A significant example is DBpedia.

Wikidata is like Wikipedia because anyone can consume (read) or modify (write) it. The key differences from Wikipedia are that Wikidata stores information in a structured manner, whereas information in Wikipedia is stored mostly unstructured (the semi-structured infoboxes being the exception), and that there is only one Wikidata, while there are approximately 300 single-language Wikipedias.

One motivation for Wikidata is the possibility to avoid inconsistencies between single-language Wikipedias (see Figure 2). The idea is simple: entities and their properties (such as demographic data about a country) are stored only once in Wikidata. The different Wikipedia versions (for example, the German and the Chinese ones) refer to this single source of truth, for instance via special Wikipedia templates. If a fact in Wikidata changes, all referring Wikipedia articles automatically reflect this change.
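To make the single-source-of-truth idea tangible outside of Wikipedia’s template mechanism, here is a minimal sketch (assuming Python and the widely used requests library) that reads the population statements of Germany directly from Wikidata; Q183 and P1082 are the Wikidata identifiers for Germany and for population. Any consumer that reads the fact this way picks up a change in Wikidata automatically.

    import requests

    ITEM = "Q183"        # Germany
    PROPERTY = "P1082"   # population

    # Special:EntityData serves the full JSON representation of an entity.
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{ITEM}.json"
    entity = requests.get(url, timeout=30).json()["entities"][ITEM]

    # Each statement for the property carries its value in the "main snak".
    for statement in entity["claims"].get(PROPERTY, []):
        snak = statement["mainsnak"]
        if "datavalue" not in snak:
            continue  # skip "unknown value"/"no value" statements
        quantity = snak["datavalue"]["value"]  # population is a quantity
        print("population:", quantity["amount"], "rank:", statement["rank"])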

Wikidata’s license (Creative Commons CC0), governance model and collaboration opportunities hold the promise of synergies and shared quality control. It can therefore be considered a valuable tool for anyone involved in creating and processing knowledge/content. In scenarios where any knowledge can be shared, the need to operate a private knowledge infrastructure is reduced. End user tools such as Reasonator (see Figure 3) provide easy access to Wikidata information, including multimedia content such as images.

Words in Wikidata

Since day one, Wikidata items have been able to carry labels and descriptions in any number of languages. Accordingly, several general Wikidata tools for end users are related to language. Some examples are Ask Wikidata, which provides a chatbot-like interface for asking questions, and Wikidata Translate, which uses the “power of Wikidata to translate a term between two or more languages.”

Figure 2: Inconsistencies between Wikipedia in different languages. The population listed in the English version differs from the population listed in the German version.

Figure 3: Reasonator.

Additionally, the presence of ontological information (“subclass-of”) in Wikidata allows the generation of taxonomies and other knowledge organization tools.

Tools like Reasonator, Ask Wikidata, Wikidata Translate and Wikidata Taxonomy realize useful usage scenarios related to Wikidata. The full power of Wikidata, however, is only accessible via SPARQL. SPARQL is a family of W3C standards for querying and processing linked data on the Semantic Web, the concept that underpins Wikidata. The examples provided with the SPARQL query end-user interface demonstrate this and show how to work in domains such as medicine, computer science, art, history or sports. An interesting feature of the interface is the range of options for visualizing results, including tables, diagrams and, for certain types of data, timelines, maps (see Figure 4) and so on.
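To give a flavor of this, here is a minimal sketch (assuming Python and the requests library) that sends a SPARQL query to the public endpoint behind query.wikidata.org. It lists direct subclasses of “city” (Q515), the kind of building block that tools such as Wikidata Taxonomy work with; replacing wdt:P279 with the property path wdt:P279* would walk the whole subclass hierarchy instead of only the first level.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?subclass ?subclassLabel WHERE {
      ?subclass wdt:P279 wd:Q515 .   # P279 = subclass of, Q515 = city
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 25
    """

    response = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wikidata-sketch/0.1 (demo)"},
        timeout=60,
    )
    for row in response.json()["results"]["bindings"]:
        print(row["subclass"]["value"], "-", row["subclassLabel"]["value"])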

The sidebar “Wikipedia and Wikidata tools” explains how to gather information for a given item; some of its examples relate to language and conceptual knowledge.

Looking at the examples from the previous sections, one may wonder “Why in the world was there a need for enhanced lexicographic capabilities?” The brains behind the Wikibase Lexeme extension — the technology that incarnates the enhancement — put it like this: “The Wikibase Lexeme extension provides improved modeling for lexical entities such as words and phrases. While it would be theoretically possible to model these things using Items, a more expressive specialized model helps to reduce complexity, and improve re-use and mappings to other vocabularies.” A statement from Jorge Gracia (lead of the W3C Ontolex Working Group), made in the context of a seminar for the European Commission Directorate-General for Translation, on the topic “Linguistic Linked Open Data for Terminology,” captures the underlying modeling. In the linked data paradigm, any element of the lexicon can become what Gracia dubs a “first class citizen,” becoming the center “of a graph-based structure, which will allow for many other possible arrangements and views on the information. Linked Data has proved to be useful for language resources in general, particularly when it comes to terminologies and dictionaries.”

The 2018 Wikidata enhancement thus facilitates capturing information on words, phrases and sentences — for many languages, described in many languages. A major piece of this enhancement was the introduction of “lexeme” as a third so-called entity type (the existing ones being “item” and “property”). This entity type allows important features of lexemes (such as lemmas, forms or senses) to be captured easily based on the general entity type provisions for properties, qualifiers, references and so on (see Figure 5).
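A minimal sketch of what this data model looks like from a programmer’s point of view (assuming Python and the requests library): the script below downloads the JSON representation of one lexeme and prints its lemmas, its forms with their grammatical features, and its sense glosses. The lexeme ID is only a placeholder and should be replaced with any real L-number.

    import requests

    LEXEME = "L99"  # placeholder ID; substitute a real lexeme of interest

    url = f"https://www.wikidata.org/wiki/Special:EntityData/{LEXEME}.json"
    lexeme = requests.get(url, timeout=30).json()["entities"][LEXEME]

    # Lemmas are keyed by language code; the lexical category is an item ID.
    print("lemmas:", {lang: v["value"] for lang, v in lexeme["lemmas"].items()})
    print("lexical category:", lexeme["lexicalCategory"])

    for form in lexeme.get("forms", []):
        representations = [r["value"] for r in form["representations"].values()]
        print("form:", representations, "features:", form["grammaticalFeatures"])

    for sense in lexeme.get("senses", []):
        print("sense glosses:", {lang: g["value"] for lang, g in sense["glosses"].items()})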

The contribution of lexeme-related information is in full swing. Nearly 45,000 lexemes have already been created (see Figure 6), ranging from what one could call rudimentary, to advanced (see Figure 7), to stunningly rich. A query to retrieve the “biggest” lexemes yields quite a number of lexemes with information for more than 100 features covering etymology, senses, grammatical information and more.

Figure 4: SPARQL end user query interface (map with sculptures in Paris).

Figure 5: General data model (left), and data model for lexicographic data (right).

Working with Wikidata’s words

While not all general Wikidata tools recognize lexicographic information yet, some dedicated tools already do. Their categories include editing, querying and visualization. Examples are Ordia and DerDieDas: Ordia exemplifies Wikidata’s support for media objects such as graphics and sounds, while DerDieDas is a grammar game that is available for several languages.

More sample queries (especially useful for getting started with your own experiments that specifically target lexicographic information) include a bar chart of the ten languages with the most lexemes (see Figure 8) at http://w.wiki/3TF. You might also check out queries for the number of lexical entries in Wikidata; the types of lexical properties in Wikidata; example statements for a lexeme; the composition of an example lexeme; information about lexical forms for an example lexeme; and grammatical features of lexical forms for an example lexeme.

Since SPARQL is a technology for the world of web services, it can be used with any of the programming languages that are used to build web services, to create anything from single-purpose apps to powerful, flexible solutions.
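As one hedged example of such a single-purpose app, the following sketch (assuming Python and the requests library) recreates the “ten languages with the most lexemes” query mentioned above as a small script that prints the counts instead of rendering a bar chart.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?language ?languageLabel ?lexemes WHERE {
      { SELECT ?language (COUNT(?lexeme) AS ?lexemes) WHERE {
          ?lexeme a ontolex:LexicalEntry ;
                  dct:language ?language .
        } GROUP BY ?language }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    ORDER BY DESC(?lexemes)
    LIMIT 10
    """

    rows = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wikidata-lexeme-sketch/0.1 (demo)"},
        timeout=60,
    ).json()["results"]["bindings"]

    for row in rows:
        print(row["languageLabel"]["value"] + ":", row["lexemes"]["value"])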

Figure 6: Statistics on lexicographic data in Wikidata (via Ordia).

The possibilities for language-related applications drawing on Wikidata seem endless. Some ideas of what to do are:

  1. Match terminology for a domain (e.g. www.agilealliance.org/agile101/agile-glossary/ or www.scrum.org/resources/scrum-glossary) against Wikidata, for instance to find existing translations.
  2. Match a text against Wikidata (e.g. via https://tools.wmflabs.org/ordia/text-to-lexemes) to get grammatical information for the text’s tokens or to get a gut feeling for the coverage of Wikidata in a certain domain (for one or more languages).
  3. Get a list of all lexical categories, or of lexemes/senses related to a certain category, for a certain language, for example to build a list of tokens that should be processed in a particular fashion (e.g. ignored during indexing); a query sketch for this idea follows the list.
  4. Have a chatbot answer lexeme-related questions.
  5. Integrate with a programming environment for kids (see http://dalelane.co.uk/blog/?p=3524 for an example related to machine learning and AI).
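For idea 3 above, a minimal query sketch (again assuming Python and the requests library) could look as follows; it lists the lexical categories currently used for the lexemes of one language, with German (Q188) serving as the example.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?category ?categoryLabel ?lexemes WHERE {
      { SELECT ?category (COUNT(?lexeme) AS ?lexemes) WHERE {
          ?lexeme a ontolex:LexicalEntry ;
                  dct:language wd:Q188 ;               # Q188 = German
                  wikibase:lexicalCategory ?category .
        } GROUP BY ?category }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    ORDER BY DESC(?lexemes)
    """

    rows = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wikidata-lexeme-sketch/0.1 (demo)"},
        timeout=60,
    ).json()["results"]["bindings"]

    for row in rows:
        print(row["categoryLabel"]["value"] + ":", row["lexemes"]["value"])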

Where to?

Again, Wikidata has many application areas. Thus, it does not come as a surprise that a search in arXiv (a repository of preprints of scientific papers) provides some hints on machine learning and AI areas in which the use of Wikidata is already being investigated. Examples include human-bot collaboration and generating Wikipedia summaries for underserved languages.

Figure 7: Sample lexemes (various lexical categories) in selected languages.

Figure 8: Lexicographic information by language, generated via query.wikidata.org.

One Wikidata area that is picking up speed is related to import and mapping. Wikidata allows users to “map” data sets (see the Wikidata “Data Import Guide,” step 8). Among other things, this enables automatic content enrichment. As an example, Wikidata contains identifiers of the Gemeinsame Normdatei (GND) and links them to Wikipedia. The GND identifier for an author can thus be linked to the author’s biography in Wikipedia. Discussions around this touch on terminology and emoji, for example. Wikidata already relates to terminological artifacts such as ISO 12620, ISOcat and its successor DatCatInfo. An example is www.wikidata.org/wiki/Property:P2263.
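A minimal sketch of the GND-based enrichment path mentioned above (assuming Python and the requests library): starting from a GND identifier, the query below finds the matching Wikidata item and, if present, its English Wikipedia article. P227 is the Wikidata property for GND IDs; the identifier value in the query is only an example and should be replaced.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?item ?itemLabel ?article WHERE {
      ?item wdt:P227 "118540238" .          # P227 = GND ID (example value)
      OPTIONAL {
        ?article schema:about ?item ;
                 schema:isPartOf <https://en.wikipedia.org/> .
      }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    rows = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wikidata-gnd-sketch/0.1 (demo)"},
        timeout=60,
    ).json()["results"]["bindings"]

    for row in rows:
        print(row["itemLabel"]["value"],
              row.get("article", {}).get("value", "(no English Wikipedia article)"))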

Within the Unicode Consortium, a discussion has been started about using the Wikidata numbering system (“QID”) to create a system of emoji encoding that lies outside core Unicode regulation (see http://twitter.com/jenny8lee/status/1123335017919336451 and www.unicode.org/L2/L2019/19082-qid-emoji.pdf).

Interested constituencies and individuals could become active in Wikidata in a number of ways. For example, they could systematically integrate data categories relevant to a certain domain into Wikidata, or adapt existing Wikidata data categories to the needs of that domain. Perhaps they might systematize the mapping between Wikidata properties and domain-specific data categories. Or perhaps they could explain the added value of mapping for a certain domain (for example, access to multimedia assets). The possibilities are numerous, and as varied as the language industry itself.