Wikidata is a nonprofit knowledge base that anyone can edit and use. Because of this, AI can be shaped to a certain degree by anyone.
Wikidata is like Wikipedia because anyone can consume (read) or modify (write) it. The key differences from Wikipedia are that Wikidata stores information in a structured manner, while information in Wikipedia is stored mostly unstructured — the semi-structured info boxes are the exceptions to this. Additionally, there is only one Wikidata, while there are approximately 300 single-language Wikipedias.
One motivation for Wikidata is to avoid inconsistencies between single-language Wikipedias (see Figure 2). The idea is simple: entities and their properties (such as demographic data about a country) are stored only once, in Wikidata. The different Wikipedia versions (for example, the German and the Chinese one) refer to this single source of truth, for example via special Wikipedia templates. If a fact in Wikidata changes, all referring Wikipedia articles automatically reflect this change.
Wikidata’s license (Creative Commons CC0), governance model and collaboration opportunities hold the promise of synergies and shared quality control. It can therefore be considered a valuable tool for anyone involved in creating and processing knowledge/content. In scenarios where any knowledge can be shared, the need to operate a private knowledge infrastructure is reduced. End user tools such as Reasonator (see Figure 3) provide easy access to Wikidata information, including multimedia content such as images.
Tools like Reasonator, Ask Wikidata, Wikidata Translate and Wikidata Taxonomy cover useful scenarios for working with Wikidata. The full power of Wikidata, however, is only accessible via SPARQL. SPARQL is a family of standards for querying and updating linked data on the Semantic Web, a concept that underpins Wikidata. The examples for the SPARQL query end user interface demonstrate this, and show how to work in domains such as medicine, computer science, art, history or sports. An interesting feature of the interface is the different options for visualizing results, including tables, diagrams, timelines (for certain types of data), maps (see Figure 4) and so on.
“Wikipedia and Wikidata tools” explains how to gather information for a given item. Some examples related to language, which include conceptual knowledge, are:
- Taxonomy (via SPARQL) – all subclasses of computer science
- Taxonomy (via “Wikidata Taxonomy”) – all subclasses of Knowledge Organization System
- Items in context – computer science and its superclasses
- Domain specific word lists – labels for diseases
- Multilingual word lists (1) – labels for diseases in English and German
- Multilingual word lists (2) – translations of the term tuberculosis
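The first example in the list above (all subclasses of computer science) can be sketched in Python as follows. This is a minimal sketch, not the query behind the article's examples. It assumes the public Wikidata Query Service endpoint, the property P279 ("subclass of") and the item Q21198 (computer science); the function names are made up for illustration. The network call is defined but not executed, so the query builder and the result parser can be tried offline.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def subclass_query(class_qid: str, language: str = "en", limit: int = 50) -> str:
    """Build a SPARQL query for direct and indirect subclasses of a class."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P279+ wd:{class_qid} .
  ?item rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "{language}")
}}
LIMIT {limit}
""".strip()


def rows(sparql_json: dict) -> list[dict]:
    """Flatten the standard SPARQL JSON result format into plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in sparql_json["results"]["bindings"]
    ]


def run_query(query: str) -> list[dict]:
    """Send a query to the endpoint (needs network access; not called here)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WDQS_ENDPOINT + "?" + params,
        headers={"User-Agent": "wikidata-article-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return rows(json.load(response))


# Q21198 is assumed to be the item for computer science.
query = subclass_query("Q21198")
print(query)
```

Swapping in another class ID and language code would give domain-specific or multilingual word lists in the same way; calling `run_query(query)` would return one dictionary per result row.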
Looking at the examples from the previous sections, one may wonder “Why in the world was there a need for enhanced lexicographic capabilities?” The brains behind the Wikibase Lexeme extension — the technology that incarnates the enhancement — put it like this: “The Wikibase Lexeme extension provides improved modeling for lexical entities such as words and phrases. While it would be theoretically possible to model these things using Items, a more expressive specialized model helps to reduce complexity, and improve re-use and mappings to other vocabularies.” A statement from Jorge Garcia (lead of the W3C Ontolex Working Group), made in the context of a seminar for the European Commission Directorate-General for Translation, on the topic “Linguistic Linked Open Data for Terminology,” captures the underlying modeling. In the linked data paradigm, any element of the lexicon can become what Garcia dubs a “first class citizen,” becoming the center “of a graph-based structure, which will allow for many other possible arrangements and views on the information. Linked Data has proved to be useful for language resources in general, particularly when it comes to terminologies and dictionaries.”
The 2018 Wikidata enhancement thus facilitates capturing information on words, phrases and sentences — for many languages, described in many languages. A major piece of this enhancement was the introduction of “lexeme” as a third so-called entity type (the existing ones being “item” and “property”). This entity type allows important features of lexemes (such as lemmas, forms or senses) to be captured easily based on the general entity type provisions for properties, qualifiers, references and so on (see Figure 5).
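To make the shape of a lexeme more concrete, here is a rough Python sketch of the structure described above: a lexeme with lemmas, a language, a lexical category, forms and senses. The class and field names are illustrative only and are not the Wikibase Lexeme API. Statements, qualifiers and references are left out, and plain words stand in for the item IDs that Wikidata would actually store.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    # Language code -> spelling of this form.
    representations: dict[str, str]
    # Grammatical features; Wikidata stores item IDs, plain words are used here.
    grammatical_features: list[str] = field(default_factory=list)


@dataclass
class Sense:
    # Language code -> short gloss describing this meaning.
    glosses: dict[str, str]


@dataclass
class Lexeme:
    lemmas: dict[str, str]   # language code -> lemma
    language: str            # stand-in for the language item
    lexical_category: str    # stand-in for the category item (noun, verb, ...)
    forms: list[Form] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)


# A hand-written illustration, not data copied from Wikidata.
book = Lexeme(
    lemmas={"en": "book"},
    language="English",
    lexical_category="noun",
    forms=[
        Form({"en": "book"}, ["singular"]),
        Form({"en": "books"}, ["plural"]),
    ],
    senses=[Sense({"en": "written or printed work", "de": "Buch"})],
)

plural_forms = [
    f.representations["en"] for f in book.forms if "plural" in f.grammatical_features
]
print(plural_forms)
```

Note how the sense is glossed in more than one language: this is what "for many languages, described in many languages" amounts to in the model.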
The contribution of lexeme-related information is in full swing. Nearly 45,000 lexemes exist (see Figure 6), ranging from what one could call rudimentary, to advanced (see Figure 7), to stunningly rich. A query to retrieve the “biggest” lexemes yields quite a number of lexemes with information for more than 100 features covering etymology, senses, grammatical information and more.
Figure 5: General data model (left), and data model for lexicographic data (right).
More sample queries (especially useful for getting started with your own experiments that specifically target lexicographic information) include a bar chart of the ten languages with the most lexemes (see Figure 8) at http://w.wiki/3TF. You might also check out the number of lexical entries in Wikidata; the types of lexical properties in Wikidata; example statements for a lexeme; composition of an example lexeme; information about lexical forms for an example lexeme; and grammatical features of lexical forms for an example lexeme.
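A query of the "languages with the most lexemes" kind could look roughly like the sketch below. It is not necessarily the query behind the short link above. It assumes the `ontolex:LexicalEntry` class and the `dct:language` predicate that the Wikidata Query Service uses for lexemes, plus its label service. The helper name and the sample counts are invented so the parsing step can be tried without network access.

```python
# Assumed query: count lexemes per language and keep the ten largest groups.
LEXEMES_PER_LANGUAGE = """
SELECT ?language ?languageLabel (COUNT(?lexeme) AS ?count) WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          dct:language ?language .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?language ?languageLabel
ORDER BY DESC(?count)
LIMIT 10
""".strip()


def top_languages(sparql_json: dict) -> list[tuple[str, int]]:
    """Turn SPARQL JSON bindings into (language label, lexeme count) pairs."""
    pairs = [
        (b["languageLabel"]["value"], int(b["count"]["value"]))
        for b in sparql_json["results"]["bindings"]
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up response in the standard SPARQL JSON format (not real counts).
sample_response = {
    "results": {
        "bindings": [
            {"languageLabel": {"value": "Language B"}, "count": {"value": "12"}},
            {"languageLabel": {"value": "Language A"}, "count": {"value": "30"}},
        ]
    }
}

for label, count in top_languages(sample_response):
    print(f"{label}: {count}")
```

The resulting pairs can be fed into any charting library to reproduce a bar chart locally.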
Since SPARQL is a technology for the world of web services, it can be used with any of the programming languages that are used to build web services, to create anything from single-purpose apps to powerful, flexible solutions. Possible applications include:
- Match terminology for a domain (e.g. www.agilealliance.org/agile101/agile-glossary/ or www.scrum.org/resources/scrum-glossary) against Wikidata to find existing translations, for instance.
- Match a text against Wikidata (e.g. via https://tools.wmflabs.org/ordia/text-to-lexemes) to get grammatical information for the text’s tokens, or to get a gut feeling for the coverage of Wikidata in a certain domain (for one or more languages).
- Get a list of all lexical categories or lexemes/senses related to a certain category for a certain language, for example to build a list of tokens that should be processed in a particular fashion (e.g. ignored during indexing).
- Have a chatbot answer lexeme-related questions.
- Integrate with a programming environment for kids (see http://dalelane.co.uk/blog/?p=3524 for an example related to machine learning and AI).
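As an illustration of the first two ideas in this list, the sketch below builds a query that looks up glossary terms among lexeme lemmas and then reports which terms were found. It assumes the `wikibase:lemma` and `wikibase:lexicalCategory` predicates of the Wikidata Query Service and that lemmas carry a plain language tag such as "en". The function names, the sample glossary and the sample response are hypothetical.

```python
def escape(term: str) -> str:
    """Escape backslashes and double quotes for use in a SPARQL string literal."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def lemma_lookup_query(terms: list[str], language_code: str = "en") -> str:
    """Build a query that finds lexemes whose lemma equals one of the terms."""
    values = " ".join(f'"{escape(t)}"@{language_code}' for t in terms)
    return f"""
SELECT ?lexeme ?lemma ?categoryLabel WHERE {{
  VALUES ?lemma {{ {values} }}
  ?lexeme wikibase:lemma ?lemma ;
          wikibase:lexicalCategory ?category .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


def coverage(terms: list[str], sparql_json: dict) -> tuple[list[str], list[str]]:
    """Split the terms into those found as lemmas and those still missing."""
    found = {b["lemma"]["value"].lower() for b in sparql_json["results"]["bindings"]}
    matched = [t for t in terms if t.lower() in found]
    missing = [t for t in terms if t.lower() not in found]
    return matched, missing


# Hypothetical glossary and a made-up response, for offline experimentation.
glossary = ["sprint", "backlog", "velocity"]
sample_response = {
    "results": {"bindings": [{"lemma": {"value": "sprint"}}, {"lemma": {"value": "velocity"}}]}
}

print(lemma_lookup_query(glossary))
matched, missing = coverage(glossary, sample_response)
print(f"matched: {matched}, missing: {missing}")
```

The ratio of matched to missing terms gives a first, rough impression of how well a domain is covered; multi-word terms and spelling variants would need extra handling.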
Within the Unicode Consortium, a discussion has started about using the Wikidata numbering system (“QID”) to create a system of emoji encoding that lies outside core Unicode regulation (see http://twitter.com/jenny8lee/status/1123335017919336451 and www.unicode.org/L2/L2019/19082-qid-emoji.pdf).
Interested constituencies and individuals could become active in Wikidata in a number of ways. For example, they could systematically integrate data categories relevant to a certain domain into Wikidata, or adapt existing Wikidata data categories to the needs of that domain. Perhaps they might systemize the mapping between Wikidata properties and domain-specific data categories. Or perhaps they could explain the added value of mapping for a certain domain (for example, access to multimedia assets). The possibilities are numerous, and as varied as the language industry itself.