The term content analytics refers to a family of technologies that extract and process information from unstructured content. A typical application scenario of content analytics is sentiment analysis of tweets: determining, for example, whether the tweets about a product are positive or negative.
In addition to natural language processing (NLP) technologies such as entity extraction, large data sets (such as a catalogue with product names, or English adjectives with a positive connotation) are important ingredients of content analytics. Freely available data sets are therefore continuously rising in importance, especially those that are easy to use because they comply with the linked data family of standards from the World Wide Web Consortium (W3C). The overall goal is to build a cloud of highly usable, interconnected data sets: a linked open data cloud.
For anyone working on multilingual content creation, provisioning, curation and so on, content analytics can enable new or enhanced solutions and revenue opportunities related, for example, to analysis, enrichment/generation and linking of content.
The automated analysis of large content collections in real time (real-time big data analytics) is already a reality. Applications now detect product defects by analyzing social media content such as tweets like “product X stopped working for me after only two days of use.”
The next evolutionary step is currently being taken. On the one hand, more and more content is captured in standardized, extensible representations. In the realm of user assistance for software, for example, XML-based representations such as the Darwin Information Typing Architecture (DITA) or HTML5-based representations are being used. This enables content to be easily analyzed, repurposed, enriched or linked. On the other hand, an ever-growing number of standards-based knowledge sources and data sets are being put onto the internet and into the public domain. They can thus serve, for example, as the basis for automatic content enrichment and automatic information linking. The most relevant technological basis for representing and processing knowledge sources in a standardized form is referred to as Linked Open Data.
Social media content such as tweets like “product X stopped working for me after only two days of use” and “button X does not work” can provide important information to both consumers and the producers of the product. Content analytics extracts raw information from this type of content and submits it to additional processing, such as automatic reasoning that draws conclusions.
One content analytics technology that has already gained a lot of visibility is sentiment analysis. It generates information such as the finding that 15% of all tweets related to product X contain negative reviews. Business intelligence solutions use country-specific, locale-specific and language-specific sentiment analysis to find out whether opinions related to products or services differ between countries, regions or markets.
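To make the idea concrete, here is a minimal, purely illustrative Python sketch of lexicon-based sentiment classification. The word lists and tweets are invented; production systems use far richer, language- and locale-specific resources.

```python
# Minimal, lexicon-based sentiment sketch; the word lists and tweets are
# invented for illustration and far smaller than a production lexicon.
NEGATIVE = {"stopped", "broken", "defect", "disappointing", "worst"}
POSITIVE = {"great", "love", "excellent", "reliable", "works"}

def classify(tweet: str) -> str:
    words = {w.strip(".,!?").lower() for w in tweet.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = [
    "Product X stopped working for me after only two days of use",
    "I love product X, works great",
    "Product X arrived yesterday",
]
negative_share = sum(classify(t) == "negative" for t in tweets) / len(tweets)
print(f"{negative_share:.0%} of tweets about product X are negative")
```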
Technological foundations and examples of content analytics
The basis of content analytics is a set of technologies from NLP. An important sub-area of NLP deals with the so-called entities mentioned in a text: things (such as the capital of Germany), actions (such as writing a novel) and so on. State-of-the-art NLP provides various capabilities related to entities. It can recognize names for entities (Figure 1), find relationships (Figure 2), link entities (Figure 3) and disambiguate entities (Figure 4). A functionality such as named entity recognition receives a string of text as input and produces as output a list of words or expressions that are presumed to denote entities.
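As an illustration of this input/output behavior, the following Python sketch runs named entity recognition with the spaCy library (one NLP toolkit among many; the article does not prescribe a particular tool, and the model name assumes spaCy's small English model has been installed).

```python
# Named entity recognition with spaCy (one of many NLP toolkits).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is the capital of Germany. Goethe wrote a novel there.")

# Each recognized entity comes with the matched text span and a type label.
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Berlin GPE", "Goethe PERSON"
```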
One of the tools in the realm of content analytics is DBpedia Spotlight. It offers, among other things, named entity recognition (NER), named entity linking and named entity disambiguation. The databases behind DBpedia Spotlight are huge data sets that constitute collections of facts (see Figure 5). These language-specific data sets, the so-called DBpedias, are generated automatically with NLP from the corresponding language-specific Wikipedia editions. Thus, there is a German DBpedia generated from the German Wikipedia, and an English DBpedia generated from the English Wikipedia.
DBpedia can be used in various ways, ranging from tools with graphical user interfaces (see Figures 6 and 7) to application programming interfaces (APIs).
APIs such as those of DBpedia are often the key to powerful and user-friendly application scenarios. A video at www.youtube.com/watch?v=F6zIW6blF5k shows how to automatically generate standardized metadata by coupling DBpedia Spotlight with the oXygen editor (see Figure 8 for an example).
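As a hedged sketch of such API use, the following Python snippet sends a text to the DBpedia Spotlight annotation web service. The endpoint URL and parameters reflect the public demo service and may differ for self-hosted installations.

```python
# Calling the DBpedia Spotlight web service for named entity recognition,
# linking and disambiguation. The endpoint URL and parameters below reflect
# the public demo service and may differ for self-hosted installations.
import requests

response = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "Berlin is the capital of Germany.", "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# Each annotation links a surface form in the text to a DBpedia resource URI.
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])
```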
The power of content analytics tools depends, among other things, on the data set used. Consider DBpedia, for example. Since the English Wikipedia is much larger than the German one, working with German Wikipedia content yields different results and annotations than working with English Wikipedia content.
DBpedia is a prototypical example of a class of data sources that have become known on the web as open data. Open in this context refers to the licensing model (such as one of the Creative Commons licenses) that applies. Other data sources of particular interest to content analytics are Yago, Wikidata and BabelNet (see Figure 9, which shows Babelfy, an application built on top of BabelNet). These go far beyond simple lists of words. Rather, they are collections of facts and assertions similar to the subject-predicate-object statements, like “Paris is the capital of France,” that we know from grammar textbooks. A special feature of these data sources is that the facts themselves very often form implicit networks. For example, all of the facts that include Paris (either as a subject or as an object) can be conceived as being implicitly related to each other. By accessing a data source of this type, content analytics can thus obtain information on entities, entity classes, entity relationships and so on. The linguistically rich data sources provide information such as (preferred) names of entities in different languages, spelling variants and so on (Figure 10).
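The following minimal Python sketch illustrates the idea of facts as subject-predicate-object statements and of the implicit network around an entity. The facts themselves are invented for illustration.

```python
# A minimal sketch of facts as subject-predicate-object statements;
# the facts are invented for illustration.
facts = [
    ("Paris", "is capital of", "France"),
    ("France", "is located in", "Europe"),
    ("Victor Hugo", "was born in", "Besançon"),
    ("Victor Hugo", "died in", "Paris"),
]

# All facts that mention Paris, as subject or object, form an implicit
# network around that entity.
paris_facts = [f for f in facts if "Paris" in (f[0], f[2])]
for subject, predicate, obj in paris_facts:
    print(subject, predicate, obj)
```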
DBpedia and many other open data sources often use a standardized technology stack that is referred to as Linked Data. At the heart of Linked Data are common internet technologies such as the HTTP protocol, together with two general principles. The first principle addresses a standardized data representation via the Resource Description Framework (RDF), while the second one addresses standardized access via SPARQL as a query language and HTTP uniform resource identifiers as names for things on the web. These principles allow data sets to be used without special proprietary processing, for example by content analytics tools. They also make it easy to link data sources to the Linked Data cloud (see http://lod-cloud.net). For content analytics, one part of the Linked Data cloud is especially relevant: the Linguistic Linked Open Data cloud, which represents encyclopedias, collections of texts, lexical databases and so on as Linked Data. A visualization of the Linguistic Linked Open Data cloud is available at http://linguistics.okfn.org/resources/llod.
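As a sketch of standardized access under these principles, the following Python snippet queries the public DBpedia SPARQL endpoint with the SPARQLWrapper library. The endpoint URL and query are illustrative assumptions, not part of the original article.

```python
# Querying the public DBpedia SPARQL endpoint for labels of the resource
# that identifies Paris; a minimal sketch of standardized Linked Data access.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
      <http://dbpedia.org/resource/Paris> rdfs:label ?label .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```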
Text processing and content analytics
Thanks to content analytics, content can be curated, blended and merged in various ways, especially if the content uses standardized, structured formats such as DITA, DocBook, ePub or HTML5. One example of such processing is the automatic generation of metadata and markup mentioned previously, such as hyperlinks. Another example is the use of named entity recognition for cross-checks against a terminology database.
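The following hypothetical Python sketch illustrates such a cross-check; the terminology list and the extracted names are invented for illustration.

```python
# Hypothetical sketch: cross-checking recognized entity names against an
# approved terminology list. The term list and extracted names are invented.
APPROVED_TERMS = {"DITA", "DocBook", "ePub", "HTML5"}

def check_terms(recognized_entities):
    """Return entity names that do not appear in the terminology database."""
    return [name for name in recognized_entities if name not in APPROVED_TERMS]

# Names as they might come out of a named entity recognition step.
extracted = ["DITA", "Docbook", "HTML5"]
print(check_terms(extracted))   # ['Docbook'] -- flagged for review
```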
The Unicode Localization Interoperability (ULI) Technical Committee was established in 2011 with the goal of helping to ensure interoperable data interchange of critical localization-related assets such as translation memories. ULI’s work is relevant to speech and natural language processing, analytics and more. What ULI is building contributes to two foundations of many technologies related to human languages: tokenization and segmentation. In 2014, ULI and DBpedia committers started to investigate how DBpedia data could help to improve segmentation. The results are promising (see Figure 11).
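The sketch below illustrates, in a heavily simplified way, how a lexicon of entity names containing periods could prevent false sentence breaks. It is an assumption-laden illustration of the general idea only and does not reproduce the actual ULI/DBpedia work.

```python
# Purely illustrative sketch of using an entity lexicon to improve sentence
# segmentation; it does not reproduce the actual ULI/DBpedia approach.
import re

# Names containing periods, e.g. harvested from DBpedia labels, that must
# not trigger a sentence break (the entries here are invented examples).
DOTTED_NAMES = {"St. Louis", "U.S. Steel"}

def segment(text: str):
    # Protect known dotted names before splitting on sentence-final periods.
    protected = text
    for name in DOTTED_NAMES:
        protected = protected.replace(name, name.replace(". ", "<DOT> "))
    sentences = re.split(r"(?<=\.)\s+", protected)
    return [s.replace("<DOT> ", ". ") for s in sentences]

print(segment("The plant in St. Louis closed. Production moved abroad."))
# ['The plant in St. Louis closed.', 'Production moved abroad.']
```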
Participating in content analytics
For content creators, the use of content analytics can generate additional revenue opportunities. Content becomes raw material for various stages in content production value chains, especially in the context of content-related automation.
One example is the automatic generation of structured, so-called schema.org markup information. This information is interpreted by search engines and increases the visibility of content on the web.
By using content analytics tools, content creators can thus offer their customers not only the raw content but also its optimization for search engines.
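As a minimal sketch, assuming extracted product information is already available, the following Python snippet turns it into schema.org markup serialized as JSON-LD; the product data is invented for illustration.

```python
# Hedged sketch: turning extracted entity information into schema.org
# markup as JSON-LD. The product data is invented for illustration.
import json

def product_jsonld(name, description):
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "description": description,
    }

markup = product_jsonld("Product X", "A compact widget for everyday use.")
# Embedded in a page as <script type="application/ld+json">...</script>,
# this markup can be interpreted by search engines.
print(json.dumps(markup, indent=2))
```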
The content analytics train is currently picking up speed. One way not to miss it, and even to control it, is active contribution to relevant topics. An excellent way to do this is through the LIDER project, which stands for “Linked Data as an enabler of cross-media and multilingual content analytics for enterprises across Europe.” LIDER investigates the question “Which application scenarios for Linked Data exist?” Furthermore, the project investigates which dimensions are relevant for the industrial use of open data (with respect to licenses, for example) and which monetization models for open data already exist or are conceivable.
LIDER discusses these and other questions in different dimensions. Multilingual content production (with sub-areas such as translation and localization) receives high priority in these discussions. The same applies to the question of how professional fields change due to open data and content analytics. Will the content architect morph into a content curator?
As a framework for discussion, LIDER uses the W3C Linked Data for Language Technology Community Group (see www.w3.org/community/ld4lt/). This group is open to anyone. Activities include telephone conferences, workshops, surveys and so on. The intermediate results are made publicly available in the group’s wiki.
The LIDER Project is sponsored by the European Commission under reference number 610782 in the topic ICT-2013.4.1: Content analytics and language technologies.
Some of the material used in this article first appeared in German for the tekom Annual Conference (November 2014). It appears here as a translation with several modifications.