The differences between lemmatization and stemming

By Joel Ross December 11, 2014

Human language technology (HLT) has become the trendy way of referring to the traditional concept of natural language processing (NLP). The main difference is that HLT tends to emphasize the technological part of the model. Also, processing a “natural language” could encompass communications between any living creatures, whether it’s birds chirping about the neighborhood cat, simian sign language, or dolphins’ telepathic plans to leave Earth. In essence, this is not our purpose; for this document, I will use the term HLT rather than NLP.

HLT is the field in which linguistics and computer science merge to solve problems in processing digital information. Think of it as a place where two normally disparate types of people — linguists and computer scientists — can come together and discuss a topic of interest to both groups. The only other intersect imaginable for two such factions might be The Lord of the Rings, although even here one group would contend that Tolkien’s use of gerunds in the Quenya language is flawed, while the other group would counter that Jackson’s vision of Middle Earth is weakened by excluding the Orcs’ attack on Lothlórien. Can’t we all just get along?

While many people have heard the term human language technology or its acronym HLT, quite a few of those people could not spot an HLT if they bumped into one. OK, that’s a bad example, because no one can really bump into it, unless we start talking in metaphors. What I mean is that many people know the term but find it difficult to explain adequately to someone who does not. There are also specific aspects of HLT, and related terminology, that are not well understood by a significant number of the people who must deal with HLT issues.

I will explain the main linguistics concepts of HLT, but I will not attempt to cover every detail. Therefore, you will not learn about morphemes, phonemes, lexemes or any other kind of eme right now. I’ll save segmentation, tokenization and n-gramization (I think I made up that word) for another time.

As mentioned above, HLT is the umbrella under which data processing needs are met through a combination of linguistics and computational technologies. When you have a huge amount of digital data and you need to process it (render it searchable and obtain valuable information from it), you will need to know some specific scientific methods. If the data is in a human language, such as English, you will also need to know how English is constructed and employed. If your data is in more than one human language, then you will need to know even more about how languages in general are constructed, and how each individual language works differently from the others.

Why do you need HLT?

Rather than attempting to explain every computational and linguistic concept that can be used, I will concentrate on a major issue in extracting information from data: search. What linguistic obstacles need to be overcome before you can efficiently search through data? Why do you need to employ HLT to obtain the results you need?

Let’s say your company’s president visited to try one of the kites your division manufactured and you want as many details of her visit as you can get — you are responsible for the company’s blog, but during the event you had fallen asleep at your desk while waiting for your brother’s next move in Words With Friends. So, you type president, fly and kite into your company’s search engine, and it spits out the results of its search. But what happens if your search engine hasn’t taken HLT into account? The simplest search will just look at the characters you typed and then look in the data for wherever that combination appears. Your results will show every instance of that perfect match, and you’ll be satisfied that you got everything you need to write your report and get that promotion, right? Well, without HLT, you might get fired.

First, you’ll get all the exact hits of president, fly and kite, but none of the other variations of the words that you also need: presidents, president’s, presidents’; flies, flew, flown, flied, flying and so on. To a search program without HLT, president is not the same thing as president’s; the former is a nine-character sequence, and the latter is an unrelated 11-character sequence.

Second, you’ll get a lot of hits you really don’t want at all. Although you’re looking for the verb fly, you’ll also get a lot of completely unrelated nouns: fly the insect, fly the baseball term and fly the trouser part. And kite is also a type of raptor bird.
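To make both problems concrete, here is a minimal sketch of a search that only matches exact tokens. The three sample sentences and the function name exact_match_search are made up for illustration; no real search engine is quite this naive, but the failure mode is the same.

```python
# Toy corpus: each string stands in for one document (made-up sentences).
documents = [
    "The president flew a kite",
    "Our presidents love flying kites",
    "A fly landed on the president's desk",
]


def exact_match_search(term, docs):
    """Return the indexes of docs that contain `term` as an exact token."""
    hits = []
    for index, doc in enumerate(docs):
        tokens = doc.lower().split()
        if term.lower() in tokens:
            hits.append(index)
    return hits


president_hits = exact_match_search("president", documents)
fly_hits = exact_match_search("fly", documents)
kite_hits = exact_match_search("kite", documents)

print(president_hits, fly_hits, kite_hits)
```

In this sketch, the second sentence is never returned even though it is clearly relevant, and the only hit for fly is the insect on the desk.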

So how can HLT linguistics improve search technologies so you won’t get fired? Even though HLT cannot save you from your obvious general incompetence, there are many ways it can help you obtain better search results. I will focus on just one area for now: lemmatization and stemming.

Lemmatization

When one hears the word lemma, it is tempting to think of a camel-like creature from the Andes that spits in your face. Or perhaps a cute rodent-like ball of fur that hurls itself, along with its friends and family, off a cliff into the sea. Let me remind you, however, that these are zoological references, and this is a document about linguistics. We are not discussing mammals here; we are investigating lemmas, or lemmata, if you want to sound fancy.

A lemma is a word’s basic form — also called its canonical, headword or dictionary form. That’s a very simple explanation, but you still don’t know what that means, right? So what does a word’s “basic form” mean? It’s a word’s uninflected form. Inflection is altering a word to impart a specific characteristic, without altering its actual meaning. Let’s look at verbs and nouns to see how inflections change various characteristics of the basic lemma form.

For verbs, inflection usually takes the form of conjugation, which modifies verbs according to tense (past, present, future), person (first, second, third), mood (imperative, conditional) and number (singular, plural). Non-English languages may include additional verb inflections such as gender (masculine, feminine).

In English, inflection can change nouns based on characteristics such as number (singular, plural) and possession. Non-English languages often include additional noun inflections such as gender (masculine, feminine) and whether the noun is the subject or the object of the sentence, for example.

The verb be and the noun calf display many changes when they are inflected, and sometimes the change is dramatic (such as be inflected to were). However, you can inflect some words very simply. For instance, you can inflect many verbs, such as walk, just by adding the suffix ed to impart past tense (walked); or you can inflect many nouns, such as dog, with the suffix s to impart plurality (dogs), and so on. Some verbs, such as be, are more highly inflected than other verbs.

Now that you know what inflection means, you should know that something that is uninflected has no inflection. So, the uninflected word is the lemma. For English verbs, the lemma usually is the infinitive form (to be, to walk, to inflect), without the to. For English nouns, the lemma usually is the singular nonpossessive form (calf, dog, clown). Therefore, when you have an English verb or noun that does not fit these basic characteristics, you can find its lemma by converting it to the word that does have those characteristics (infinitive or singular nonpossessive). Of course, there are other languages that are more complex than English when dealing with lemmas.
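The conversion described above can be sketched as a tiny lookup-plus-rules function. This is only a toy under stated assumptions: the IRREGULAR table and the toy_lemma function are hypothetical names, the part of speech is handed in by the caller, and the rules cover just the examples in this article. A real lemmatizer relies on a full dictionary and a part-of-speech tagger.

```python
# Hand-written exceptions for this sketch; not a real dictionary.
IRREGULAR = {
    ("was", "verb"): "be",
    ("were", "verb"): "be",
    ("is", "verb"): "be",
    ("calves", "noun"): "calf",
}


def toy_lemma(word, pos):
    """Return a guess at the lemma of `word`, given its part of speech."""
    word = word.lower()
    if pos == "noun":
        # Strip possessive markers first: dog's -> dog, dogs' -> dogs.
        if word.endswith("'s"):
            word = word[:-2]
        elif word.endswith("s'"):
            word = word[:-1]
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    if pos == "verb" and word.endswith("ed"):
        return word[:-2]  # walked -> walk
    if pos == "noun" and word.endswith("s"):
        return word[:-1]  # dogs -> dog
    return word


for word, pos in [("were", "verb"), ("walked", "verb"), ("dogs", "noun"), ("calves'", "noun")]:
    print(word, "->", toy_lemma(word, pos))
```

The rules are deliberately crude (a noun such as bus would lose its final s), which is exactly why real lemmatizers need so much more linguistic knowledge than this.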

Non-English lemmas

English-language verb lemmas are inflected by characteristics such as tense, person and number. Verbs in many non-English languages are also inflected by characteristics that don’t affect the form of English verbs. For instance, in Hebrew, verbs are inflected by gender and mood. For gender, the form of the verb changes depending on whether the subject is male or female. For mood, look at English first: the imperative mood for the verb lemma go is a command (Go away!) and the conditional mood is, well, conditional (I would go). You’ll notice that in each case, the lemma is not inflected. The point is that in some non-English languages, the lemma is inflected according to changes in grammatical mood. In Spanish (as in English), a lemma for a verb is derived from the infinitive form, as in escribir (to write). As you can see in Table 1, what English needs two or more words to express (such as I will write) can often be expressed in Spanish (and other languages) by only one word (escribiré), as seen in Table 2.

In English, a lemma for a verb is usually obtained from the infinitive form (to work). In Hebrew, it’s from the third-person/past-tense/masculine (he worked). Therefore, for the Hebrew verb לעבוד (to work):

Lemma: עבד (he worked)

Gender: עובד (I work masc.) עובדת (I work fem.)

Mood: יעבוד (work! masc.) תעבוד (work! fem.)

In Hebrew, all of the lemma’s inflected forms usually retain all of the lemma’s characters (in this sample, the three characters ע, ב and ד, which appear in each word). While neither gender nor mood affects the lemma in English (work), both inflections affect the lemma in Hebrew (and in other languages).
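The claim that each inflected form keeps the lemma’s characters can be checked mechanically. The sketch below is only an illustration: keeps_root_letters is a hypothetical helper, and the Hebrew words are the same forms listed above, typed in normal reading order. It simply tests whether the three lemma letters show up, in order, inside each form.

```python
def keeps_root_letters(root, form):
    """True if the letters of `root` appear in `form` in the same order."""
    remaining = iter(form)
    # `letter in remaining` advances the iterator, so order is enforced.
    return all(letter in remaining for letter in root)


ROOT = "עבד"  # the lemma: he worked
FORMS = ["לעבוד", "עובד", "עובדת", "יעבוד", "תעבוד"]

results = {form: keeps_root_letters(ROOT, form) for form in FORMS}
print(results)
```

An unrelated word such as בית (house) fails the same check, because it does not contain those letters.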

English-language lemmas for nouns are inflected by the characteristics number and possession. There are nouns in many non-English languages that are inflected by other characteristics that don’t matter to the English lemmas. For instance, in Spanish, nouns are inflected by gender. So are Hebrew nouns, but I sense that constantly switching your reading between left-to-right and right-to-left is making you feel queasy, so I’ll stick with a language that heads in only one direction.

In English, a lemma for a noun is usually obtained from the singular, non-possessive form. In Spanish, it’s from the singular, masculine form. Therefore, the Spanish noun gato (cat) switches depending on gender: gato (male cat) and gata (female cat). Spanish nouns, like English nouns, are also inflected by number (gatos, gatas). Note that although English nouns are inflected by possession, Spanish nouns are not.

Stemming

Stemming has a goal similar to that of lemmatization — finding the “root” of a word — but it goes about it very differently and often obtains different results. I won’t use any zoological references in this case, but I will turn to anthropology. Think of stemming as the Neanderthals digging for roots, and lemmatization as the Homo sapiens. Neanderthals certainly have the brawn, and they can fashion a rock lashed onto the top of a stick for getting the work done. However, Homo sapiens appear to have a greater grasp of the overall problem. Likewise, stemming will get you some quick, basic results without much in-depth preparation, but lemmatization will obtain a higher quality of results because it uses more knowledge of the task at hand. Before you expand this comparison to the discovery of fire, cave drawings and mammoth dung, let me move us away from the anthropological references and bring us back to linguistics.

Remember that lemmatization takes a word and, based on what part of speech it is and which language you are using, it provides the “basic form” of that word. Well, stemming doesn’t do any of that. Instead of employing complex rules to account for inflection (gender, number, tense and so on), stemming just chops things off, perhaps with a linguistic rock on a stick. While the lemmatizing Homo sapiens might use fermented barley and saber-toothed tiger pelts to entice a possible mate, the stemming Neanderthal relies on a heavy club and a bonk on the head, perhaps with a rock on a stick. In some cases, the results can actually be the same (and usually involve a headache the next morning), but in many other instances results can greatly vary.

Harking back to biology again, we know that stems and roots are different parts of plants. However, in the linguistic sense, particularly in English, they are often used interchangeably. So, when we refer to the stem of a word, or its root, we are often talking about the same thing. However, in some non-English languages, the root of a word is something very specific, while the stem of that same word might be something different. Whereas a lemma is always an actual word with meaning, stems can be just a combination of letters that have no meaning at all. Whereas lemmatization does not change a word’s part of speech, stemming sometimes does. Unlike a lemma, which is a specific basic form of a word, a stem is the basic form of a word to which affixes can be, well, affixed.

Affixes are characters/letters that are attached to a stem and thereby create an inflected form of that stem or create a completely different word. Affixes can even change a stem’s part of speech. The two most common types of affixes are prefixes and suffixes. Prefixes are attached in front of the stem, and suffixes are attached at the end of the stem. Some words can even have both types at the same time. Affixes usually cannot stand on their own as separate, meaningful words. To find the stem of a word, one needs to chop off any affixes, irrespective of the word’s part of speech.

In English, it is rare to chop off a prefix: doing so can change the meaning of a word, so it’s generally not done. In some other languages, such as Hebrew, prefixes are more prevalent:

dismount (verb; the prefix dis changes the word to its opposite meaning, so dismount is the stem)

unhappy (adjective; the prefix un changes the word to its opposite meaning, so unhappy is the stem)

בבית (in the house): the prefix ב (in the) is dropped, leaving the stem בית (house)

Suffixes in English can change a word to a different but related word:

walked (verb – changes tense)

porpoises (noun – changes number)

helpful (changes noun to adjective)

clearly (changes adjective to adverb)

punishment (changes verb to noun)

terrorize (changes noun to verb)

In all the previous examples, the stem is what remains once the suffix is removed (walk, porpoise, help, clear, punish, terror). These are all examples of stems that are actual words. However, sometimes the stem is not a word at all, particularly in non-English languages such as Spanish:

animated (anim is not a word)

arsenal (arsen is a different word)

prensa (prens is not a word)

traje (traj is not a word)

casa (cas is not a word)

There are various levels of stemming, which can produce very different results. For instance, in the word explosions, one stemming system can consider the s to be the suffix of the stem explosion. Another system might go further and consider ions as the suffix for the stem explos.
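The two levels of stemming in the explosions example can be sketched with one function and two suffix lists. Everything here is an assumption made for illustration: the suffix lists, the names strip_suffix, LIGHT_SUFFIXES and AGGRESSIVE_SUFFIXES, and the rule that at least three characters must remain. Production stemmers use far longer and more carefully ordered rule sets.

```python
# Two hypothetical rule sets; longer suffixes are listed first so they win.
LIGHT_SUFFIXES = ["s"]
AGGRESSIVE_SUFFIXES = ["ions", "ion", "s"]


def strip_suffix(word, suffixes):
    """Chop off the first matching suffix, leaving at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


light_stem = strip_suffix("explosions", LIGHT_SUFFIXES)
aggressive_stem = strip_suffix("explosions", AGGRESSIVE_SUFFIXES)

print(light_stem, aggressive_stem)
```

Note that the function knows nothing about parts of speech or meaning; it just chops, which is the whole point of the rock-on-a-stick comparison.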

Lemmatization vs. stemming

Now you should know the difference between lemmatization and stemming. However, the best way to appreciate that difference is to see how choosing one process or the other can lead to significant qualitative differences in search results, particularly against a multilingual database.

In a previous example, I pointed out how anim is the stem for the verb animated (after you chop off the suffix ated). However, anim is also the stem for the noun animal (after chopping off the suffix al). Now you have a problem: If you are searching for the term animated, you will get all the hits for animated and for animal. You are now overwhelmed with hits that are irrelevant to you — unless, of course, your actual aim in this particular case is to discover how to bring your dead parrot back to life (no, he is not “just resting”).

However, if you used lemmatization instead of stemming, your results generally will be more on target. For instance, the lemma for animated is animate (infinitive verb form), while the lemma for animal is animal (singular non-possessive noun form). Your hits already are more relevant to your search goal.
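Here is a small sketch of that difference in a search setting. The STEMS and LEMMAS tables are hand-built from the animated/animal example above, and the documents and the search function are invented for illustration; a real system would compute these keys with a stemmer or lemmatizer rather than look them up.

```python
# Hand-built lookups covering only the words in this example.
STEMS = {"animated": "anim", "animal": "anim"}
LEMMAS = {"animated": "animate", "animal": "animal"}

# Toy documents, already split into lowercase words.
documents = {
    1: ["animated", "films"],
    2: ["animal", "shelter"],
    3: ["kite", "festival"],
}


def by_stem(word):
    return STEMS.get(word, word)


def by_lemma(word):
    return LEMMAS.get(word, word)


def search(term, docs, normalize):
    """Return ids of docs containing a word that normalizes like `term`."""
    key = normalize(term)
    return sorted(
        doc_id
        for doc_id, words in docs.items()
        if any(normalize(word) == key for word in words)
    )


stem_hits = search("animated", documents, by_stem)
lemma_hits = search("animated", documents, by_lemma)

print(stem_hits, lemma_hits)
```

With stems as the search key, the animal shelter comes back alongside the animated films; with lemmas, only the films do.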

Sometimes words have stems and lemmas that look the same. For instance, for the verb sleeping, the lemma is sleep (infinitive form) and the stem is sleep (after removing the suffix ing). Just because the results are the same here doesn’t mean that the process to obtain the results was the same. For instance, with a slight inflection of this word, changing it to slept, the lemma remains sleep (infinitive form), but the stem is now slept, because it has no affixes, and stemming doesn’t know it’s just a different tense of the same verb.
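The sleeping and slept case can be reproduced with a few lines. This is a toy under stated assumptions: toy_stem strips a short, made-up list of suffixes, and lookup_lemma consults a hand-written table (VERB_LEMMAS) rather than a real dictionary.

```python
def toy_stem(word):
    """Chop a known suffix if at least three characters would remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-written table standing in for a dictionary of verb forms.
VERB_LEMMAS = {"sleeping": "sleep", "sleeps": "sleep", "slept": "sleep"}


def lookup_lemma(word):
    return VERB_LEMMAS.get(word, word)


for form in ("sleeping", "slept"):
    print(form, "stem:", toy_stem(form), "lemma:", lookup_lemma(form))
```

Both approaches agree on sleeping, but only the lookup knows that slept belongs to the same verb, because that knowledge lives in the table and not in the spelling.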

Non-English search

In English, lemmas and stems often look the same. However, in non-English languages, many parts of speech are highly inflected, so the lemmas and stems will look very different much of the time.

Table 3 uses the same examples as were given previously when discussing Spanish lemmatization or stemming.

As you can see in these examples, stemming has reduced the search term to ambiguous, non-word stems, which can result in many irrelevant hits. One way stemmers avoid too many “bad” hits is by returning hits in a prioritized order based on statistics. For instance, if the noun casa (house) is more prevalent in Spanish than a form of the verb “casar” (to marry), then a stemmer might default to inflections of the noun rather than the verb when the ambiguous “casa” appears without context as a search term. Lemmatization, however, will differentiate between unaffiliated words, avoiding overkill with returned hits, while expanding the search to other relevant forms of the search term.
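The idea of prioritizing by statistics can be sketched as a simple count over tagged text. The tiny TAGGED sample below is made up for this example and says nothing about real Spanish frequencies, and rank_readings is a hypothetical helper; the point is only to show how an ambiguous term like casa could be resolved toward its more common reading.

```python
from collections import Counter

# Tiny hand-tagged sample (invented): (surface form, lemma, part of speech).
TAGGED = [
    ("casa", "casa", "noun"),
    ("casa", "casa", "noun"),
    ("casas", "casa", "noun"),
    ("casa", "casar", "verb"),
]


def rank_readings(surface, tagged):
    """Order the possible (lemma, part of speech) readings by frequency."""
    counts = Counter(
        (lemma, pos) for form, lemma, pos in tagged if form == surface
    )
    return [reading for reading, _count in counts.most_common()]


ranking = rank_readings("casa", TAGGED)
print(ranking)
```

In this sample the noun reading outranks the verb reading, so a search for casa would lead with the house rather than the wedding.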

So, lemmatization and stemming are two methods for analyzing words for HLT enhancements in search technology. As I mentioned above, there are many additional morphological analytic techniques such as tokenization, segmentation and decompounding, and other concepts such as probabilistic n-gram models and Bayesian hidden Markov models. Just be glad that you don’t have to sift through any of those in this document, although be forewarned about a sequel.