April 5, 2022

Stemming vs Lemmatization in NLP

Richard "Brin" Brindley


Stemming has long been accepted as an important part of natural language processing (NLP). However, as both artificial intelligence (AI) and NLP have evolved over the years, expectations of what constitutes accurate NLP have grown, and that warranted rise in expectations means stemming should no longer be considered the de facto approach in NLP.

Stemming falls short of the ultimate goal of NLP: there is nothing natural about a process that often produces non-words or meaningless fragments. For many use cases where stemming is considered the standard, an alternative method, lemmatization, is far more effective and can produce results worthy of the much-vaunted term NLP.

Here’s how stemming and lemmatization stack up, why the latter, not the former, should be considered the default mechanism in NLP, and what the distinction means for business.

What Is Stemming?

Stemming algorithms work by cutting off the end or beginning of a word in order to find a root version of it, or its base form; they achieve this by considering a list of common prefixes and suffixes often found in inflected versions of the word, and eliminating them.

For example, consider the word “study.” Stemming would allow a machine to analyze the word “studying” and correctly identify the base word “study.” A machine could use this result to categorize content or understand what a text is about. But a quick look at another inflection, “studies,” reveals why stemming is a suboptimal form of language processing. Stemming would turn “studies” into “studi,” failing to capture the word’s true base form (known, as explained in the next section, as the word’s lemma).


Stemming may seem crude in some ways, but it is still common in business cases that require language processing. For example, because privacy restrictions are making it harder for advertisers and data companies to serve internet users ads based on their personal behavior, advertising technology (adtech) companies are increasingly turning to analyses of the content on webpages to serve ads related to that content. Many of them use stemming to analyze content, determine its meaning, and serve ads accordingly. But as the “studies” example indicates, this widespread method can result in a lack of precision and waste millions of dollars in ineffective advertising spending.

What’s more, companies using stemming for adtech and other business use cases may present their solutions as full NLP, which obscures the inefficiency of their methods and raises questions of scientific imprecision. These are the problems a comparative analysis of stemming and lemmatization can address.

How Lemmatization Improves Upon Stemming

Lemmatization takes into consideration the morphology of a word to detect its lemma: the base form of all its inflections. In other words, lemmatization does not chop off part of a word in the hope of identifying its ‘stem,’ as in stemming. Rather, it recognizes a lemma as the canonical form of a set of words, allowing for much higher accuracy in determining what a text really means.

Let’s return to the example of “studies.” When stemming encounters “studies,” it merely chops off the last two letters, failing to produce the correct term, “study.” By contrast, lemmatization recognizes “studies” as the third-person singular form of the verb “study” and correctly identifies “study” as the word’s lemma. As you might imagine, lemmatization requires more technical sophistication and data processing than stemming; to correctly match “studies” to “study,” lemmatization algorithms rely on detailed dictionaries through which they can link an inflection to its base form.
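The dictionary-backed lookup described above can be sketched as follows. The tiny lemma table here is a stand-in for the large morphological dictionaries (plus part-of-speech information) that real lemmatizers such as NLTK’s WordNetLemmatizer or spaCy rely on; all names are illustrative:

```python
# A minimal dictionary-based lemmatizer: inflected forms map to their lemma.
# Real systems use large dictionaries and part-of-speech tagging; this
# hand-written table is purely illustrative.
LEMMA_DICT = {
    "studies": "study",
    "studying": "study",
    "studied": "study",
}

def lemmatize(word: str) -> str:
    """Return the word's lemma if known, otherwise the word itself."""
    return LEMMA_DICT.get(word, word)

print(lemmatize("studies"))   # study
print(lemmatize("studying"))  # study
```

Unlike the suffix-stripping approach, every output here is a genuine word, because the dictionary encodes linguistic knowledge rather than surface patterns.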

Tying the inflection of a word back to its lemma eliminates imprecision in content analysis, helping machines better understand the meaning of individual words. But lemmatization’s benefits extend beyond that.


Because lemmatization allows a machine to deduce the meaning of words more accurately, it makes better use of the data available and, unlike stemming, avoids discarding many words through shallow, imprecise filtering. This vastly improves a system’s understanding of context and its ability to match the meaning of one text to another (as in contextual advertising, where an ad must be matched with a text whose meaning is related to it).

What Do We Mean When We Say NLP?

The democratization of sophisticated data-driven technologies often leads to imprecise usage of terms, confusing vendors, customers, investors, and others as to precisely what solutions tech companies are actually able to supply, and what the buzzwords of our age signify. For example, what is AI? Is machine learning AI? The metaverse? Natural language processing has taken its place among these unclear terms.

One aspect of NLP on which we should be able to agree is that it refers to the ability of machines to understand language naturally, or as humans do. With this in mind, it should be clear that stemming, which fails to recognize the word “studies” as an inflection of “study,” does not completely meet the brief. Lemmatization, on the other hand, not only mimics human understanding in its ability to match “studies” to “study”; it also empowers true semantic analysis by enabling the matching of that word to all the inflected forms of related terms.

The correct detection of a word’s meaning via lemmatization allows for the creation of lemma maps, which are networks of words related to a lemma. For example, having found the word “study,” a lemmatization-empowered program, when presented with a map that also includes terms such as “research” and “inspect,” should be able to match that word and its many inflected forms with the other terms in the map and their inflections, such as “researching” and “inspected.” This sophisticated, indeed natural, language processing allows a machine to develop a deep understanding of content’s meaning: an understanding not so dissimilar from that of the human the machine endeavors to emulate.
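One way to picture a lemma map is as a set of related lemmas consulted after lemmatization. This is a hypothetical sketch, assuming a dictionary-based lemmatizer; the data structures and names are invented for illustration:

```python
# Illustrative lemma dictionary: inflected forms map to their lemma.
LEMMA_DICT = {
    "studies": "study", "studying": "study",
    "researching": "research", "researched": "research",
    "inspecting": "inspect", "inspected": "inspect",
}

# Illustrative lemma map: a lemma and the network of lemmas related to it.
LEMMA_MAP = {"study": {"study", "research", "inspect"}}

def lemmatize(word: str) -> str:
    """Return the word's lemma if known, otherwise the word itself."""
    return LEMMA_DICT.get(word, word)

def related_to(word: str, lemma: str) -> bool:
    """True if the word's lemma appears in the given lemma's map."""
    return lemmatize(word) in LEMMA_MAP.get(lemma, set())

print(related_to("researching", "study"))  # True
print(related_to("inspected", "study"))    # True
print(related_to("running", "study"))      # False
```

Because matching happens at the lemma level, any inflection of any term in the map is recognized, which is what allows the map to cover “researching” and “inspected” without listing every surface form of every related word.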

The Stakes of the Stemming-Lemmatization Distinction

The stakes of the stemming-lemmatization distinction are of two kinds.

The first is clarity about terms. Technology evolves, and as it does, our understanding of the language we use to refer to technology should evolve with it. Using stemming as the default method in NLP is not only scientifically imprecise but also risks misleading investors, customers (in business), fellow technologists, and scientists. We need clarity to build the best solutions and deal with each other honestly going forward.

Second, in the realm of business, NLP is becoming vastly more relevant for search, voice-enabled technologies, and advertising, among other applications. Identifying stemming as inaccurate and lemmatization as the default NLP method will push all these disciplines forward, better serving engineers, businesses, and customers alike. That’s not just good business; it’s a more ethical way to work with one another.

About the author: Richard “Brin” Brindley is the Chief Information Officer (CIO) and UK General Manager of Vibrant Media, the technology company that addresses the full range of agencies’ and marketers’ contextual data and privacy-safe advertising needs. Brin’s strong technical achievement and project-management background, alongside his passion for data science and machine learning/AI systems, has prepared him to excel at both hands-on, leading-edge product development and large-scale infrastructure design and implementation. Brin is an active English Rugby Football Union Referee, scuba diver, wine lover, and musician.


 

 
