Introduction to NLP

Stemming

Understanding Stemming

Let's start by understanding what stemming is.

Stemming is a text normalization technique widely used in NLP to reduce words to a common base form. The goal is to map different inflections of the same word to a single form that captures its core meaning. The result does not have to be a valid dictionary word.

To be more precise, stemming removes suffixes from words to obtain a base form, known as the stem. For example, "running" and "runs" both reduce to "run." Irregular forms such as "ran" are typically left unchanged, because no suffix-stripping rule maps them to "run." The purpose of stemming is to simplify analysis by treating related word forms as the same token, which can improve the efficiency and effectiveness of various NLP tasks.
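To make the idea of suffix stripping concrete, here is a toy sketch in plain Python. It is not the Porter or Lancaster algorithm; the rule set and the toy_stem name are hypothetical and deliberately tiny, and real stemmers apply many more rules and conditions.

```python
# Toy suffix stripper for illustration only (NOT the Porter or Lancaster algorithm).
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    # No rule matched: the word is returned unchanged.
    return word


words = ["running", "runs", "ran", "jumped"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['run', 'run', 'ran', 'jump']
```

Note that "ran" passes through untouched: it has no suffix to strip, so a purely rule-based stemmer has no way to connect it to "run."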

Stemming with NLTK

NLTK provides several stemming algorithms, the most popular being the Porter Stemmer and the Lancaster Stemmer (the PorterStemmer and LancasterStemmer classes). These algorithms apply a fixed set of rules to strip affixes, mainly suffixes, and derive the stem of a word.

All of the stemmer classes in NLTK share a common interface. First, you have to create an instance of the stemmer class and then use its stem() method for each of the tokens. Let's take a look at the following example:
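A minimal sketch of this workflow is shown here. The sample sentence is hypothetical, and the regular-expression tokenizer and the small stop word set are simplified stand-ins (assumptions of this sketch) for NLTK's tokenizer and stop word corpus, which require downloaded data. Only PorterStemmer, LancasterStemmer and their stem() method come from NLTK.

```python
import re

from nltk.stem import LancasterStemmer, PorterStemmer

# Hypothetical sample text, not taken from the lesson.
text = "The runners were running quickly through the beautiful gardens"

# Tokenization: a simple regex stand-in for an NLTK tokenizer.
tokens = re.findall(r"[a-z]+", text.lower())

# Stop word filtering: a tiny illustrative set instead of NLTK's stop word corpus.
stop_words = {"the", "were", "through"}
tokens = [token for token in tokens if token not in stop_words]

# Create one instance of each stemmer class.
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Apply stem() to every token using list comprehensions.
porter_stems = [porter.stem(token) for token in tokens]
lancaster_stems = [lancaster.stem(token) for token in tokens]

# Print the token and both stems side by side for comparison.
for token, por, lan in zip(tokens, porter_stems, lancaster_stems):
    print(f"{token:<12} {por:<12} {lan}")
```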

As you can see, there is nothing complicated here. First, we applied tokenization, then filtered out the stop words, and finally applied stemming to our tokens using a list comprehension. As for the results, the two stemmers produced rather different stems. This is because the Lancaster Stemmer has about twice as many rules as the Porter Stemmer and is one of the most "aggressive" stemmers.

Overall, the Porter Stemmer is the more popular option and generally produces more meaningful results than the Lancaster Stemmer, which tends to overstem words.
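One way to spot overstemming is to group words by their stem: unrelated words that end up in the same group have been conflated. The sketch below assumes a hypothetical word list and a hypothetical helper named group_by_stem; run it to compare how the two stemmers group the same words.

```python
from collections import defaultdict

from nltk.stem import LancasterStemmer, PorterStemmer


def group_by_stem(stemmer, words):
    """Map each stem to the input words that reduce to it (hypothetical helper)."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


# Hypothetical word list mixing related and unrelated words.
words = ["organization", "organ", "universe", "university", "running", "runs"]

porter_groups = group_by_stem(PorterStemmer(), words)
lancaster_groups = group_by_stem(LancasterStemmer(), words)

print("Porter:   ", porter_groups)
print("Lancaster:", lancaster_groups)
```

A group that contains words with clearly different meanings is a sign that the stemmer was too aggressive for that vocabulary.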
