Lemmatization | Stemming and Lemmatization
Introduction to NLP

Lemmatization

Lemmatization vs Stemming

First, let's define what lemmatization is and how it is different from stemming.

Lemmatization is a text normalization technique used in NLP to reduce words to their base or root form, known as a lemma.

Unlike stemming, which crudely chops off word endings, lemmatization considers the context and converts the word to its meaningful base form. For example, 'am', 'are', and 'is' are all lemmatized into 'be'. This approach can significantly reduce the size of the vocabulary (the number of unique words) in large text corpora, thereby increasing efficiency when training models.

While lemmatization is more accurate than stemming, it is also more computationally expensive and can be time-consuming on large datasets. Furthermore, for even better accuracy, performing morphological analysis and part-of-speech tagging before lemmatization is recommended.

Don't worry, we will discuss part of speech tagging in the next chapter.

Lemmatization with NLTK

The WordNet Lemmatizer, provided by the NLTK library, leverages the WordNet corpus to perform lemmatization.

WordNet is not just any corpus but a rich and semantically oriented database of English words. It organizes words into sets of synonyms called synsets, providing a unique framework for understanding word meanings based on their semantic relationships. Each synset represents a distinct concept and includes definitions, examples, and semantic relations between terms, such as hypernyms (more general terms) and hyponyms (more specific terms).

When you use the WordNet Lemmatizer, it looks up the target word in the WordNet database to find the most appropriate lemma (base form) of the word.

As mentioned above, because words can have different meanings in different contexts (e.g., "running" as a verb vs. "running" as a noun), the lemmatizer may require you to specify the part of speech (e.g., verb, noun, adjective). This helps it select the correct lemma based on the word's role in a sentence.

Let's now take a look at an example:

Code Description
import nltk
from nltk.stem import WordNetLemmatizer

These lines import the nltk package and the WordNetLemmatizer class.

nltk.download('wordnet')

This line downloads the WordNet corpus, ensuring that all WordNet-related functionality, such as lemmatization, is available.

lemmatizer = WordNetLemmatizer()

This line creates an instance of the WordNetLemmatizer class used to perform lemmatization.

lemmatized_words = [lemmatizer.lemmatize(word, 'v') for word in words]

This line creates a list of lemmatized words using a list comprehension. Lemmatization is performed via the lemmatize() method of the lemmatizer object: the word string is passed as the first argument, and the desired part of speech as an optional second argument ('v' for verb, 'a' for adjective, 'n' for noun, etc.; the default is 'n').

As you can see, from the coding perspective, the approach is rather straightforward. Once again, in order to lemmatize words accurately across an entire corpus, it would be best to first perform part-of-speech (POS) tagging, which we will cover in the following chapter.

Task

Your task is to lemmatize the tokens extracted from the given text string. Tokenization has already been applied, with stop words filtered out.

  1. Import the WordNet lemmatizer.
  2. Download the WordNet corpus.
  3. Initialize the WordNet lemmatizer.
  4. Lemmatize the tokens using list comprehension.


Section 2. Chapter 3