Lemmatization with POS Tagging | Stemming and Lemmatization
Introduction to NLP

Lemmatization with POS Tagging

The English language is full of words that can serve as multiple parts of speech with different meanings. For example, "running" can be a verb ("He is running.") or a noun ("Running is fun.").

As we have already seen, a lemmatizer can only accurately reduce a word to its base form if it knows the word's part of speech in the given context. POS tagging in turn provides this context, making lemmatization more precise.

Lemmatization with POS Tagging in NLTK

Since we are already familiar with both of these techniques separately, it's time to combine them. However, there is one important aspect to take into consideration: the difference between the POS tag format produced by pos_tag and the format that the WordNet Lemmatizer expects.
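For instance, here is the mismatch in action (a small illustrative snippet; the tags shown in the comment are what NLTK's tagger typically produces):

from nltk import pos_tag
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

# pos_tag() returns Penn Treebank tags such as 'VBG',
# while WordNetLemmatizer.lemmatize() expects 'n', 'v', 'a', 'r', or 's'
print(pos_tag(word_tokenize("He is running")))
# Typically: [('He', 'PRP'), ('is', 'VBZ'), ('running', 'VBG')]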

The mapping process involves converting the detailed Penn Treebank tags to the broader categories recognized by WordNet. For example, both 'VBD' (past tense verb) and 'VBG' (gerund or present participle) from Penn Treebank would map to 'v' (verb) for use with the WordNet Lemmatizer.

Let's write a function for this purpose:
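from nltk.corpus import wordnet as wn

# Map a Penn Treebank tag to the POS format the WordNet lemmatizer expects
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        # Default to noun if no match is found or the tag starts with 'N'
        return wn.NOUN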

This function simply checks the first letter of the Penn Treebank tag: if it's 'J', it returns the WordNet tag for adjectives; if 'V', for verbs; if 'R', for adverbs.

For all other cases, including when the tag starts with 'N' or doesn't match any specified condition, it defaults to returning the WordNet tag for nouns. ADJ, VERB, etc. are just string constants defined in nltk.corpus.wordnet, where ADJ, ADJ_SAT, ADV, NOUN, VERB = "a", "s", "r", "n", "v".
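For example, assuming the function above is defined:

print(get_wordnet_pos('VBG'))  # 'v' (wn.VERB)
print(get_wordnet_pos('NNS'))  # 'n' (wn.NOUN)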

Given this function, let's now perform lemmatization, running POS tagging beforehand:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet as wn
import nltk

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('punkt_tab')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to map NLTK's POS tags to the format used by the WordNet lemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        # Default to noun if no match is found or the tag starts with 'N'
        return wn.NOUN

text = "The leaves on the tree were turning a bright red, indicating that fall was leaving its mark."
text = text.lower()
tokens = word_tokenize(text)
tagged_tokens = pos_tag(tokens)

# Lemmatize each token with its POS tag
lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag))
                     for token, tag in tagged_tokens]

print("Original text:", text)
print("Lemmatized text:", ' '.join(lemmatized_tokens))

As you can see, we first performed POS tagging using the pos_tag() function. Next, we used a list comprehension to create a list of lemmatized tokens by calling the lemmatize() method with the current token and its correctly formatted tag (obtained via get_wordnet_pos(tag)) as arguments. We intentionally did not remove stop words to demonstrate that the code effectively processes all tokens.

Task

It's time to combine all the text preprocessing techniques we have learned so far to get lemmatized text without stop words, given the initial raw text. Your task is the following (a sketch of one possible solution follows the list):

  1. Convert text to lowercase.

  2. Load the list of English stop words and convert it to a set.

  3. Initialize a lemmatizer.

  4. Tokenize the text string.

  5. Filter out the stop words using list comprehension.

  6. Perform POS tagging on the filtered tokens using the pos_tag() function.

  7. Lemmatize the tagged tokens using a list comprehension, taking their POS tags into account.
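Here is a minimal sketch of one possible solution, assuming the get_wordnet_pos() function defined earlier in this chapter is available and the variable text holds the raw text provided in the exercise:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
import nltk

nltk.download('stopwords')

text = text.lower()                                    # 1. lowercase
stop_words = set(stopwords.words('english'))           # 2. stop words as a set
lemmatizer = WordNetLemmatizer()                       # 3. initialize the lemmatizer
tokens = word_tokenize(text)                           # 4. tokenize
tokens = [t for t in tokens if t not in stop_words]    # 5. filter out stop words
tagged_tokens = pos_tag(tokens)                        # 6. POS tagging
lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag))
                     for token, tag in tagged_tokens]  # 7. lemmatize with POS tags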
