Identifying the Most Frequent Words in Text

Stopwords

Stopwords are common words in a language that carry little meaning on their own, such as "the", "and", and "of". Removing them is a standard preprocessing step in natural language processing: it reduces noise in the data, so the algorithms and techniques applied afterward can be both more accurate and more efficient.

NLTK provides built-in stopword lists for several languages, including English, French, German, and Spanish, through its stopwords corpus. Filtering these words out of a text leaves only the more meaningful words, which can significantly improve performance on tasks like sentiment analysis and topic modeling.

Task

Swipe to start coding

  1. Import the 'stopwords' corpus from NLTK.
  2. Create a set of English stopwords.
  3. Filter out stopwords from a tokenized text and create a list of non-stopword words.

Solution

import nltk

# Import the 'stopwords' corpus and the word tokenizer from NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required NLTK data
nltk.download("stopwords")
nltk.download("punkt")

# Create a set of English stopwords
stop_words = set(stopwords.words("english"))

# Initialize an empty list to hold the words that are not stopwords
filtered_list = []

# Tokenize the text 'story' into words and iterate through each word
for word in word_tokenize(story):
    # Use casefold() for a case-insensitive comparison against the stopword set
    if word.casefold() not in stop_words:
        # If the word is not a stopword, append it to the filtered list
        filtered_list.append(word)

# Print the result
print(filtered_list)
