Introduction to NLP
Removing Stop Words
Understanding Stop Words
Stop words are very common words, such as "the", "is", and "in", that carry little meaning on their own. Removing them is a common step in text preprocessing for NLP.
Stop words are typically filtered out after tokenization for tasks such as sentiment analysis, topic modeling, or keyword extraction. Removing them shrinks the dataset, which improves computational efficiency, and focuses the analysis on the words that carry significant meaning.
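To make the size reduction concrete, here is a minimal sketch. It uses a tiny hand-picked stop word set and plain whitespace splitting; both are stand-ins chosen for illustration, not NLTK's actual list or tokenizer.

```python
# Hypothetical, hand-picked stop words for illustration only;
# a real list (such as NLTK's) is much longer.
SAMPLE_STOP_WORDS = {"the", "is", "on", "a", "and", "of", "in"}

text = "the cat is on the mat and the dog is in the garden"

# Whitespace splitting stands in for a real tokenizer here
tokens = text.split()

# Keep only the tokens that are not in the sample stop word set
content_tokens = [t for t in tokens if t not in SAMPLE_STOP_WORDS]

print(len(tokens), "tokens before filtering")
print(len(content_tokens), "tokens after filtering:", content_tokens)
```

In this toy sentence, 13 tokens shrink to 4, and the tokens that remain are the ones that describe what the sentence is about.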
Removing Stop Words with NLTK
To make things easier, nltk provides lists of stop words in multiple languages, which can be easily accessed and used to filter stop words from text data.
Here's how you can get the list of English stop words in NLTK and convert it to a set:
```python
import nltk
from nltk.corpus import stopwords

# Download the stop words list
nltk.download('stopwords')

# Load English stop words
stop_words = set(stopwords.words('english'))
print(stop_words)
```
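The conversion to a set matters because the filter checks membership once per token. A list lookup compares elements one by one, while a set lookup is a hash lookup. The sketch below illustrates the difference with a made-up stand-in list (the entries are hypothetical, not NLTK's data), and the exact timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for a stop word list; the entries are made up
word_list = [f"word{i}" for i in range(200)]
word_set = set(word_list)

# Worst case for the list: the item is absent, so every element is compared
probe = "absent"

list_time = timeit.timeit(lambda: probe in word_list, number=20_000)
set_time = timeit.timeit(lambda: probe in word_set, number=20_000)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.4f}s")
```

Both containers give the same answer for every lookup; only the cost per lookup differs, which adds up when filtering a large corpus.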
With this in mind, let's take a look at a complete example of how to filter out stop words from a given text:
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt_tab')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

text = "This is an example sentence demonstrating the removal of stop words."
text = text.lower()

# Tokenize the text
tokens = word_tokenize(text)

# Remove stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)
```
As you can see, we first download the stop words and the tokenizer data, then tokenize the text. Next, a list comprehension builds a list containing only the tokens that are not stop words. The word.lower() call in the if clause lowercases each token before the lookup, since NLTK's English stop words are exclusively lower case. In this example the text was already lowercased with text.lower(), so the call is redundant here, but it keeps the filter correct for text that has not been lowercased.
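The effect of the lowercase comparison is easiest to see on tokens that keep their original casing. The following is a small sketch with a hypothetical two-entry stop word set standing in for NLTK's list:

```python
# Hypothetical two-entry stop word set, lower case like NLTK's English list
stop_words = {"this", "is"}

tokens = ["This", "is", "Important"]

# Without lowercasing, "This" does not match "this" and slips through
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing only for the comparison keeps the original casing in the output
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['This', 'Important']
print(case_insensitive)  # ['Important']
```

Set membership is case-sensitive, so either the whole text or each token must be lowercased before the check.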
Your task is to convert the text to lowercase, load the English stop words list from nltk and convert it to a set, then tokenize the text string using the word_tokenize() function, and filter out the stop words from tokens using a list comprehension.