Applying Text Preprocessing in Practice | Text Preprocessing Fundamentals
Applying Text Preprocessing in Practice

Documents

Before working through a practical example of text preprocessing, it's important to understand the basic building block of a text corpus: the document.

Essentially, every text corpus is a set of documents, so preprocessing the corpus means preprocessing each of the documents.
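
To make this concrete, here is a minimal sketch that treats a corpus as a plain Python list of documents and applies one preprocessing step to each of them. The sample sentences and the normalize helper are illustrative assumptions, not part of the course material.

```python
# A toy corpus: each element is one document (sample sentences are assumptions).
corpus = [
    "The cat sat on the mat.",
    "  Dogs bark LOUDLY!  ",
]

def normalize(doc: str) -> str:
    """Hypothetical per-document step: trim whitespace and lowercase."""
    return doc.strip().lower()

# Preprocessing the corpus means preprocessing every document in it.
cleaned_corpus = [normalize(doc) for doc in corpus]
print(cleaned_corpus)
```

Any per-document function can take the place of normalize here; the loop over documents stays the same.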

Loading the Corpus

Previously, we had our text corpora as string variables. However, in real-world scenarios, a text corpus is often stored in TXT files for purely textual data or in CSV files with multiple columns when additional data is associated with the text.

In this course, we will work with either CSV files or TXT files in which each document starts on a new line. Therefore, we'll use the read_csv() function from the pandas library to load a text corpus from a file.

Let's take a look at an example:

    import pandas as pd

    corpus = pd.read_csv(
        'https://content-media-cdn.codefinity.com/courses/c68c1f2e-2c90-4d5d-8db9-1e97ca89d15e/section_1/chapter_8/example_corpus.txt',
        sep='\r', header=None, names=['Document'])
    print(corpus)

Here, we read this TXT file into a DataFrame. We set sep='\r' so that the carriage return symbol, rather than the default comma, is used as the separator; since this character is not expected to occur inside a document, each line is read as one whole document. We use header=None so the first line won't be treated as a header, and we specify names=['Document'] to name the single column 'Document'. As a result, we get a DataFrame with a single column named 'Document' containing 6 documents (sentences).
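
For small files, an alternative sketch is to split the text into lines yourself and build the DataFrame directly, which avoids choosing a separator at all. The in-memory string below is an assumption standing in for the contents of a TXT file with one document per line.

```python
import pandas as pd

# Stand-in for the contents of a TXT file, one document per line (assumed sample text).
raw_text = "First document, with a comma.\nSecond document.\n"

# Split on line breaks and skip empty lines, then build a single-column DataFrame.
documents = [line for line in raw_text.splitlines() if line.strip()]
corpus = pd.DataFrame({'Document': documents})
print(corpus.shape)
```

Because the lines are split before pandas sees them, a comma inside a document cannot break it into several columns.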

Preprocessing the Corpus

In order to preprocess the corpus, let's first create a function for preprocessing each of the documents:

    import re
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    import nltk

    # Download the tokenizer model and the stop word lists
    nltk.download('punkt_tab')
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

    def preprocess_document(doc):
        # Remove everything except ASCII letters and whitespace.
        # flags must be passed by keyword: the fourth positional
        # argument of re.sub() is count, not flags.
        doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
        doc = doc.lower()
        doc = doc.strip()
        tokens = word_tokenize(doc)
        # Drop English stop words
        filtered_tokens = [token for token in tokens if token not in stop_words]
        doc = ' '.join(filtered_tokens)
        return doc
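
To see what each step of such a function does without downloading any NLTK data, here is a simplified sketch. The tiny stop word set, the whitespace split used instead of word_tokenize, and the preprocess_demo name are all assumptions made for illustration.

```python
import re

# Tiny hand-picked stop word set, an assumption standing in for NLTK's English list.
demo_stop_words = {"the", "is", "a", "on", "and"}

def preprocess_demo(doc: str) -> str:
    # Keep only ASCII letters and whitespace; flags is passed by keyword.
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # Lowercase and trim surrounding whitespace.
    doc = doc.lower().strip()
    # A plain whitespace split stands in for word_tokenize in this sketch.
    tokens = doc.split()
    # Drop stop words and join the remaining tokens back into one string.
    return ' '.join(t for t in tokens if t not in demo_stop_words)

print(preprocess_demo("The cat, aged 3, is on the mat!"))  # cat aged mat
```

Punctuation and digits are removed first, so the tokenizer only ever sees lowercase words separated by whitespace.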

Let's now apply this function to each document in our DataFrame and create a column with the cleaned documents:

    import re
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    import nltk
    import pandas as pd

    # Download the tokenizer model and the stop word lists
    nltk.download('punkt_tab')
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

    def preprocess_document(doc):
        # Remove everything except ASCII letters and whitespace
        # (flags passed by keyword; the fourth positional argument is count)
        doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
        doc = doc.lower()
        doc = doc.strip()
        tokens = word_tokenize(doc)
        # Drop English stop words
        filtered_tokens = [token for token in tokens if token not in stop_words]
        doc = ' '.join(filtered_tokens)
        return doc

    corpus = pd.read_csv(
        'https://content-media-cdn.codefinity.com/courses/c68c1f2e-2c90-4d5d-8db9-1e97ca89d15e/section_1/chapter_8/example_corpus.txt',
        sep='\r', header=None, names=['Document'])

    # Apply the function to every document and store the result in a new column
    corpus['Cleaned_Document'] = corpus['Document'].apply(preprocess_document)
    print(corpus)

As you can see, the corpus has been preprocessed successfully; we will use this preprocessed version later in the course.

Section 1. Chapter 8