Introduction to NLP
Applying Text Preprocessing in Practice
Documents
Before working through a practical example of text preprocessing, it's important to understand the key component of a text corpus: the document.
Essentially, every text corpus is a set of documents, so preprocessing the corpus means preprocessing each of the documents.
Loading the Corpus
Previously, we had our text corpora as string variables. However, in real-world scenarios, a text corpus is often stored in TXT files for purely textual data or in CSV files with multiple columns when additional data is associated with the text.
In our course, we will work with either CSV files or TXT files, where each document starts on a new line. Therefore, we'll use the read_csv() function from the pandas library to load a text corpus from a file.
Let's take a look at an example:
import pandas as pd

corpus = pd.read_csv(
    'https://content-media-cdn.codefinity.com/courses/c68c1f2e-2c90-4d5d-8db9-1e97ca89d15e/section_1/chapter_8/example_corpus.txt',
    sep='\r', header=None, names=['Document'])
print(corpus)
Here, we read this TXT file into a DataFrame. We set sep='\r' so that the carriage return, rather than the default comma, is the column separator; commas inside a document are then not treated as column boundaries, and each line of the file is read as one document. We use header=None so the first line won't be considered a header, and we specify names=['Document'] to name the single column 'Document'. As a result, we will have a DataFrame with a single column named 'Document' containing 6 documents (sentences).
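For comparison, the same one-column layout can be built without read_csv() by splitting the text into lines. This is only a minimal sketch of an alternative, not the approach used in this course; the raw_text variable and its sample sentences are made up for illustration.

```python
import pandas as pd

# Hypothetical in-memory text standing in for a TXT file, one document per line
raw_text = "First document, with a comma.\nSecond document.\nThird one here."

# Each line becomes one row of the single 'Document' column
corpus = pd.DataFrame({'Document': raw_text.splitlines()})

print(corpus.shape)
print(corpus['Document'][0])
```

Because the text is split only at line breaks, the comma in the first sample document stays inside that document.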
Preprocessing the Corpus
In order to preprocess the corpus, let's first create a function for preprocessing each of the documents:
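import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word lists
nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_document(doc):
    # Remove everything except ASCII letters and whitespace
    # (flags must be passed by keyword; the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # Convert to lowercase and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # Split into tokens and drop stop words
    tokens = word_tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the remaining tokens into a single string
    doc = ' '.join(filtered_tokens)
    return doc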
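To see what each step contributes, here is a dependency-free sketch of the same pipeline traced on one sentence. It is an approximation under stated assumptions: STOP_WORDS is a tiny hypothetical list standing in for NLTK's English stop words, str.split() stands in for word_tokenize(), and the sample sentence is invented.

```python
import re

# Hypothetical mini stop-word list, a stand-in for NLTK's English list
STOP_WORDS = {"the", "is", "a", "of", "and", "in"}

sample = "The price of 2 apples is $3, in total!"

# Step 1: keep only ASCII letters and whitespace
letters_only = re.sub(r'[^a-zA-Z\s]', '', sample)

# Step 2: lowercase and trim surrounding whitespace
normalized = letters_only.lower().strip()

# Step 3: split on whitespace (a stand-in for word_tokenize)
tokens = normalized.split()

# Step 4: drop stop words and rejoin the rest
result = ' '.join(token for token in tokens if token not in STOP_WORDS)

print(tokens)
print(result)  # price apples total
```

Digits and punctuation disappear in the first step, which is why the stop-word filter in the last step only ever sees lowercase words.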
Let's now apply this function to each document in our DataFrame and create a column with the cleaned documents:
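import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word lists
nltk.download('punkt_tab')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_document(doc):
    # Remove everything except ASCII letters and whitespace
    # (flags must be passed by keyword; the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # Convert to lowercase and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # Split into tokens and drop stop words
    tokens = word_tokenize(doc)
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the remaining tokens into a single string
    doc = ' '.join(filtered_tokens)
    return doc

# Load the corpus, one document per line
corpus = pd.read_csv(
    'https://content-media-cdn.codefinity.com/courses/c68c1f2e-2c90-4d5d-8db9-1e97ca89d15e/section_1/chapter_8/example_corpus.txt',
    sep='\r', header=None, names=['Document'])

# Apply the preprocessing function to every document
corpus['Cleaned_Document'] = corpus['Document'].apply(preprocess_document)
print(corpus)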
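The apply() call runs the given function once per value in the column and collects the results into a new column of the same length. A minimal sketch of that behavior on a toy DataFrame follows; the toy data and the simple_clean() helper are hypothetical stand-ins, not part of the course corpus or of preprocess_document().

```python
import pandas as pd

# Toy corpus (hypothetical), standing in for the course file
toy = pd.DataFrame({'Document': ['Hello, World!', 'NLP is fun.']})

def simple_clean(doc):
    # Hypothetical stand-in for preprocess_document:
    # lowercase the text and strip sentence-ending punctuation from the ends
    return doc.lower().strip('.!?')

# apply() calls simple_clean once for each document in the column
toy['Cleaned_Document'] = toy['Document'].apply(simple_clean)

# The same values, computed with an explicit loop over the column
same = [simple_clean(doc) for doc in toy['Document']]

print(toy)
```

The original column is left unchanged, so the raw and cleaned versions of each document stay side by side for comparison.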
As you can see, our corpus is successfully preprocessed, so we will use the preprocessed version of this corpus later in the course.