Course Content
Introduction to NLP
2. Stemming and Lemmatization
Applying Text Preprocessing in Practice
Documents
Before proceeding with a practical example of text preprocessing, it's important to understand the key components of a text corpus: documents.
Essentially, every text corpus is a set of documents, so preprocessing the corpus means preprocessing each of the documents.
Loading the Corpus
Previously, we had our text corpora as string variables. However, in real-world scenarios, a text corpus is often stored in TXT files for purely textual data, or in CSV files with multiple columns when additional data is associated with the text.
In our course, we will work with either CSV files or TXT files, where each document starts on a new line. Therefore, we'll use the read_csv() function from the pandas library to load a text corpus from a file.
Let's take a look at an example:
Here, we read this TXT file into a DataFrame. We set sep='\r' to use the carriage return symbol as the separator: since it never appears in the text itself, each new line is read as a separate document. We use header=None so the first line won't be treated as a header, and we specify names=['Document'] to name the single column 'Document'. As a result, we get a DataFrame with a single column named 'Document' containing 6 documents (sentences).
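As a sketch of what this call might look like (the file contents are assumed for illustration, and an in-memory StringIO stands in for the TXT file):

```python
import io

import pandas as pd

# Stand-in for a TXT file where each line is one document (contents assumed)
raw_text = "First document here\nSecond document here\nThird document here"

# sep='\r' never occurs in the text, so each newline-delimited line
# is read as a single value in the 'Document' column
corpus = pd.read_csv(
    io.StringIO(raw_text),
    sep='\r',
    header=None,
    names=['Document']
)

print(corpus)
```

When loading from disk, the StringIO object would simply be replaced by the file path.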
Preprocessing the Corpus
In order to preprocess the corpus, let's first create a function for preprocessing each of the documents:
Code Description

def preprocess_document(doc):
Defines the preprocess_document function that takes a single document doc as a string.

doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
Removes all characters from doc except alphabetical characters and whitespace, with the case-insensitive and ASCII-only flags. Note that the flags must be passed via the flags keyword argument; passed positionally, re.sub() would interpret them as its count argument.

doc = doc.lower()
Converts the document to lowercase to ensure consistent case handling.

doc = doc.strip()
Trims leading and trailing whitespace from the document.

tokens = word_tokenize(doc)
Tokenizes the cleaned, lowercased document into individual words (tokens).

filtered_tokens = [token for token in tokens if token not in stop_words]
Filters out the stop words.

doc = ' '.join(filtered_tokens)
Joins the filtered tokens back into a single string, separated by spaces.

return doc
Returns the preprocessed document as a single string.
Let's now apply this function to each document in our DataFrame and create a column with the cleaned documents:
As you can see, our corpus is successfully preprocessed, so we will use the preprocessed version of this corpus later in the course.