
Applying Text Preprocessing in Practice

Documents

Before proceeding with a practical example of text preprocessing, it's important to understand the key component of a text corpus: the document.

A document is a separate piece of text within a corpus, for example, an email within a corpus of emails.

Essentially, every text corpus is a set of documents, so preprocessing the corpus means preprocessing each of the documents.

Loading the Corpus

Previously, we had our text corpora as string variables. However, in real-world scenarios, a text corpus is often stored in TXT files for purely textual data or in CSV files with multiple columns when additional data is associated with the text.

In our course, we will work with either CSV files or TXT files where each document starts on a new line. Therefore, we'll use the read_csv() function from the pandas library to load a text corpus from a file.

Let's take a look at an example:
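A minimal sketch of this step (the file name 'corpus.txt' is an assumption; substitute the actual path to your file):

import pandas as pd

# Read the TXT file: each line, separated by '\r', becomes one document
corpus = pd.read_csv('corpus.txt', sep='\r', header=None, names=['Document'])
print(corpus)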

Here, we read the TXT file into a DataFrame. We set sep='\r' to use the carriage return character as the separator, so that each line (document) becomes a separate row. We use header=None so the first line won't be treated as a header, and we specify names=['Document'] to name the single column 'Document'. As a result, we get a DataFrame with a single column named 'Document' containing 6 documents (sentences).

Preprocessing the Corpus

To preprocess the corpus, let's first create a function for preprocessing a single document:

Code and Description
def preprocess_document(doc):

Defines the preprocess_document function that takes a single string (document) doc as input.

doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)

Removes all characters from doc except alphabetical characters and spaces, making the operation case-insensitive and ASCII-only.

doc = doc.lower()

Converts the document to lowercase to ensure consistent case handling.

doc = doc.strip()

Trims leading and trailing whitespace from the document.

tokens = word_tokenize(doc)

Tokenizes the cleaned, lowercased document into individual words (tokens).

filtered_tokens = [token for token in tokens if token not in stop_words]

Filters out stop words, keeping only tokens that are not in the stop_words set.

doc = ' '.join(filtered_tokens)

Joins the filtered tokens back into a single string, separated by spaces.

return doc

Returns the preprocessed document as a single string.
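For reference, here is the function assembled into one runnable snippet. This is a sketch that assumes NLTK is installed; the stop_words set is built from NLTK's English stop word list, and the required tokenizer data is downloaded first.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer model and the stop word list (needed only once)
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_document(doc):
    # Keep only ASCII letters and whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # Lowercase and trim leading/trailing whitespace
    doc = doc.lower().strip()
    # Split the cleaned document into word tokens
    tokens = word_tokenize(doc)
    # Drop the stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the remaining tokens into a single string
    return ' '.join(filtered_tokens)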

Let's now apply this function to each document in our DataFrame and create a column with the cleaned documents:
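A minimal sketch, assuming the DataFrame from above is named corpus (the new column name 'Cleaned Document' is an assumption):

# Apply the preprocessing function to every document and store the results in a new column
corpus['Cleaned Document'] = corpus['Document'].apply(preprocess_document)
print(corpus)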

As you can see, the corpus has been successfully preprocessed, and we will use this preprocessed version of the corpus later in the course.
