
Applying Text Preprocessing in Practice

Documents

Before proceeding with a practical example of text preprocessing, it's important to understand the key component of a text corpus: the document.

A document is a separate piece of text within a corpus, for example, an email within a corpus of emails.

Essentially, every text corpus is a set of documents, so preprocessing the corpus means preprocessing each of the documents.

Loading the Corpus

Previously, we had our text corpora as string variables. However, in real-world scenarios, a text corpus is often stored in TXT files for purely textual data or in CSV files with multiple columns when additional data is associated with the text.

In our course, we will work with either CSV files or TXT files where each document starts on a new line. Therefore, we'll use the read_csv() function from the pandas library to load a text corpus from a file.

Let's take a look at an example:
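A minimal sketch of this step (the file name 'corpus.txt' is an assumption; substitute the actual path to your file):

import pandas as pd

# Read the TXT file: each line, separated by '\r', becomes one document
corpus = pd.read_csv('corpus.txt', sep='\r', header=None, names=['Document'])
print(corpus)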

Here, we read the TXT file into a DataFrame. We set sep='\r' to use the carriage return character as the separator, so that each line (document) becomes a separate row. We use header=None so the first line won't be treated as a header, and we specify names=['Document'] to name the single column 'Document'. As a result, we get a DataFrame with a single column named 'Document' containing 6 documents (sentences).

Preprocessing the Corpus

To preprocess the corpus, let's first create a function for preprocessing a single document:

Code and Description
def preprocess_document(doc):

Defines the preprocess_document function that takes a single string (document) doc as input.

doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)

Removes all characters from doc except alphabetical characters and spaces, making the operation case-insensitive and ASCII-only.

doc = doc.lower()

Converts the document to lowercase to ensure consistent case handling.

doc = doc.strip()

Trims leading and trailing whitespace from the document.

tokens = word_tokenize(doc)

Tokenizes the cleaned, lowercased document into individual words (tokens).

filtered_tokens = [token for token in tokens if token not in stop_words]

Filters out stop words, keeping only tokens that are not in the stop_words set.

doc = ' '.join(filtered_tokens)

Joins the filtered tokens back into a single string, separated by spaces.

return doc

Returns the preprocessed document as a single string.
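For reference, here is the function assembled into one runnable snippet. This is a sketch that assumes NLTK is installed; the stop_words set is built from NLTK's English stop word list, and the required tokenizer data is downloaded first.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer model and the stop word list (needed only once)
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_document(doc):
    # Keep only ASCII letters and whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    # Lowercase and trim leading/trailing whitespace
    doc = doc.lower().strip()
    # Split the cleaned document into word tokens
    tokens = word_tokenize(doc)
    # Drop the stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the remaining tokens into a single string
    return ' '.join(filtered_tokens)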

Let's now apply this function to each document in our DataFrame and create a column with the cleaned documents:
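A minimal sketch, assuming the DataFrame from above is named corpus (the new column name 'Cleaned Document' is an assumption):

# Apply the preprocessing function to every document and store the results in a new column
corpus['Cleaned Document'] = corpus['Document'].apply(preprocess_document)
print(corpus)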

As you can see, the corpus has been successfully preprocessed, and we will use this preprocessed version of the corpus later in the course.
