Customizing Bag of Words

The Bag of Words model, particularly its implementation through the CountVectorizer class, offers several parameters for customization. These let you tailor the vocabulary to the needs of a specific text analysis task, which can make the resulting features more useful.

Minimum and Maximum Document Frequency

The min_df parameter defines the minimum number of documents a term must appear in to be included in the vocabulary. It can be given as an absolute count (an integer) or as a proportion of documents (a float between 0.0 and 1.0). It helps exclude rare terms, which are often less informative.

Similarly, max_df sets the maximum number of documents a term may appear in and still remain in the vocabulary, also specified as an absolute count or a proportion. It filters out overly common terms that contribute little to distinguishing between documents.

Let's take a look at an example:


from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Quick brown foxes leap over lazy dogs in summer.",
    "The quick brown fox is often seen jumping over lazy dogs.",
    "In summer, the lazy dog plays while the quick brown fox rests.",
    "A quick brown fox is quicker than the laziest dog."
]
# Exclude words that appear in more than 3 documents
vectorizer = CountVectorizer(max_df=3)
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)

Setting max_df=3 excludes words that appear in more than 3 documents. In our corpus, these are "quick" and "brown", which appear in all five documents, and "the", "fox", and "lazy", which appear in four. Because they occur in every or almost every document, they do little to differentiate between documents. Alternatively, we could set max_df=0.6, since 60% of 5 documents is 3 documents.

N-gram Range

The ngram_range parameter allows you to define the range of n-gram sizes to be included in the vocabulary.

By default, CountVectorizer considers only unigrams (single words). However, including bigrams (pairs of words), trigrams (triplets of words), or larger n-grams can enrich the model by capturing more context and semantic information, potentially improving performance.

This is achieved by passing a tuple (min_n, max_n) to the ngram_range parameter, where min_n represents the minimum n-gram size to include, and max_n represents the maximum size.

Let's now focus exclusively on trigrams that appear in two or more documents within our corpus:


from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Quick brown foxes leap over lazy dogs in summer.",
    "The quick brown fox is often seen jumping over lazy dogs.",
    "In summer, the lazy dog plays while the quick brown fox rests.",
    "A quick brown fox is quicker than the laziest dog."
]
# Include only trigrams that appear in 2 or more documents
vectorizer = CountVectorizer(min_df=2, ngram_range=(3, 3))
bow_matrix = vectorizer.fit_transform(corpus)
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(bow_df)

With these settings, the vocabulary contains five trigrams: "brown fox is", "over lazy dogs", "quick brown fox", "the lazy dog", and "the quick brown". Every other trigram occurs in only one document and is removed by min_df=2. Because the default tokenizer ignores single-character tokens, the "A" at the start of the last sentence does not take part in any trigram.

These are among the most commonly used parameters. To explore the others, refer to the scikit-learn documentation for CountVectorizer.
