Customizing Bag of Words | Basic Text Models
Introduction to NLP

Customizing Bag of Words

The Bag of Words model, particularly as implemented by scikit-learn's CountVectorizer class, offers several parameters for customization. These let you control which terms enter the vocabulary, so the model can be tailored to the needs of a specific text analysis task.

Minimum and Maximum Document Frequency

The min_df parameter defines the minimum number of documents a term must appear in to be included in the vocabulary. It can be given either as an absolute count (an integer) or as a proportion of the documents (a float between 0.0 and 1.0). It helps exclude rare terms, which are often less informative.

Similarly, max_df sets the maximum number of documents a term may appear in to remain in the vocabulary, and it can also be given as an absolute count or a proportion. It filters out overly common terms that don't help distinguish between documents.

Let's take a look at an example:

Setting max_df=3 excludes words that appear in more than 3 documents. In our corpus, these include words like "quick" and "brown". Because they appear in all or nearly all of the documents, they do little to differentiate one document from another. Alternatively, we could set max_df=0.6, since 60% of 5 documents is 3 documents.

N-gram Range

The ngram_range parameter allows you to define the range of n-gram sizes to be included in the vocabulary.

An n-gram is a contiguous sequence of n items from a given sample of text. These items are typically words (in our case), syllables, or letters.
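To make the definition concrete, here is a small sketch in plain Python that lists the word n-grams of a sentence. The helper name word_ngrams is hypothetical, and splitting on whitespace is a simplification of the tokenization CountVectorizer performs.

```python
def word_ngrams(text, n):
    """Return all contiguous word n-grams of size n from text."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the quick brown fox"
bigrams = word_ngrams(sentence, 2)
trigrams = word_ngrams(sentence, 3)
print(bigrams)   # ['the quick', 'quick brown', 'brown fox']
print(trigrams)  # ['the quick brown', 'quick brown fox']
```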

By default, CountVectorizer considers only unigrams (single words). However, including bigrams (pairs of words), trigrams (triplets of words), or larger n-grams can enrich the model by capturing more context and semantic information, potentially improving performance.

This is achieved by passing a tuple (min_n, max_n) to the ngram_range parameter, where min_n is the smallest n-gram size to include and max_n is the largest, both inclusive. For example, (1, 2) includes unigrams and bigrams, while (3, 3) includes only trigrams.

Let's now focus exclusively on trigrams that appear in two or more documents within our corpus:

These are the most commonly used parameters. To explore the others, refer to the scikit-learn documentation for CountVectorizer.
