Bag of Words | Basic Text Models
Introduction to NLP

Bag of Words

Understanding the BoW Model

As mentioned in the previous chapter, the bag of words (BoW) model represents documents as vectors where each dimension corresponds to a unique word. Each dimension represents either the presence of a word within the document (1 if present, 0 if absent) or its frequency (word count). BoW models can therefore be either binary or frequency-based.

Let's take a look at how the same sentence (document) is represented by each type:

[Figure: Bag of words classification]

As you can see, a binary model represents this document as the [1, 1, 1] vector, while frequency-based models represent it as [2, 1, 2], taking word frequency into account.
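The difference between the two vectors can be sketched in a few lines of plain Python. The sentence below is a hypothetical one, chosen only because its word counts happen to give [2, 1, 2]; it is not necessarily the sentence used in the original illustration, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

# Hypothetical sentence (an assumption for illustration only).
document = "the cat chased the cat"

# Naive tokenization: lowercase and split on whitespace.
tokens = document.lower().split()

# One dimension per unique word, in alphabetical order.
vocabulary = sorted(set(tokens))
counts = Counter(tokens)

# Frequency-based vector: how many times each word occurs.
frequency_vector = [counts[word] for word in vocabulary]
# Binary vector: 1 if the word occurs at all, 0 otherwise.
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(vocabulary)        # ['cat', 'chased', 'the']
print(binary_vector)     # [1, 1, 1]
print(frequency_vector)  # [2, 1, 2]
```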

BoW Implementation

Let's now look at the BoW model implementation in Python. Implementing the BoW model is straightforward with the scikit-learn (sklearn) library and its CountVectorizer class.

Without further ado, let's proceed with an example of a binary bag of words:

Each line of code below is followed by its description:
from sklearn.feature_extraction.text import CountVectorizer

This line imports the CountVectorizer class from sklearn.

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

These lines define a list named corpus that contains three string elements. Each string is a text document.

vectorizer = CountVectorizer(binary=True)

This line creates an instance of the CountVectorizer class. The binary=True parameter makes the model record only whether a word is present: every non-zero count is set to 1.

bow_matrix = vectorizer.fit_transform(corpus)

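The fit_transform() method first fits the model to the data, learning the vocabulary of the corpus, and then transforms the text documents into a sparse bag of words matrix. With binary=True, each entry is 1 if the token occurs in the document and 0 otherwise; without it, each entry is the token count.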

print(bow_matrix.toarray())

This line converts the sparse matrix bow_matrix into a dense array (numpy.ndarray) using the .toarray() method and prints it.

Each row in the matrix corresponds to a document, and each column to a token (word). Because fit_transform() returns a sparse matrix, we convert it into a dense 2D array with the .toarray() method in order to display it.
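Putting the lines described above together gives the following complete script. The output shown in the comments assumes CountVectorizer's default settings, which lowercase the text, keep tokens of two or more characters, and order the columns alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# binary=True: record only presence (1) or absence (0) of each word.
vectorizer = CountVectorizer(binary=True)
# Learn the vocabulary and build the sparse BoW matrix in one step.
bow_matrix = vectorizer.fit_transform(corpus)

# Convert to a dense array only for display.
dense = bow_matrix.toarray()
print(dense)
# [[0 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1 0 0]
#  [1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1]
#  [0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0]]
```

Note that 'global' appears twice in the first document, yet its entry is still 1, because the binary model ignores repetition.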

A sparse matrix is a matrix in which most of the elements are zero. BoW matrices are typically sparse because each document contains only a small part of the whole vocabulary. Sparse storage formats keep only the non-zero elements and their positions, which saves memory and computational resources when the data contains many zeros.
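As a rough illustration, the sketch below inspects the matrix returned by fit_transform() without converting it to a dense array. It relies on scipy.sparse.issparse and on the shape and nnz attributes of SciPy sparse matrices; the variable names are chosen here for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

bow_matrix = CountVectorizer().fit_transform(corpus)

n_docs, n_words = bow_matrix.shape
stored = bow_matrix.nnz  # number of explicitly stored (non-zero) entries

print(issparse(bow_matrix))  # True
print(bow_matrix.shape)      # (3, 18)
print(stored)                # 21
print(f"{stored / (n_docs * n_words):.0%} of the entries are non-zero")
```

With only three short documents the matrix is not very sparse yet. On a realistic corpus the vocabulary grows to many thousands of words while each document still uses only a few of them, so the share of non-zero entries becomes very small.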

To create a frequency-based bag of words model, simply remove the binary=True parameter, since its default value is False.
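A minimal sketch of the frequency-based variant, using the same corpus and CountVectorizer's default settings:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# No binary=True: entries are word counts (binary defaults to False).
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

dense = bow_matrix.toarray()
print(dense)
# [[0 0 1 1 0 0 1 0 2 1 0 1 1 0 0 1 0 0]
#  [1 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1]
#  [0 0 0 0 1 0 0 2 0 0 1 0 0 1 1 0 0 0]]
```

The only entries that differ from the binary matrix are the 2s: 'global' occurs twice in the first document and 'environmental' occurs twice in the third.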

Converting the Matrix to a DataFrame

It can be quite convenient to convert the resulting bag of words matrix into a pandas DataFrame. The CountVectorizer instance offers the get_feature_names_out() method, which returns an array of the unique words (feature names) used in the model. These feature names can be set as the columns of the resulting DataFrame.

With this representation, we can easily access not only the vector for a particular document but also the vector for a particular word.

Since each unique word corresponds to a column, accessing a word vector is as simple as accessing a column in the DataFrame by specifying the word (for example, 'global'). Applying the values attribute to that column returns a NumPy array instead of a Series.


Given a BoW matrix, what do different components of this matrix represent?

Rows:
Columns:

A particular element of the matrix:
