Introduction to NLP: Basic Text Models

Key Types of Vector Space Models

Vector space models can be broadly classified based on the nature of the representation they provide, each with unique characteristics and use cases. Let's now discuss the key concepts behind these models, deferring their implementation to later chapters.

Bag of Words

Bag of Words (BoW) is a vector space model which represents documents as vectors where each dimension corresponds to a unique word. It can be binary (indicating word presence) or frequency-based (indicating word count).

Here is an example of a frequency-based BoW:

[Image: frequency-based Bag of Words example]

As you can see, each document is represented by a vector, with each dimension corresponding to the frequency of a specific word within that document. In the case of a binary bag-of-words model, each vector would contain only 0 or 1 for each word, indicating its absence or presence, respectively.

Before using BoW or similar text models, each document should be preprocessed so that every token is a lowercase word.
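
To make this concrete, here is a minimal sketch of both BoW variants using scikit-learn's CountVectorizer; the two documents are made up for illustration, and the actual implementation will be covered in a later chapter.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (not from the course dataset)
documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
]

# Frequency-based BoW: each dimension counts how often a word occurs.
# CountVectorizer lowercases the text by default, matching the
# preprocessing note above.
count_vectorizer = CountVectorizer()
freq_bow = count_vectorizer.fit_transform(documents)
print(count_vectorizer.get_feature_names_out())
print(freq_bow.toarray())

# Binary BoW: each dimension only marks presence (1) or absence (0).
binary_vectorizer = CountVectorizer(binary=True)
binary_bow = binary_vectorizer.fit_transform(documents)
print(binary_bow.toarray())
```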

TF-IDF

The TF-IDF (Term Frequency-Inverse Document Frequency) model extends the Bag of Words (BoW) approach by adjusting word frequencies based on their occurrence across all documents. It emphasizes words that are unique to a document, thereby providing more specific insights into the document's content.

This is achieved by combining the term frequency (the number of times a word appears in a document) with the inverse document frequency (a measure of how common or rare a word is across the entire dataset).

Let's modify our previous example with this model:

[Image: TF-IDF model example]

In one of the upcoming chapters, we will learn how to calculate the TF-IDF value for each word. For now, it's important to note that the resulting TF-IDF vectors show more variation across documents, offering deeper insight into each document's content.
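
As a rough sketch of what this looks like in practice, scikit-learn's TfidfVectorizer combines term frequency and inverse document frequency internally; the documents below are the same illustrative examples as before, and we will go through the actual calculation later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents (not from the course dataset)
documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print(tfidf_vectorizer.get_feature_names_out())
# Words shared by both documents (e.g. "the", "cat") receive lower weights
# than words unique to one document (e.g. "mat", "dog").
print(tfidf_matrix.toarray().round(2))
```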

Word Embeddings and Document Embeddings

We have already mentioned word embeddings in the previous chapter. Essentially, this model maps individual words to dense vectors in a low-dimensional, continuous space. These vectors capture semantic similarities, although their individual dimensions are not directly interpretable.

Document embeddings, on the other hand, generate dense vectors representing whole documents, capturing the overall semantic meaning.

We determine the dimensionality (size) of these embeddings based on the specific requirements of our project and the computational resources available. This decision is crucial for balancing the embeddings' ability to capture semantic nuances against the efficiency of our models.

Let's take a look at an example with the word embeddings for the words "cat", "kitten", "dog", and "house":

[Image: word embeddings example]

We have chosen the size of the embeddings to be 6. Although the numerical values are arbitrary, they illustrate how embeddings can reflect the similarities among words.
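
The snippet below is a small sketch with made-up 6-dimensional vectors (the numbers are arbitrary, just like in the example above). It uses cosine similarity to show how related words end up close together, and averages word vectors to hint at a simple document embedding.

```python
import numpy as np

# Made-up 6-dimensional embeddings for illustration only
embeddings = {
    "cat":    np.array([0.8, 0.7, 0.1, 0.2, 0.9, 0.3]),
    "kitten": np.array([0.7, 0.8, 0.2, 0.1, 0.8, 0.4]),
    "dog":    np.array([0.6, 0.3, 0.8, 0.7, 0.2, 0.1]),
    "house":  np.array([0.1, 0.1, 0.2, 0.9, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    # Cosine similarity: values close to 1 mean very similar directions
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["house"]))   # low

# A simple document embedding can be sketched by averaging word vectors
doc_embedding = np.mean([embeddings["cat"], embeddings["dog"]], axis=0)
print(doc_embedding)
```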

In a real-world scenario, these embeddings would be derived from training a model on a text corpus, allowing it to 'learn' the nuanced relationships between words based on actual language use. We will accomplish this in one of the upcoming chapters, so stay tuned!

A further advancement in dense representations, contextual embeddings (generated by models like BERT and GPT), consider the context in which a word appears to generate its vector. This means the same word can have different embeddings based on its usage in different sentences, providing a nuanced understanding of language. This topic is too advanced for our course, however, so let's defer it for later.

Order the models by their complexity, from simplest to most complex.
