Introduction to NLP: Basic Text Models

Key Types of Vector Space Models

Vector space models can be broadly classified based on the nature of the representation they provide, each with unique characteristics and use cases. Let's now discuss the key concepts behind these models, deferring their implementation to later chapters.

Bag of Words

Bag of Words (BoW) is a vector space model which represents documents as vectors where each dimension corresponds to a unique word. It can be binary (indicating word presence) or frequency-based (indicating word count).

Here is an example of a frequency-based BoW:

[Image: frequency-based Bag of Words example]

As you can see, each document is represented by a vector, with each dimension corresponding to the frequency of a specific word within that document. In the case of a binary bag-of-words model, each vector would contain only 0 or 1 for each word, indicating its absence or presence, respectively.

Before using BoW or similar text models, each document should be preprocessed so that every token is a lowercase word.
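
To make this concrete, here is a minimal sketch of both BoW variants using scikit-learn's CountVectorizer; the two documents are made up for illustration, and the actual implementation will be covered in a later chapter.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (not from the course dataset)
documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
]

# Frequency-based BoW: each dimension counts how often a word occurs.
# CountVectorizer lowercases the text by default, matching the
# preprocessing note above.
count_vectorizer = CountVectorizer()
freq_bow = count_vectorizer.fit_transform(documents)
print(count_vectorizer.get_feature_names_out())
print(freq_bow.toarray())

# Binary BoW: each dimension only marks presence (1) or absence (0).
binary_vectorizer = CountVectorizer(binary=True)
binary_bow = binary_vectorizer.fit_transform(documents)
print(binary_bow.toarray())
```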

TF-IDF

The TF-IDF (Term Frequency-Inverse Document Frequency) model extends the Bag of Words (BoW) approach by adjusting word frequencies based on their occurrence across all documents. It emphasizes words that are unique to a document, thereby providing more specific insights into the document's content.

This is achieved by combining the term frequency (the number of times a word appears in a document) with the inverse document frequency (a measure of how common or rare a word is across the entire dataset).

Let's modify our previous example with this model:

[Image: TF-IDF model example]

In one of the upcoming chapters, we will learn how to calculate the TF-IDF value for each word. For now, it's important to note that the resulting TF-IDF vectors show more variation across documents, offering deeper insight into each document's content.
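
As a rough sketch of what this looks like in practice, scikit-learn's TfidfVectorizer combines term frequency and inverse document frequency internally; the documents below are the same illustrative examples as before, and we will go through the actual calculation later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents (not from the course dataset)
documents = [
    "The cat sat on the mat",
    "The dog chased the cat",
]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print(tfidf_vectorizer.get_feature_names_out())
# Words shared by both documents (e.g. "the", "cat") receive lower weights
# than words unique to one document (e.g. "mat", "dog").
print(tfidf_matrix.toarray().round(2))
```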

Word Embeddings and Document Embeddings

We have already mentioned word embeddings in the previous chapter. Essentially, this model maps individual words to dense vectors in a low-dimensional, continuous space. These vectors capture semantic similarities, although their individual dimensions are not directly interpretable.

Document embeddings, on the other hand, generate dense vectors representing whole documents, capturing the overall semantic meaning.

We determine the dimensionality (size) of these embeddings based on the specific requirements of our project and the computational resources available. This decision is crucial for balancing the embeddings' ability to capture semantic nuances against the efficiency of our models.

Let's take a look at an example with the word embeddings for the words "cat", "kitten", "dog", and "house":

[Image: word embeddings example]

We have chosen the size of the embeddings to be 6. Although the numerical values are arbitrary, they illustrate how embeddings can reflect the similarities among words.
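
The snippet below is a small sketch with made-up 6-dimensional vectors (the numbers are arbitrary, just like in the example above). It uses cosine similarity to show how related words end up close together, and averages word vectors to hint at a simple document embedding.

```python
import numpy as np

# Made-up 6-dimensional embeddings for illustration only
embeddings = {
    "cat":    np.array([0.8, 0.7, 0.1, 0.2, 0.9, 0.3]),
    "kitten": np.array([0.7, 0.8, 0.2, 0.1, 0.8, 0.4]),
    "dog":    np.array([0.6, 0.3, 0.8, 0.7, 0.2, 0.1]),
    "house":  np.array([0.1, 0.1, 0.2, 0.9, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    # Cosine similarity: values close to 1 mean very similar directions
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["house"]))   # low

# A simple document embedding can be sketched by averaging word vectors
doc_embedding = np.mean([embeddings["cat"], embeddings["dog"]], axis=0)
print(doc_embedding)
```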

In a real-world scenario, these embeddings would be derived from training a model on a text corpus, allowing it to 'learn' the nuanced relationships between words based on actual language use. We will accomplish this in one of the upcoming chapters, so stay tuned!

A further advancement in dense representations, contextual embeddings (generated by models like BERT and GPT), consider the context in which a word appears to generate its vector. This means the same word can have different embeddings based on its usage in different sentences, providing a nuanced understanding of language. This topic is too advanced for our course, however, so let's defer it for later.

Order the models by their complexity, from simplest to most complex.
