
CBoW and Skip-gram Models (Optional)

To fully understand what's going on in this chapter, we suggest taking the Introduction to Neural Networks course.

Both the CBoW and Skip-gram architectures learn word embeddings through a neural network consisting of the following layers:

  • an input layer;
  • a single hidden layer;
  • an output layer.

The weight matrix between the input and hidden layers, denoted as W1 or E, serves as the embeddings matrix. Each row of this matrix represents an embedding vector for a corresponding word, with the i-th row matching the i-th word in the vocabulary.

This matrix contains V (vocabulary size) embeddings, each of size N, a dimension we specify. Multiplying the transpose of this matrix (NxV matrix) by a one-hot encoded vector (Vx1 vector) retrieves the embedding for a specific word, producing an Nx1 vector.

Embeddings matrix
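To make this concrete, here is a minimal NumPy sketch; the vocabulary size, embedding dimension, and word index are made up purely for illustration. It shows that multiplying the transposed embeddings matrix by a one-hot vector simply selects the corresponding row:

```python
import numpy as np

V, N = 10, 4                          # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))          # embeddings matrix E: one N-dimensional row per word

i = 3                                 # index of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[i] = 1.0

embedding = W1.T @ one_hot            # (NxV) @ (Vx1) -> Nx1 vector
assert np.allclose(embedding, W1[i])  # identical to simply reading row i
```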

The second weight matrix, between the hidden and output layers, is sized NxV. Multiplying the transpose of this matrix (VxN matrix) by the hidden layer's Nx1 vector results in a Vx1 vector.
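A similarly small sketch, again with toy sizes, shows this second multiplication:

```python
import numpy as np

V, N = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W2 = rng.normal(size=(N, V))      # second weight matrix, sized NxV
hidden = rng.normal(size=N)       # the hidden layer's Nx1 vector

scores = W2.T @ hidden            # (VxN) @ (Nx1) -> Vx1, one raw score per vocabulary word
```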

CBoW

Now, consider an example using a CBoW model:

CBoW

First, the transpose of the embeddings matrix is multiplied by the one-hot vectors of the context words to produce their embeddings. These embeddings are then summed or averaged, depending on the implementation, to form a single vector. This vector is multiplied by the W2 matrix, resulting in a Vx1 vector.

Finally, this vector passes through the softmax activation function, converting it into a probability distribution, where each element represents the probability of a vocabulary word being the target word.

Softmax
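A minimal sketch of this CBoW forward pass, using toy matrices and hypothetical word indices, might look like this:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

V, N = 10, 4                             # toy sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))             # embeddings matrix
W2 = rng.normal(size=(N, V))             # hidden-to-output weights

context_ids = [1, 2, 4, 5]               # hypothetical indices of the context words
h = W1[context_ids].mean(axis=0)         # average of the context embeddings (summing also works)
probs = softmax(W2.T @ h)                # Vx1 probability distribution over the vocabulary
```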

Afterward, the loss is calculated, and both weight matrices are updated to minimize this loss. Ideally, we want the probability of the target word to be close to 1, while the probabilities for all other words approach zero. This process is repeated for every combination of a target word and its context words.

Once all combinations have been processed, an epoch is completed. Typically, the neural network is trained over several epochs to ensure accurate learning. Finally, the rows of the resulting embeddings matrix can be used as our word embeddings. Each row corresponds to the vector representation of a specific word in the vocabulary, effectively capturing its semantic properties within the trained model.
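As a rough illustration of such a training loop, the sketch below performs plain softmax-plus-cross-entropy gradient updates on toy data. The learning rate, number of epochs, and (context, target) pairs are all made up, and real implementations differ in many details:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, N, lr = 10, 4, 0.05                      # toy sizes and a learning rate
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))                # embeddings matrix
W2 = rng.normal(size=(N, V))                # hidden-to-output weights

# hypothetical (context, target) pairs; a real corpus yields many of these per epoch
pairs = [([1, 2, 4, 5], 3), ([0, 2, 3, 5], 4)]

for epoch in range(10):
    for context_ids, target_id in pairs:
        h = W1[context_ids].mean(axis=0)    # forward pass
        probs = softmax(W2.T @ h)
        loss = -np.log(probs[target_id])    # cross-entropy with the true target word

        dscores = probs.copy()
        dscores[target_id] -= 1.0           # gradient w.r.t. the output scores
        dh = W2 @ dscores                   # gradient flowing back to the hidden layer

        W2 -= lr * np.outer(h, dscores)     # update both weight matrices
        W1[context_ids] -= lr * dh / len(context_ids)
```

After training, the rows of W1 are the learned word embeddings.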

Skip-gram

Let's now take a look at a skip-gram model:

Skip-gram

As you can see, the process is largely similar to CBoW. It begins by retrieving the embedding of the target word, which serves as the hidden layer. Multiplying this embedding by the output layer's weight matrix then produces a Vx1 vector, which the softmax activation function transforms into a vector of probabilities.

Although this resulting vector of probabilities is the same for all context words associated with a single target word during a single training step, the loss for each context word is calculated individually.

The losses for all context words are then summed, and the weight matrices are updated at each iteration to minimize this total loss. Once the specified number of epochs is completed, the embeddings matrix can be used to obtain the word embeddings.
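Reusing the same toy setup, a skip-gram step could be sketched as follows, with one forward pass from the target word and a loss summed over its (hypothetical) context words:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, N = 10, 4                                 # toy sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))                 # embeddings matrix
W2 = rng.normal(size=(N, V))                 # hidden-to-output weights

target_id, context_ids = 3, [1, 2, 4, 5]     # hypothetical word indices

h = W1[target_id]                            # embedding of the target word
probs = softmax(W2.T @ h)                    # one distribution, reused for every context word
losses = [-np.log(probs[c]) for c in context_ids]   # individual loss per context word
total_loss = sum(losses)                     # summed before the weights are updated
```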

In practice, especially with large vocabularies, the softmax function can be computationally intensive. Therefore, approximations like negative sampling are often used to make the computation more efficient. These approximations stand in for the softmax and significantly speed up training, while serving the same purpose: modeling the probability distribution over words at the output layer. This topic, however, is a bit too advanced for our course.
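For reference, this is roughly how a Word2Vec model with negative sampling is trained in practice using the gensim library (assuming gensim 4.x; the corpus and hyperparameter values here are purely illustrative):

```python
from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects skip-gram (sg=0 would be CBoW); negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

vector = model.wv["cat"]                     # the learned embedding for "cat"
print(model.wv.most_similar("cat", topn=3))  # words with the most similar embeddings
```

Here vector_size corresponds to the embedding dimension N, window controls how many neighboring words count as context, and negative=5 draws five negative samples per positive pair instead of computing a full softmax.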
