
CBoW and Skip-gram Models (Optional)

To fully understand what's going on in this chapter, we suggest taking the Introduction to Neural Networks course.

Both the CBoW and Skip-gram architectures learn word embeddings through a neural network consisting of the following layers:

  • an input layer;
  • a single hidden layer;
  • an output layer.

The weight matrix between the input and hidden layers, denoted as W1 or E, serves as the embeddings matrix. Each row of this matrix represents an embedding vector for a corresponding word, with the i-th row matching the i-th word in the vocabulary.

This matrix contains V (vocabulary size) embeddings, each of size N, a dimension we specify. Multiplying the transpose of this matrix (NxV matrix) by a one-hot encoded vector (Vx1 vector) retrieves the embedding for a specific word, producing an Nx1 vector.

Embeddings matrix
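To make this concrete, here is a minimal NumPy sketch; the vocabulary size, embedding dimension, and word index are made up purely for illustration. It shows that multiplying the transposed embeddings matrix by a one-hot vector simply selects the corresponding row:

```python
import numpy as np

V, N = 10, 4                          # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))          # embeddings matrix E: one N-dimensional row per word

i = 3                                 # index of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[i] = 1.0

embedding = W1.T @ one_hot            # (NxV) @ (Vx1) -> Nx1 vector
assert np.allclose(embedding, W1[i])  # identical to simply reading row i
```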

The second weight matrix, between the hidden and output layers, is sized NxV. Multiplying the transpose of this matrix (VxN matrix) by the hidden layer's Nx1 vector results in a Vx1 vector.
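A similarly small sketch, again with toy sizes, shows this second multiplication:

```python
import numpy as np

V, N = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W2 = rng.normal(size=(N, V))      # second weight matrix, sized NxV
hidden = rng.normal(size=N)       # the hidden layer's Nx1 vector

scores = W2.T @ hidden            # (VxN) @ (Nx1) -> Vx1, one raw score per vocabulary word
```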

CBoW

Now, consider an example using a CBoW model:

CBoW

First, the transpose of the embeddings matrix is multiplied by the one-hot vectors of the context words to produce their embeddings. These embeddings are then summed or averaged, depending on the implementation, to form a single vector. This vector is multiplied by the W2 matrix, resulting in a Vx1 vector.

Finally, this vector passes through the softmax activation function, converting it into a probability distribution, where each element represents the probability of a vocabulary word being the target word.

Softmax
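A minimal sketch of this CBoW forward pass, using toy matrices and hypothetical word indices, might look like this:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

V, N = 10, 4                             # toy sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))             # embeddings matrix
W2 = rng.normal(size=(N, V))             # hidden-to-output weights

context_ids = [1, 2, 4, 5]               # hypothetical indices of the context words
h = W1[context_ids].mean(axis=0)         # average of the context embeddings (summing also works)
probs = softmax(W2.T @ h)                # Vx1 probability distribution over the vocabulary
```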

Afterward, the loss is calculated, and both weight matrices are updated to minimize this loss. Ideally, we want the probability of the target word to be close to 1, while the probabilities for all other words approach zero. This process is repeated for every combination of a target word and its context words.

Once all combinations have been processed, an epoch is completed. Typically, the neural network is trained over several epochs to ensure accurate learning. Finally, the rows of the resulting embeddings matrix can be used as our word embeddings. Each row corresponds to the vector representation of a specific word in the vocabulary, effectively capturing its semantic properties within the trained model.
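As a rough illustration of such a training loop, the sketch below performs plain softmax-plus-cross-entropy gradient updates on toy data. The learning rate, number of epochs, and (context, target) pairs are all made up, and real implementations differ in many details:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, N, lr = 10, 4, 0.05                      # toy sizes and a learning rate
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))                # embeddings matrix
W2 = rng.normal(size=(N, V))                # hidden-to-output weights

# hypothetical (context, target) pairs; a real corpus yields many of these per epoch
pairs = [([1, 2, 4, 5], 3), ([0, 2, 3, 5], 4)]

for epoch in range(10):
    for context_ids, target_id in pairs:
        h = W1[context_ids].mean(axis=0)    # forward pass
        probs = softmax(W2.T @ h)
        loss = -np.log(probs[target_id])    # cross-entropy with the true target word

        dscores = probs.copy()
        dscores[target_id] -= 1.0           # gradient w.r.t. the output scores
        dh = W2 @ dscores                   # gradient flowing back to the hidden layer

        W2 -= lr * np.outer(h, dscores)     # update both weight matrices
        W1[context_ids] -= lr * dh / len(context_ids)
```

After training, the rows of W1 are the learned word embeddings.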

Skip-gram

Let's now take a look at a skip-gram model:

Skip-gram

As you can see, the process is largely similar to CBoW. It begins by retrieving the embedding of the target word, which serves as the hidden layer. Multiplying this embedding by the output layer's weight matrix then produces a Vx1 vector, which the softmax activation function transforms into a vector of probabilities.

Although this resulting vector of probabilities is the same for all context words associated with a single target word during a single training step, the loss for each context word is calculated individually.

The losses for all context words are then summed, and the weight matrices are updated at each iteration to minimize this total loss. Once the specified number of epochs is completed, the embeddings matrix can be used to obtain the word embeddings.
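Reusing the same toy setup, a skip-gram step could be sketched as follows, with one forward pass from the target word and a loss summed over its (hypothetical) context words:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

V, N = 10, 4                                 # toy sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(V, N))                 # embeddings matrix
W2 = rng.normal(size=(N, V))                 # hidden-to-output weights

target_id, context_ids = 3, [1, 2, 4, 5]     # hypothetical word indices

h = W1[target_id]                            # embedding of the target word
probs = softmax(W2.T @ h)                    # one distribution, reused for every context word
losses = [-np.log(probs[c]) for c in context_ids]   # individual loss per context word
total_loss = sum(losses)                     # summed before the weights are updated
```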

In practice, especially with large vocabularies, the softmax function can be computationally intensive. Therefore, approximations like negative sampling are often used to make the computation more efficient. These approximations stand in for the softmax and significantly speed up training, while serving the same purpose: modeling the probability distribution over words at the output layer. This topic, however, is a bit too advanced for our course.
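For reference, this is roughly how a Word2Vec model with negative sampling is trained in practice using the gensim library (assuming gensim 4.x; the corpus and hyperparameter values here are purely illustrative):

```python
from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects skip-gram (sg=0 would be CBoW); negative=5 enables negative sampling
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

vector = model.wv["cat"]                     # the learned embedding for "cat"
print(model.wv.most_similar("cat", topn=3))  # words with the most similar embeddings
```

Here vector_size corresponds to the embedding dimension N, window controls how many neighboring words count as context, and negative=5 draws five negative samples per positive pair instead of computing a full softmax.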
