
Implementing Word2Vec

Having understood how Word2Vec works, let's proceed to implement it using Python. The Gensim library, a robust open-source tool for natural language processing, provides a straightforward implementation through its Word2Vec class in gensim.models.

Preparing the Data

Word2Vec requires the text data to be tokenized, i.e., broken down into a list of lists where each inner list contains words from a specific sentence. For this example, we will use the novel Emma by English author Jane Austen as our corpus. We'll load a CSV file containing preprocessed sentences and then split each sentence into words:
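
Below is a minimal sketch of this step; the CSV file name is an assumed placeholder:

```python
import pandas as pd

# Load the preprocessed sentences; 'emma.csv' is a placeholder file name
emma_df = pd.read_csv('emma.csv')

# Split each sentence on whitespace to get a list of tokens per sentence
sentences = emma_df['Sentence'].str.split()
```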

The line emma_df['Sentence'].str.split() applies the .split() method to each sentence in the 'Sentence' column, producing a list of words for each sentence. Since the sentences were already preprocessed, with words separated by whitespace, the .split() method is sufficient for this tokenization.

Training the Word2Vec Model

Now, let's focus on training the Word2Vec model using the tokenized data. The Word2Vec class offers a variety of parameters for customization; however, you will most commonly deal with the following:

  • vector_size (100 by default): the dimensionality or size of the word embeddings;
  • window (5 by default): the context window size;
  • min_count (5 by default): words occurring fewer than this number will be ignored;
  • sg (0 by default): the model architecture to use (1 for Skip-Gram, 0 for CBOW).

Speaking of the model architectures, CBOW is suited for larger datasets and scenarios where computational efficiency is crucial. Skip-gram, on the other hand, is preferable for tasks that require a detailed understanding of word contexts, and is particularly effective on smaller datasets or when dealing with rare words.

If you have read the previous chapter, you will recall that in CBOW, the input context words are either averaged or summed. The cbow_mean parameter specifies whether to use the sum (0), or the mean (1), which is the default.

Let's now take a look at an example:
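
The following is a sketch of such a call, assuming the tokenized sentences from the previous step are stored in sentences:

```python
from gensim.models import Word2Vec

# Train a CBOW (sg=0) Word2Vec model on the tokenized sentences
model = Word2Vec(
    sentences,        # tokenized sentences produced earlier
    vector_size=200,  # dimensionality of the word embeddings
    window=5,         # context window size
    min_count=1,      # include every word, even those occurring only once
    sg=0              # 0 = CBOW, 1 = Skip-gram
)
```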

Here, we set the embedding size to 200 and the context window size to 5, and included all the words by setting min_count=1. By setting sg=0, we chose the CBOW model.

Selecting the embedding size and context window size, the key hyperparameters of the model, involves balancing established guidelines with practical experimentation, starting from the default values. Typically, embedding sizes range from 50 to 300 dimensions, chosen based on vocabulary complexity and the depth of semantic relationships needed. Larger sizes capture richer semantics but increase computational demands and risk overfitting on smaller datasets. The context window size, in turn, influences what kind of relationships are captured: smaller windows are better for syntactic details, while larger windows help capture broader semantic contexts. The ideal window size depends on the corpus and use case, with narrative texts generally benefiting from larger windows and technical texts from smaller ones.

Finding Similar Words

With each word represented as a vector, we can now compute the similarity of two words as the similarity of their vectors. How can we measure it? We could calculate the distance between these vectors; however, this approach has some drawbacks.

When two word vectors point in the same direction, they can still represent words with similar meanings even if one is longer (perhaps indicating more occurrences in the training data), so instead we consider the angle between the vectors. This matters because in many NLP tasks, especially those involving word embeddings, the direction of a vector (its orientation in the vector space) carries more semantic meaning than its magnitude.

In practice, however, using the angle directly as a similarity metric isn't very convenient, so the cosine of the angle is used instead, which ranges from -1 to 1. This is known as cosine similarity. Here is an illustration:

[Figure: angles between vectors and their cosine similarity]

The higher the cosine similarity, the more similar the two vectors are, and vice versa. For example, if two word vectors have a cosine similarity close to 1 (the angle close to 0 degrees), it indicates that they are closely related or similar in context within the vector space.
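
As a quick illustration of the idea (not tied to the course code), cosine similarity can be computed directly from the dot product and the vector norms, for example with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = dot(a, b) / (norm(a) * norm(b))
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two vectors pointing in the same direction but with different lengths
u = np.array([1.0, 2.0])
v = np.array([2.0, 4.0])
print(cosine_similarity(u, v))  # 1.0: same direction, maximal similarity
```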

Let's now find the top-5 most similar words to the word "man" using cosine similarity:
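
A sketch of this lookup, assuming the trained model from above is stored in model:

```python
# Find the 5 words most similar to "man" by cosine similarity
similar_words = model.wv.most_similar('man', topn=5)
print(similar_words)  # list of (word, cosine similarity) tuples
```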

model.wv accesses the word vectors of the trained model, while the .most_similar() method finds the words whose embeddings are closest to the embedding of the specified word, based on cosine similarity. The topn parameter determines how many of the most similar words to return.
