
Implementing Word2Vec

Having understood how Word2Vec works, let's proceed to implement it using Python. The Gensim library, a robust open-source tool for natural language processing, provides a straightforward implementation through its Word2Vec class in gensim.models.

Preparing the Data

Word2Vec requires the text data to be tokenized, i.e., broken down into a list of lists where each inner list contains words from a specific sentence. For this example, we will use the novel Emma by English author Jane Austen as our corpus. We'll load a CSV file containing preprocessed sentences and then split each sentence into words:
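
Below is a minimal sketch of this step; the CSV file name is an assumed placeholder:

```python
import pandas as pd

# Load the preprocessed sentences; 'emma.csv' is a placeholder file name
emma_df = pd.read_csv('emma.csv')

# Split each sentence on whitespace to get a list of tokens per sentence
sentences = emma_df['Sentence'].str.split()
```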

The line emma_df['Sentence'].str.split() applies the .split() method to each sentence in the 'Sentence' column, producing a list of words for each sentence. Since the sentences were already preprocessed, with words separated by whitespace, the .split() method is sufficient for this tokenization.

Training the Word2Vec Model

Now, let's focus on training the Word2Vec model using the tokenized data. The Word2Vec class offers a variety of parameters for customization; however, you will most commonly deal with the following:

  • vector_size (100 by default): the dimensionality or size of the word embeddings;
  • window (5 by default): the context window size;
  • min_count (5 by default): words occurring fewer than this number will be ignored;
  • sg (0 by default): the model architecture to use (1 for Skip-Gram, 0 for CBOW).

Speaking of the model architectures, CBOW is suited for larger datasets and scenarios where computational efficiency is crucial. Skip-gram, on the other hand, is preferable for tasks that require a detailed understanding of word contexts, and is particularly effective on smaller datasets or when dealing with rare words.

If you have read the previous chapter, you will recall that in CBOW, the input context words are either averaged or summed. The cbow_mean parameter specifies whether to use the sum (0), or the mean (1), which is the default.

Let's now take a look at an example:
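
The following is a sketch of such a call, assuming the tokenized sentences from the previous step are stored in sentences:

```python
from gensim.models import Word2Vec

# Train a CBOW (sg=0) Word2Vec model on the tokenized sentences
model = Word2Vec(
    sentences,        # tokenized sentences produced earlier
    vector_size=200,  # dimensionality of the word embeddings
    window=5,         # context window size
    min_count=1,      # include every word, even those occurring only once
    sg=0              # 0 = CBOW, 1 = Skip-gram
)
```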

Here, we set the embedding size to 200 and the context window size to 5, and included all the words by setting min_count=1. By setting sg=0, we chose the CBOW model.

Selecting the embedding size and context window size, the key hyperparameters of the model, involves balancing established guidelines with practical experimentation, starting from the default values. Typically, embedding sizes range from 50 to 300 dimensions, chosen based on vocabulary complexity and the depth of semantic relationships needed. Larger sizes capture richer semantics but increase computational demands and risk overfitting on smaller datasets. The context window size, in turn, influences what kind of relationships are captured: smaller windows are better for syntactic details, while larger windows help capture broader semantic contexts. The ideal window size depends on the corpus and use case, with narrative texts generally benefiting from larger windows and technical texts from smaller ones.

Finding Similar Words

With each word represented as a vector, we can now compute the similarity of two words as the similarity of their vectors. How can we measure it? We could calculate the distance between these vectors; however, this approach has some drawbacks.

When two word vectors point in the same direction, they can still represent words with similar meanings even if one is longer (perhaps indicating more occurrences in the training data), so instead we consider the angle between the vectors. This matters because in many NLP tasks, especially those involving word embeddings, the direction of a vector (its orientation in the vector space) carries more semantic meaning than its magnitude.

In practice, however, using the angle directly as a similarity metric isn't very convenient, so the cosine of the angle is used instead, which ranges from -1 to 1. This is known as cosine similarity. Here is an illustration:

[Figure: angles between vectors and their cosine similarity]

The higher the cosine similarity, the more similar the two vectors are, and vice versa. For example, if two word vectors have a cosine similarity close to 1 (the angle close to 0 degrees), it indicates that they are closely related or similar in context within the vector space.
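
As a quick illustration of the idea (not tied to the course code), cosine similarity can be computed directly from the dot product and the vector norms, for example with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = dot(a, b) / (norm(a) * norm(b))
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two vectors pointing in the same direction but with different lengths
u = np.array([1.0, 2.0])
v = np.array([2.0, 4.0])
print(cosine_similarity(u, v))  # 1.0: same direction, maximal similarity
```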

Let's now find the top-5 most similar words to the word "man" using cosine similarity:
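
A sketch of this lookup, assuming the trained model from above is stored in model:

```python
# Find the 5 words most similar to "man" by cosine similarity
similar_words = model.wv.most_similar('man', topn=5)
print(similar_words)  # list of (word, cosine similarity) tuples
```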

model.wv accesses the word vectors of the trained model, while the .most_similar() method finds the words whose embeddings are closest to the embedding of the specified word, based on cosine similarity. The topn parameter determines how many of the most similar words to return.
