Word Embeddings Basics
Introduction to NLP


Understanding Word Embeddings

Traditional text representation models like Bag of Words and TF-IDF advanced natural language processing, but they have significant limitations. Because they treat each word independently of its context, they fail to capture semantic relationships between words, and they produce high-dimensional, sparse matrices that are computationally inefficient for large text corpora.

Drawbacks

Word embeddings address these issues by considering the context in which words appear, providing a more nuanced understanding of language.

Word embeddings are dense vector representations of words in a continuous vector space where semantically similar words are mapped to proximate points.
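To make "proximate points" concrete, here is a minimal sketch using toy 3-dimensional vectors (the words and numbers are invented for illustration, not taken from a trained model). Cosine similarity is the standard way to measure how close two embedding vectors are:

```python
import math

# Toy 3-dimensional embeddings (illustrative values, not from a trained model):
# semantically related words are given similar vectors.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```

With real embeddings the vectors have hundreds of dimensions, but the idea is the same: related words point in similar directions.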

Several models and techniques have been developed to generate effective word embeddings:

  • Word2Vec: This tool, created by Google researchers, transforms words into numerical vectors. It uses two methods: Continuous Bag of Words (CBoW), which predicts a word based on its context, and Skip-Gram, which does the opposite by predicting the surrounding context from a word.
  • GloVe: Developed by Stanford University, GloVe turns words into vectors using a different approach. It analyzes how often pairs of words occur together in the entire text corpus to learn about their relationships.
  • FastText: Created by Facebook AI Research, FastText improves upon Word2Vec by breaking down words into smaller parts called character n-grams. This allows the model to better handle morphologically rich languages and words not seen during training.
Models

In practice, Word2Vec and FastText are the most commonly used models for generating word embeddings. Since FastText is essentially an enhanced version of Word2Vec, we will focus exclusively on Word2Vec in this course.

How Does Word2Vec Work?

Word2Vec transforms words into vectors using a process that starts with one-hot encoding, where each word in a vocabulary is represented by a unique vector marked by a single '1' among zeros. Let's take a look at an example:
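The encoding itself is straightforward; here is a minimal sketch over a hypothetical five-word vocabulary:

```python
# One-hot encoding over a tiny (hypothetical) vocabulary.
vocabulary = ["cat", "dog", "sat", "on", "mat"]

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

print(one_hot("dog", vocabulary))  # [0, 1, 0, 0, 0]
```

Note that each vector is as long as the vocabulary, with exactly one non-zero entry.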

One-hot encoding

This vector serves as input to a neural network, which is designed to 'learn' the word embeddings. The network's architecture can follow one of two models: CBoW (Continuous Bag of Words), which predicts a target word based on the context provided by surrounding words, or Skip-Gram, which, conversely, predicts the surrounding context words based on the target word.
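In both architectures, the embeddings live in the network's hidden-layer weight matrix: one row per vocabulary word. Multiplying a one-hot input vector by that matrix simply selects the matching row, which is that word's embedding. A sketch with made-up numbers for a three-word vocabulary and two embedding dimensions:

```python
# Hidden-layer weight matrix of a hypothetical tiny network:
# one row per vocabulary word, one column per embedding dimension.
W = [
    [0.2, 0.9],  # row for word index 0
    [0.5, 0.1],  # row for word index 1
    [0.7, 0.7],  # row for word index 2
]

one_hot_vector = [0, 1, 0]  # one-hot vector for word index 1

# Multiplying the one-hot vector by W zeroes out every row except the
# selected one, so the result is exactly that word's embedding.
embedding = [sum(x * w for x, w in zip(one_hot_vector, column))
             for column in zip(*W)]

print(embedding)  # equals W[1], i.e. [0.5, 0.1]
```

This is why Word2Vec training amounts to learning the entries of this matrix: after training, the rows are the word embeddings.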

In both Word2Vec architectures, during each training iteration, the model is provided with a target word and the words surrounding it as the context represented as one-hot encoded vectors. The training dataset is thus effectively composed of these pairs or groups, where each target word is associated with its surrounding context words.

Every word in the vocabulary takes a turn being the target as the model iterates through the text using a sliding context window technique. This technique systematically moves across every word, ensuring comprehensive learning from all possible contexts within the corpus.

A context window is a fixed-size span of words around a target word that the model uses to learn the word's context. Specifically, it dictates how many words before and after the target word are considered during the training process.

Let's take a look at an example with a window size of 2 to make things clear:

Sliding window

A context window size of 2 means the model will include up to 2 words from both the left and the right of the target word, as long as those words are available within the sentence boundaries. As you can see, if there are fewer than 2 words on either side, the model will include as many words as are available.
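The sliding-window behavior described above, including the truncation at sentence boundaries, can be sketched in a few lines (the example sentence and helper function are illustrative, not part of any library):

```python
def context_pairs(tokens, window=2):
    """Pair each target word with its context words from a sliding window.
    Near sentence boundaries the window is truncated to the words available."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]   # up to `window` words before
        right = tokens[i + 1:i + 1 + window]  # up to `window` words after
        pairs.append((target, left + right))
    return pairs

sentence = "the cat sat on the mat".split()
for target, context in context_pairs(sentence, window=2):
    print(target, context)
# The first target "the" has no words to its left, so its context
# is truncated to the two words on its right: ['cat', 'sat'].
```

Each printed pair corresponds to one training example: in CBoW the context predicts the target, while in Skip-Gram the target predicts each context word.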


Section 4. Chapter 1