Overview of Vector Space Models

The Need for Numerical Representation

Unlike humans, computers do not inherently understand text. For us, textual data is rich, complex, and highly nuanced, carrying meanings interpreted through language, context, and cultural knowledge; for a computer, text is initially just a sequence of characters with no inherent meaning.

To overcome this challenge, we turn to mathematical and statistical models that can process and analyze patterns within the data. These models, however, require numerical input: they operate on vectors, matrices, and other mathematical structures, not on raw text.

Understanding Vector Space Models

Luckily, text representation models solve this problem. In this course, we will focus on one family of them: vector space models.

Vector space models (VSMs) are mathematical representations of text data where either documents or words are converted into vectors of identifiers in a multi-dimensional space.

The mathematical concept can be defined as follows. Assume we have a document D in the vector space of documents V.

To recap, a document is a separate piece of text within a corpus, for example, an email within a corpus of emails.

The number of dimensions (columns) of each document vector equals the total number of unique terms or words across all documents in the vector space. Therefore, the vector space can be denoted as:

V = {W1, W2, W3, ..., Wn}

where W1, W2, ..., Wn are the unique words collected from all documents. Essentially, this vector space represents the vocabulary.

Vocabulary is the set of unique words or terms that are identified from the entire corpus being analyzed.

Now, we can represent a document in the vector space as follows:

D = {WD1, WD2, WD3, ..., WDn}

where WDn denotes the weight of word n in document D. Let's take a look at an example with 2 documents and the unique terms (words) they contain:

[Figure: vector space example with two documents represented as vectors over the shared vocabulary]
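
To make the notation concrete, here is a minimal sketch in Python that builds the vocabulary and document vectors for a toy two-document corpus. The sentences are invented, and using raw word counts as the weights WDn is just one simple choice:

```python
# A toy corpus of two documents (the sentences are invented for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# The vocabulary: all unique words across the documents, i.e. V = {W1, ..., Wn}.
vocabulary = sorted({word for doc in docs for word in doc.split()})
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']

# Each document becomes a vector of weights, one per vocabulary word.
# Raw word counts serve as the weights here, which is one simple choice.
vectors = [[doc.split().count(word) for word in vocabulary] for doc in docs]
print(vectors)  # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```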

Using these vector representations, we could, for example, calculate a similarity score for these documents by measuring the angle between them (the cosine of the angle, to be more precise) to find out how semantically similar they are.
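
Here is a minimal sketch of that computation with NumPy, reusing the count vectors from the toy corpus above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Count vectors of the two toy documents from the sketch above.
d1 = [1, 0, 0, 1, 1, 1, 2]  # "the cat sat on the mat"
d2 = [0, 1, 1, 0, 1, 1, 2]  # "the dog sat on the log"

print(cosine_similarity(d1, d2))  # 0.75: the documents share most of their words
```

A cosine of 1 means the vectors point in the same direction (maximally similar), while a cosine of 0 means they share no terms at all.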

Words as Vectors

This concept, however, can be extended to individual word representations through the technique known as word embeddings. Word embeddings operate under a similar mathematical principle but focus on representing individual words as vectors rather than entire documents. The dimensions in these vectors capture latent semantic features that are not directly interpretable.

Here is an example with 2-dimensional embeddings for three words:

[Figure: 2-dimensional word embeddings for "woman", "queen", and "king"]

As you can see, the words "woman" and "queen", as well as "queen" and "king", are rather similar and close to each other, while "woman" and "king" are rather far apart, reflecting their semantic difference.
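
Below is a small sketch using hypothetical 2-dimensional coordinates for these three words. The numbers are invented purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions:

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, invented purely for illustration.
embeddings = {
    "woman": np.array([0.4, 0.9]),
    "queen": np.array([0.7, 0.8]),
    "king":  np.array([0.9, 0.5]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "woman"/"queen" and "queen"/"king" score higher than "woman"/"king".
for w1, w2 in [("woman", "queen"), ("queen", "king"), ("woman", "king")]:
    sim = cosine_similarity(embeddings[w1], embeddings[w2])
    print(f"{w1} vs {w2}: {sim:.3f}")
```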

Don't worry, we will discuss word embeddings in detail later in this course.

Applications of Vector Space Models


Vector space models underpin a variety of NLP tasks, enabling:

  • Semantic Similarity: Computing the similarity between text documents or words based on their vector representations.
  • Information Retrieval: Enhancing search engines and recommendation systems to find content relevant to a user's query.
  • Text Classification and Clustering: Automatically categorizing documents into predefined classes or grouping similar documents together.
  • Natural Language Understanding: Facilitating deeper linguistic analyses that pave the way for applications like sentiment analysis, topic modeling, and more.
