
Similarity Measures: Cosine and Distance

Understanding how similar two documents are is a core task in text mining. In vector space models, each document is represented as a vector in a high-dimensional space, where each dimension corresponds to a term or feature. Measuring similarity in this space allows you to compare documents, cluster them, or retrieve the most relevant ones for a given query. The choice of similarity measure can strongly influence the results of document modeling and retrieval, making it crucial to understand the underlying mathematics and geometry.

Cosine similarity is one of the most widely used measures for comparing document vectors. The formula for cosine similarity between two vectors A and B is:

$$\text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{||\mathbf{A}|| \times ||\mathbf{B}||}$$

Here, A · B is the dot product of the two vectors, and ||A|| and ||B|| are their Euclidean norms (lengths). Geometrically, cosine similarity measures the cosine of the angle between the two vectors. If the vectors point in exactly the same direction, the similarity is 1; if they are orthogonal, the similarity is 0; if they point in opposite directions, the similarity is -1. In document spaces, negative values are rare because term frequencies are non-negative, so cosine similarity typically ranges from 0 to 1. This measure is especially useful because it focuses on the orientation of the vectors, not their magnitude, making it robust to differences in document length.
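To make the formula concrete, here is a minimal sketch in Python using NumPy. The two vectors are made-up term-frequency counts over a shared vocabulary, chosen so that one document repeats the other's word pattern at twice the length:

```python
import numpy as np

# Illustrative term-frequency vectors over a shared vocabulary (hypothetical counts)
doc_a = np.array([3, 0, 1, 2], dtype=float)
doc_b = np.array([6, 0, 2, 4], dtype=float)  # same word pattern, twice as long

# Cosine similarity: dot product divided by the product of the Euclidean norms
cosine_sim = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cosine_sim)  # ≈ 1.0 — identical orientation despite the different lengths
```

Because doc_b is simply doc_a scaled by two, the angle between them is zero and the similarity is (up to floating-point rounding) exactly 1.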

While cosine similarity focuses on the angle between vectors, distance-based measures such as Euclidean and Manhattan distances consider the absolute difference between vector positions. The Euclidean distance between vectors A and B is given by:

$$\text{euclidean\_distance}(\mathbf{A}, \mathbf{B}) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$

The Manhattan distance, also known as L1 distance, is defined as:

$$\text{manhattan\_distance}(\mathbf{A}, \mathbf{B}) = \sum_{i=1}^{n} |A_i - B_i|$$

Distance measures quantify how far apart two document vectors are in the feature space.

  • Euclidean distance is sensitive to the magnitude of the vectors and can be dominated by longer documents or features with large values;
  • Manhattan distance sums the absolute differences across all dimensions, which can be more robust to outliers in some cases.
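The following sketch computes both distances for the same hypothetical term-count vectors used above; only NumPy is assumed:

```python
import numpy as np

doc_a = np.array([3, 0, 1, 2], dtype=float)
doc_b = np.array([6, 0, 2, 4], dtype=float)

# Euclidean (L2) distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((doc_a - doc_b) ** 2))

# Manhattan (L1) distance: sum of the absolute differences
manhattan = np.sum(np.abs(doc_a - doc_b))

print(euclidean)  # sqrt(9 + 0 + 1 + 4) ≈ 3.74
print(manhattan)  # 3 + 0 + 1 + 2 = 6
```

Note that these two documents, which cosine similarity judged identical in orientation, are still far apart by both distance measures because one is simply longer than the other.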

You should choose cosine similarity when you care about the direction (pattern of term usage) rather than the length (total term count), such as when comparing documents of different lengths. Distance-based measures are more appropriate when absolute differences are meaningful, or when all documents are normalized to the same length. Understanding these properties helps you select the right similarity measure for your document modeling and retrieval tasks.
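As a rough illustration of this trade-off, the sketch below compares a query vector against a short and a long document that share the same term pattern. The vectors and helper functions are invented for the example; cosine similarity stays at 1.0 for both documents, while Euclidean distance grows with document length:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean(a, b):
    # Euclidean (L2) distance between two vectors
    return np.linalg.norm(a - b)

query     = np.array([1.0, 1.0, 0.0])
short_doc = np.array([2.0, 2.0, 0.0])    # same term pattern as the query
long_doc  = np.array([20.0, 20.0, 0.0])  # same pattern, ten times longer

print(cosine(query, short_doc), cosine(query, long_doc))        # ≈ 1.0 and ≈ 1.0
print(euclidean(query, short_doc), euclidean(query, long_doc))  # ≈ 1.41 vs ≈ 26.87
```

If the goal is to retrieve documents that use words in the same proportions as the query, cosine similarity ranks both documents equally well; a raw Euclidean ranking would penalize the longer document even though its content pattern matches the query just as closely.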

