Learn Cosine Similarity and Pearson Correlation | Collaborative Filtering and Behavioral Matching Systems


Cosine Similarity

Cosine similarity is a metric that quantifies the similarity between two vectors by measuring the cosine of the angle between them. It is defined mathematically as:

Cosine Similarity = $\frac {\sum_{i=1}^{n} A_i B_i} {\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}$

where A and B are user or item vectors. Geometrically, cosine similarity captures how aligned two vectors are, regardless of their magnitude. A value of 1 means the vectors point in the same direction (maximum similarity), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions. Because raw ratings are usually non-negative, cosine similarity computed on them typically falls between 0 and 1. In behavioral matching, cosine similarity is useful when you care about the pattern of preferences rather than their absolute values, such as when one user's ratings are a scaled-up version of another's.
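
To make the three reference values concrete, here is a minimal sketch that checks them on small hand-picked vectors. The vectors and variable names are illustrative assumptions, not data from this chapter.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the Euclidean norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative vectors (assumed for this sketch)
base = np.array([1.0, 2.0, 3.0])
scaled = 2 * base               # same direction, twice the magnitude
opposite = -base                # opposite direction
x_axis = np.array([1.0, 0.0])
y_axis = np.array([0.0, 1.0])   # orthogonal to x_axis

same_direction = cosine_similarity(base, scaled)
opposite_direction = cosine_similarity(base, opposite)
orthogonal = cosine_similarity(x_axis, y_axis)

print(f"Same direction: {same_direction:.4f}")
print(f"Opposite direction: {opposite_direction:.4f}")
print(f"Orthogonal: {orthogonal:.4f}")
```

Doubling every element of a vector leaves the similarity at 1, which is what "regardless of magnitude" means in practice.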

Pearson Correlation

Pearson correlation measures the strength and direction of the linear relationship between two vectors. It subtracts each vector's mean before comparing them, so it is equivalent to the cosine similarity of the mean-centered vectors. The formula is:

$\text{Pearson Correlation} = \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2} \cdot \sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}}$

where $\bar{A}$ and $\bar{B}$ are the means of vectors A and B. Pearson correlation ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 indicating no linear relationship. This metric is particularly useful when you want to compare users or items after removing individual biases, such as when one user tends to rate everything higher or lower than another.
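
The formula can be sketched directly in NumPy to show the bias removal at work. The function name pearson_correlation and the two example raters are assumptions made for this illustration.

```python
import numpy as np

def pearson_correlation(a, b):
    # Subtract each vector's own mean, then take the cosine of the centered vectors
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return np.dot(a_centered, b_centered) / (
        np.linalg.norm(a_centered) * np.linalg.norm(b_centered)
    )

# Illustrative raters (assumed): identical taste, different baseline
strict_rater = np.array([1.0, 2.0, 3.0, 2.0])
generous_rater = strict_rater + 2.0   # every rating is 2 points higher

shifted_corr = pearson_correlation(strict_rater, generous_rater)
print(f"Pearson after a constant shift: {shifted_corr:.4f}")
```

Adding a constant to every rating disappears once the mean is subtracted, so the two raters are treated as perfectly correlated.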

Comparison

Cosine similarity

Focuses on the orientation of the vectors and ignores their magnitude, but does not subtract their means;
Is a good choice for sparse data where many ratings are missing or zero, but can overestimate similarity when users have systematically different rating baselines, because all-positive vectors point in broadly the same direction.

Pearson correlation

Removes the average rating from each vector, making it robust to differences in scale or baseline preference;
Is better when you want to control for user or item bias, but may not capture non-linear relationships.
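
The contrast shows up clearly on a small made-up case: two users who both rate high but disagree on every item. The vectors below are assumptions chosen to exaggerate the effect, not data from this chapter.

```python
import numpy as np

# Illustrative users (assumed): both rate high, but their preferences alternate in opposite ways
user_a = np.array([5, 4, 5, 4])
user_b = np.array([4, 5, 4, 5])

# Cosine similarity on the raw ratings
cos_ab = np.dot(user_a, user_b) / (np.linalg.norm(user_a) * np.linalg.norm(user_b))

# Pearson correlation, which centers each vector first
pearson_ab = np.corrcoef(user_a, user_b)[0, 1]

print(f"Cosine Similarity: {cos_ab:.4f}")
print(f"Pearson Correlation: {pearson_ab:.4f}")
```

Cosine similarity comes out close to 1 because both vectors are dominated by their high baseline, while Pearson correlation is -1 because the users disagree on every item once that baseline is removed.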

Example: Calculating Both Metrics

Suppose you have two users who rated four items. Their ratings are:

User 1: [4, 0, 3, 5]
User 2: [5, 1, 2, 4]

You can compute both cosine similarity and Pearson correlation for these vectors to compare their behavioral similarity.


import numpy as np

# User ratings
user1 = np.array([4, 0, 3, 5])
user2 = np.array([5, 1, 2, 4])

# Cosine similarity
def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

cos_sim = cosine_similarity(user1, user2)
print(f"Cosine Similarity: {cos_sim:.4f}")

# Pearson correlation
pearson_corr = np.corrcoef(user1, user2)[0, 1]
print(f"Pearson Correlation: {pearson_corr:.4f}")
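
As a cross-check on the printed values, the same results can be worked out by hand from the two formulas. For cosine similarity:

$A \cdot B = 4 \cdot 5 + 0 \cdot 1 + 3 \cdot 2 + 5 \cdot 4 = 46$

$\sqrt{\sum_{i=1}^{n} A_i^2} = \sqrt{16 + 0 + 9 + 25} = \sqrt{50}, \quad \sqrt{\sum_{i=1}^{n} B_i^2} = \sqrt{25 + 1 + 4 + 16} = \sqrt{46}$

$\text{Cosine Similarity} = \frac{46}{\sqrt{50} \cdot \sqrt{46}} = \sqrt{0.92} \approx 0.9592$

For Pearson correlation, both users have a mean rating of 3, so the centered vectors are $(1, -3, 0, 2)$ and $(2, -2, -1, 1)$:

$\text{Pearson Correlation} = \frac{(1)(2) + (-3)(-2) + (0)(-1) + (2)(1)}{\sqrt{1 + 9 + 0 + 4} \cdot \sqrt{4 + 4 + 1 + 1}} = \frac{10}{\sqrt{140}} \approx 0.8452$

Cosine similarity is higher here because it still includes the shared baseline of the ratings, while Pearson correlation only compares deviations from each user's mean.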

Step-by-Step Code Explanation

Defining User Rating Arrays

user1 = np.array([4, 0, 3, 5])
user2 = np.array([5, 1, 2, 4])

You create two NumPy arrays, each representing a user's ratings for four items. These arrays are your user vectors.

Defining the Cosine Similarity Function

def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

Dot Product Calculation: dot_product = np.dot(a, b)
Multiplies corresponding elements of the two vectors and sums them. This measures how much the two vectors point in the same direction.
Norm Calculations:
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
Calculates the length (magnitude) of each user vector. The norm is the square root of the sum of the squares of the vector's elements.
Similarity Calculation: return dot_product / (norm_a * norm_b)
Divides the dot product by the product of the two norms. The result is the cosine of the angle between the two vectors, indicating their similarity.
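
One edge case the function above does not handle is a vector of all zeros, for example a user with no ratings: its norm is 0, so the division produces nan. A minimal sketch of a guarded variant follows. The name safe_cosine_similarity and the choice to return 0.0 are assumptions for illustration; other conventions, such as skipping the pair entirely, are equally reasonable.

```python
import numpy as np

def safe_cosine_similarity(a, b):
    # Hypothetical helper: treats a zero-length vector as having no similarity
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention, avoids dividing by zero
    return float(np.dot(a, b) / (norm_a * norm_b))

no_ratings = np.array([0, 0, 0, 0])   # illustrative user with no ratings
user2 = np.array([5, 1, 2, 4])

guarded = safe_cosine_similarity(no_ratings, user2)
print(f"Cosine Similarity with an empty user: {guarded:.4f}")
```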

Computing Cosine Similarity and Printing

cos_sim = cosine_similarity(user1, user2)
print(f"Cosine Similarity: {cos_sim:.4f}")

You call the function with your user vectors to compute their cosine similarity, then print the result formatted to four decimal places.

Computing Pearson Correlation and Printing

pearson_corr = np.corrcoef(user1, user2)[0, 1]
print(f"Pearson Correlation: {pearson_corr:.4f}")

You use np.corrcoef to calculate the Pearson correlation matrix for the two user vectors. The value at position [0, 1] gives the correlation between user1 and user2. Printing this value shows how strongly the users' rating patterns are linearly related after removing their average rating bias.
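
To see why the [0, 1] index is needed, it can help to look at the whole matrix that np.corrcoef returns. This is a small sketch using the same two users; the variable name corr_matrix is an assumption.

```python
import numpy as np

user1 = np.array([4, 0, 3, 5])
user2 = np.array([5, 1, 2, 4])

# Each input vector is treated as one variable, so two inputs give a 2x2 matrix
corr_matrix = np.corrcoef(user1, user2)
print(corr_matrix.shape)

# The diagonal holds each user's correlation with themselves,
# and the two off-diagonal entries hold the same user1-user2 correlation
print(np.round(corr_matrix, 4))
```

The matrix is symmetric, so [0, 1] and [1, 0] give the same value.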

1. What is the main difference between cosine similarity and Pearson correlation when comparing user vectors?

2. Which metric should you choose if you want to compare users who rate items on very different scales (for example, one user always gives high ratings and another gives low ratings)?


Section 3. Chapter 2
