Correlation Analysis: Pearson and Spearman | Bivariate and Correlation Analysis

Understanding the relationships between numerical features is crucial for uncovering patterns in retail data.

Pearson vs. Spearman Correlation

Pearson correlation coefficient:
- Measures the strength and direction of a linear relationship between two continuous variables;
- Is sensitive to outliers, and its significance tests assume approximately normally distributed data;
- Most appropriate when the relationship is linear and variables are continuous.
Spearman rank correlation coefficient:
- Assesses monotonic relationships, whether linear or not;
- Compares the rank order of the data, not the raw values;
- More robust to outliers and non-normality;
- Ideal for ordinal data, non-linear but monotonic trends, or when Pearson's assumptions are violated.
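The link between the two coefficients can be made concrete: Spearman is Pearson applied to ranks rather than raw values. The following is a minimal sketch of that idea, assuming pandas is available; the `x` and `y` values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical values chosen only for illustration.
df = pd.DataFrame({
    "x": [10, 20, 30, 40, 50],
    "y": [2, 1, 9, 16, 12],
})

# Spearman as reported by pandas.
spearman = df.corr(method="spearman").loc["x", "y"]

# The same quantity: rank each column, then apply Pearson to the ranks.
pearson_on_ranks = df.rank().corr(method="pearson").loc["x", "y"]

print(f"Spearman:         {spearman:.3f}")
print(f"Pearson on ranks: {pearson_on_ranks:.3f}")
```

Both lines should print the same number, because ranking discards the spacing between values and keeps only their order.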

Use Pearson when:

- Both variables are continuous;
- You expect a linear association.

Use Spearman when:

- You have ordinal data;
- The relationship is monotonic but not linear;
- Your data contain outliers or violate Pearson's assumptions.
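To see why outliers matter for this choice, the sketch below compares both coefficients before and after a single value is corrupted. The numbers are made up for illustration (imagine a data-entry error that turns 28 into 280); they are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical measurements; y rises roughly in step with x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_clean = [12, 15, 14, 18, 20, 19, 23, 25, 24, 28]

# Same data with one corrupted value: the last entry 28 becomes 280.
y_outlier = y_clean[:-1] + [280]

clean = pd.DataFrame({"x": x, "y": y_clean})
dirty = pd.DataFrame({"x": x, "y": y_outlier})

pearson_clean = clean.corr(method="pearson").loc["x", "y"]
pearson_outlier = dirty.corr(method="pearson").loc["x", "y"]
spearman_clean = clean.corr(method="spearman").loc["x", "y"]
spearman_outlier = dirty.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  clean={pearson_clean:.3f}, with outlier={pearson_outlier:.3f}")
print(f"Spearman: clean={spearman_clean:.3f}, with outlier={spearman_outlier:.3f}")
```

In this sketch the corrupted value was already the largest, so the rank order is unchanged and Spearman stays the same, while Pearson drops noticeably. An outlier that also changes the rank order would move Spearman too, though usually by less.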


import pandas as pd

# Example retail dataset
data = {
    "sales": [200, 220, 250, 270, 300, 320, 350, 370, 400, 420],
    "foot_traffic": [50, 55, 60, 65, 68, 70, 75, 80, 85, 90],
    "discount": [5, 10, 0, 0, 15, 20, 25, 10, 5, 0]
}
df = pd.DataFrame(data)

# Compute Pearson correlation
pearson_corr = df.corr(method="pearson")
print("Pearson correlation matrix:")
print(pearson_corr)

# Compute Spearman correlation
spearman_corr = df.corr(method="spearman")
print("\nSpearman correlation matrix:")
print(spearman_corr)

When interpreting correlation results, pay attention to both the strength and the nature of the relationship:

- A Pearson correlation value close to 1 or -1 signals a strong linear relationship;
- Values near 0 suggest little or no linear association.

For example, if sales and foot_traffic show a Pearson correlation of 0.98, then as foot traffic increases, sales tend to increase in a nearly linear way.
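When only one pair of variables is of interest, it can be easier to read a single coefficient out of the matrix than to scan the whole table. A minimal sketch, reusing the example retail dataset from above and assuming label-based lookup with `.loc` is acceptable for your workflow:

```python
import pandas as pd

# Same illustrative retail dataset as in the example above.
df = pd.DataFrame({
    "sales": [200, 220, 250, 270, 300, 320, 350, 370, 400, 420],
    "foot_traffic": [50, 55, 60, 65, 68, 70, 75, 80, 85, 90],
    "discount": [5, 10, 0, 0, 15, 20, 25, 10, 5, 0],
})

pearson_corr = df.corr(method="pearson")
spearman_corr = df.corr(method="spearman")

# Read one pair out of each matrix by row and column label.
r_pearson = pearson_corr.loc["sales", "foot_traffic"]
r_spearman = spearman_corr.loc["sales", "foot_traffic"]

print(f"sales vs foot_traffic: Pearson={r_pearson:.3f}, Spearman={r_spearman:.3f}")
```

The matrices are symmetric, so `pearson_corr.loc["foot_traffic", "sales"]` gives the same value, and the diagonal is always 1 because each variable is perfectly correlated with itself.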

However, if your data contain outliers or the relationship is not linear, the Pearson coefficient may be misleading. In these cases, use the Spearman correlation, which captures monotonic trends:

- Monotonic means one variable consistently increases, or consistently decreases, as the other increases, regardless of the exact shape of the relationship;
- For instance, if discount and sales have a Spearman correlation of -0.70, higher discounts generally correspond to lower sales ranks, even if the pattern is not strictly linear.
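The difference between "monotonic" and "linear" can be illustrated with a curve that always rises but not at a constant rate. The sketch below uses a hypothetical cubic relationship, `y = x ** 3`, purely as an example of such a curve.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but along a curve.
x = list(range(1, 11))
df = pd.DataFrame({"x": x, "y": [v ** 3 for v in x]})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman reaches 1 here because the rank order of `y` matches the rank order of `x` exactly, while Pearson stays below 1 because the points do not lie on a straight line.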

Always consider both the context and the type of relationship in your data when deciding which correlation metric to use.

Section 3. Chapter 4