Correlation Analysis: Pearson and Spearman
Understanding the relationships between numerical features is crucial for uncovering patterns in retail data.
Pearson vs. Spearman Correlation
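Pearson correlation coefficient:
- Measures the strength and direction of a linear relationship between two continuous variables;
- Does not itself require normally distributed data, but the usual significance tests and confidence intervals for it assume approximately bivariate-normal data, and it is sensitive to outliers;
- Most appropriate when the relationship is linear and variables are continuous.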
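To make the "linear relationship" idea concrete, here is a minimal sketch that computes the Pearson coefficient from its definition (the covariance of the two variables divided by the product of their standard deviations) and compares it with the pandas result. The values are the sales and foot_traffic figures from this lesson's example dataset; the variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Sales and foot-traffic figures from the lesson's example dataset
sales = pd.Series([200, 220, 250, 270, 300, 320, 350, 370, 400, 420], dtype=float)
foot_traffic = pd.Series([50, 55, 60, 65, 68, 70, 75, 80, 85, 90], dtype=float)

# Center each variable around its mean
sales_centered = sales - sales.mean()
traffic_centered = foot_traffic - foot_traffic.mean()

# Pearson r: sum of co-deviations divided by the product of the deviation norms
r_manual = (sales_centered * traffic_centered).sum() / np.sqrt(
    (sales_centered ** 2).sum() * (traffic_centered ** 2).sum()
)

# The same quantity via pandas (Pearson is the default method)
r_pandas = sales.corr(foot_traffic)

print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

Both numbers should agree up to floating-point error, which shows that the coefficient only depends on how the two variables deviate from their means together.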
Spearman rank correlation coefficient:
- Assesses monotonic relationships, whether linear or not;
- Compares the rank order of the data, not the raw values;
- More robust to outliers and non-normality;
- Ideal for ordinal data, non-linear but monotonic trends, or when Pearson's assumptions are violated.
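The rank-based nature of Spearman can be sketched as follows: on made-up data where y grows monotonically but not linearly with x, Spearman should equal the Pearson correlation of the rank-transformed values. The data here are hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical data: y increases with x, but along a cubic curve rather than a line
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = x ** 3

pearson = x.corr(y, method="pearson")
spearman = x.corr(y, method="spearman")

# Spearman is the Pearson correlation applied to ranks instead of raw values
pearson_on_ranks = x.rank().corr(y.rank(), method="pearson")

print(f"Pearson:          {pearson:.3f}")
print(f"Spearman:         {spearman:.3f}")
print(f"Pearson on ranks: {pearson_on_ranks:.3f}")
```

Because the ordering of x and y is identical, Spearman reports a perfect monotonic relationship, while Pearson stays below 1 since the points do not lie on a straight line.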
Use Pearson when:
- Both variables are continuous;
- You expect a linear association.
Use Spearman when:
- You have ordinal data;
- The relationship is monotonic but not linear;
- Your data contain outliers or violate Pearson's assumptions.
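As a rough illustration of the outlier point, the sketch below takes a perfectly linear, made-up series and replaces one value with an extreme one that still preserves the ordering. All names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical, perfectly linear data
x = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 26, 28], dtype=float)
y_clean = 3 * x + 5

# Same data, but the last observation is replaced by an extreme value
y_outlier = y_clean.copy()
y_outlier.iloc[-1] = 2000.0

pearson_clean = x.corr(y_clean, method="pearson")
pearson_outlier = x.corr(y_outlier, method="pearson")
spearman_outlier = x.corr(y_outlier, method="spearman")

print(f"Pearson without outlier: {pearson_clean:.3f}")
print(f"Pearson with outlier:    {pearson_outlier:.3f}")
print(f"Spearman with outlier:   {spearman_outlier:.3f}")
```

The single extreme value pulls the Pearson coefficient down noticeably, whereas Spearman is unchanged because the rank order of the observations is the same. An outlier that also breaks the ordering would affect Spearman too, though usually less severely.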
    import pandas as pd

    # Example retail dataset
    data = {
        "sales": [200, 220, 250, 270, 300, 320, 350, 370, 400, 420],
        "foot_traffic": [50, 55, 60, 65, 68, 70, 75, 80, 85, 90],
        "discount": [5, 10, 0, 0, 15, 20, 25, 10, 5, 0]
    }
    df = pd.DataFrame(data)

    # Compute Pearson correlation
    pearson_corr = df.corr(method="pearson")
    print("Pearson correlation matrix:")
    print(pearson_corr)

    # Compute Spearman correlation
    spearman_corr = df.corr(method="spearman")
    print("\nSpearman correlation matrix:")
    print(spearman_corr)
When interpreting correlation results, pay attention to both the strength and the nature of the relationship:
- A Pearson correlation value close to 1 or -1 signals a strong linear relationship;
- Values near 0 suggest little or no linear association.
For example:
- If sales and foot_traffic show a Pearson correlation of 0.98, this means that as foot traffic increases, sales tend to increase in a nearly linear way.
However:
- If your data contains outliers or the relationship is not linear, the Pearson coefficient may be misleading.
In these cases, use the Spearman correlation, which captures monotonic trends:
- Monotonic means one variable consistently increases or decreases as the other changes, regardless of the exact shape of the relationship;
- For instance, if discount and sales have a Spearman correlation of -0.70, higher discounts generally correspond to lower sales ranks, even if the pattern is not strictly linear.
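To compare the two metrics for specific pairs rather than reading full matrices, one possible approach is to compute both coefficients per pair and place them side by side. The sketch below rebuilds the lesson's example dataset; the `pairs` list and the `comparison` table are names introduced here for illustration. The coefficients quoted in the text are illustrative, so the printed values need not match them.

```python
import pandas as pd

# The lesson's example retail dataset
df = pd.DataFrame({
    "sales": [200, 220, 250, 270, 300, 320, 350, 370, 400, 420],
    "foot_traffic": [50, 55, 60, 65, 68, 70, 75, 80, 85, 90],
    "discount": [5, 10, 0, 0, 15, 20, 25, 10, 5, 0],
})

# Pairs of columns to inspect (chosen for illustration)
pairs = [("sales", "foot_traffic"), ("sales", "discount")]

rows = []
for a, b in pairs:
    rows.append({
        "pair": f"{a} vs {b}",
        "pearson": df[a].corr(df[b], method="pearson"),
        "spearman": df[a].corr(df[b], method="spearman"),
    })

# One row per pair, one column per metric
comparison = pd.DataFrame(rows).set_index("pair").round(3)
print(comparison)
```

A large gap between the two columns for a pair is a hint that the relationship may be non-linear or affected by extreme values, and is worth checking with a scatter plot.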
Always consider both the context and the type of relationship in your data when deciding which correlation metric to use.