Learn Statistical Significance and p-values | Statistical Foundations for A/B Testing

Statistical significance is a foundational concept in A/B testing, helping you decide whether observed differences between groups are likely due to chance or reflect a real effect. When you run an A/B test, you compare metrics (such as conversion rates) between two or more groups. However, just observing a difference does not mean it is meaningful; random variation can create apparent differences even when there is no true effect.

This is where the concept of the p-value comes in. The p-value measures the probability of obtaining results at least as extreme as those observed, assuming that there is actually no difference between the groups (that is, that the "null hypothesis" is true). A low p-value means that such an extreme result would be unlikely if there were truly no effect, which provides evidence against the null hypothesis.

Correct interpretation:

A p-value of 0.03 means that, if there were no true difference, there would be a 3% probability of seeing a difference as large as (or larger than) the one observed.

Incorrect interpretation:

A p-value of 0.03 does not mean there is a 97% chance your result is real;
A p-value does not tell you the probability that the null hypothesis is true or false.

Suppose you run an A/B test comparing the click-through rate (CTR) of two website versions. If you observe a p-value of 0.01, a difference this large would rarely occur by chance alone, and you may conclude that the new version performs differently. However, if the p-value is 0.50, a difference this large would be expected about half the time from random variation alone, so the data give you no grounds to claim a real effect.

Common misconceptions include believing that a small p-value guarantees practical importance or that a non-significant result proves there is no effect. In reality, statistical significance only addresses how surprising the observed result would be if chance alone were at work, not whether the effect is large, useful, or important.


import numpy as np
from scipy import stats

# Example: A/B test comparing conversion rates
# Group A: 1000 users, 120 converted
# Group B: 1000 users, 150 converted

# Number of successes (conversions) and trials (users) per group
success_a, n_a = 120, 1000
success_b, n_b = 150, 1000

# Conversion rates, derived from the counts above
conv_a = success_a / n_a
conv_b = success_b / n_b

# Calculating pooled probability
p_pool = (success_a + success_b) / (n_a + n_b)

# Standard error
se = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))

# Z-score
z = (conv_b - conv_a) / se

# Two-tailed p-value (sf is the survival function, 1 - cdf, which is more accurate in the tails)
p_value = 2 * stats.norm.sf(abs(z))

print(f"Conversion rate A: {conv_a:.3f}")
print(f"Conversion rate B: {conv_b:.3f}")
print(f"Z-score: {z:.2f}")
print(f"P-value: {p_value:.4f}")

# Interpretation:
# If p-value < 0.05, result is considered statistically significant.
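The p-value computed above can also be approximated by simulation, which makes its definition concrete: assume both groups share the same conversion rate, replay the experiment many times, and count how often chance alone produces a gap at least as large as the observed one. The following is a minimal sketch using only the standard library; the seed, the number of simulations, and the helper name `simulate_conversions` are illustrative assumptions, not part of the original example.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

n_users = 1000        # users per group, as in the example above
pooled_rate = 0.135   # (120 + 150) / 2000: the shared rate assumed under the null
observed_gap = 30     # observed difference in conversions: 150 - 120
n_simulations = 2000  # assumed value; more runs give a smoother estimate


def simulate_conversions(n, rate):
    """Count conversions among n users who each convert with probability `rate`."""
    return sum(random.random() < rate for _ in range(n))


extreme_runs = 0
for _ in range(n_simulations):
    a = simulate_conversions(n_users, pooled_rate)
    b = simulate_conversions(n_users, pooled_rate)
    # Two-tailed: a gap in either direction counts as "at least as extreme"
    if abs(b - a) >= observed_gap:
        extreme_runs += 1

simulated_p = extreme_runs / n_simulations
print(f"Simulated p-value: {simulated_p:.3f}")
```

The simulated value should land close to the analytic p-value, with small differences coming from simulation noise and from the normal approximation used by the z-test.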

Definition

A z-score measures how many standard deviations an observation or data point is from the mean of a distribution. In hypothesis testing, you use the z-score to determine how extreme your observed difference is compared to what is expected under the null hypothesis. A larger absolute z-score corresponds to a smaller p-value, that is, to a result that would be more surprising if there were no real effect.
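To see how z-scores translate into p-values, the short sketch below converts a few z-scores into two-tailed p-values under the standard normal distribution. It uses only Python's standard library; the function name `two_tailed_p` and the chosen z-scores are illustrative.

```python
from statistics import NormalDist


def two_tailed_p(z: float) -> float:
    """Two-tailed p-value for a z-score under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A few illustrative z-scores, from unremarkable to very extreme
for z in (0.5, 1.0, 1.645, 1.96, 2.576, 3.0):
    print(f"z = {z:>5.3f}  ->  two-tailed p = {two_tailed_p(z):.4f}")
```

A z-score of about 1.96 corresponds to a two-tailed p-value of about 0.05, which is why that value appears so often as a cutoff.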

The most common threshold for statistical significance is 0.05. If your p-value is below this threshold, you typically say the result is "statistically significant" - meaning the evidence against the null hypothesis is strong enough to reject it. However, the choice of threshold is arbitrary and should be considered in the context of your test.

It is important to remember the limitations of p-values:

A p-value only tells you how surprising your data would be if there were no effect; it does not measure the magnitude or importance of an effect;
Statistical significance does not guarantee practical significance or business impact;
P-values can be misleading if the sample size is too small (real effects may go undetected) or very large (trivially small differences become significant), or if multiple tests are performed without adjustment.
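Two of these limitations are easy to demonstrate numerically. The sketch below applies the same pooled two-proportion z-test to a hypothetical lift from 10.0% to 10.5% at two different sample sizes, and then computes the chance of at least one false positive across several tests. All numbers are made up for illustration, the function names are hypothetical, and the multiple-testing calculation assumes the tests are independent.

```python
import math


def two_proportion_p_value(success_a, n_a, success_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # erfc(|z| / sqrt(2)) equals the two-tailed normal p-value
    return math.erfc(abs(z) / math.sqrt(2))


# Same 0.5 percentage-point lift, very different sample sizes (hypothetical data)
p_small = two_proportion_p_value(100, 1_000, 105, 1_000)
p_large = two_proportion_p_value(100_000, 1_000_000, 105_000, 1_000_000)
print(f"n = 1,000 per group:     p = {p_small:.4f}")
print(f"n = 1,000,000 per group: p = {p_large:.2e}")


def false_positive_chance(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 20):
    print(f"{k:>2} tests: chance of at least one false positive = {false_positive_chance(k):.2f}")
```

The lift is identical in both scenarios, yet only the large sample yields a small p-value, so the p-value alone says nothing about whether a 0.5-point lift matters. Likewise, running many unadjusted tests makes at least one "significant" result likely even when no effect exists.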

Always interpret p-values alongside other metrics, such as effect size and confidence intervals, and be cautious about drawing strong conclusions from statistical significance alone.
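As one way to report effect size alongside the p-value, the sketch below computes the difference in conversion rates for the earlier example (120/1000 vs. 150/1000) together with an approximate 95% confidence interval. It assumes a normal-approximation (Wald) interval with an unpooled standard error, which is one common choice among several; the function name `diff_confidence_interval` is illustrative.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(success_a, n_a, success_b, n_b, confidence=0.95):
    """Difference in proportions (B minus A) with a Wald confidence interval."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    diff = p_b - p_a
    # Unpooled standard error: each group keeps its own variance estimate
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Critical value, about 1.96 for a 95% interval
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, diff - z_crit * se, diff + z_crit * se


diff, low, high = diff_confidence_interval(120, 1000, 150, 1000)
print(f"Observed lift: {diff:.3f}")
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

For these numbers the interval runs from just above zero to about six percentage points, which shows how uncertain the size of the lift remains even though the test result is borderline significant.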


Section 3. Chapter 1
