Learn Type I and Type II Errors | Statistical Foundations for A/B Testing

Understanding error types is crucial for interpreting A/B test results. In hypothesis testing, a Type I error, also called a false positive, occurs when you reject a null hypothesis that is actually true. This means you conclude a difference exists when, in reality, there is none. For example, if you run an A/B test to see whether a new button color increases clicks and you get a statistically significant result purely by chance (even though the new color has no real effect), you have made a Type I error.

A Type II error, also called a false negative, happens when you fail to reject a null hypothesis that is actually false. This means you miss a real effect. Imagine your new feature actually increases user engagement, but your test fails to detect the improvement, perhaps because your sample size is too small or your metric is too noisy. In this case, you have made a Type II error.

Real-world scenarios help illustrate these errors:

Type I error (false positive): Launching a new checkout flow based on a test that incorrectly indicated higher conversion, leading to wasted development resources;
Type II error (false negative): Missing a valuable opportunity by not rolling out a feature that actually improves retention, because the test failed to detect its effect.

import numpy as np

# Simulating 10,000 A/B tests where there is actually no effect (null hypothesis true)
np.random.seed(42)
n_tests = 10000
alpha = 0.05  # significance level

# Simulating p-values uniformly distributed between 0 and 1 (no true effect)
p_values = np.random.uniform(0, 1, n_tests)

# Type I error: proportion of tests where p-value < alpha (false positives)
type1_errors = np.sum(p_values < alpha)
type1_error_rate = type1_errors / n_tests

print(f"Type I error rate (alpha={alpha}): {type1_error_rate:.3f}")

# Simulating 10,000 A/B tests where there IS a real effect (null hypothesis false)
# Assume power = 0.8 (80% chance to detect the effect)
power = 0.8
# Each test detects the effect (p < alpha) with probability `power`;
# every test that misses it is a false negative
detected = np.random.uniform(0, 1, n_tests) < power
false_negatives = np.sum(~detected)
type2_error_rate = false_negatives / n_tests

# Format beta explicitly: 1 - 0.8 is not exactly 0.2 in floating point
print(f"Type II error rate (beta={1 - power:.1f}): {type2_error_rate:.3f}")
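
The simulation above draws p-values directly from a uniform distribution, which is a shortcut. As a complementary sketch, the example below generates conversion counts for two variants and runs a pooled two-proportion z-test on each simulated experiment. The helper names (`two_proportion_p_value`, `rejection_rate`), the 10% and 12% conversion rates, and the 2,000 users per variant are illustrative assumptions, not values from this lesson.

```python
import math
import numpy as np

def two_proportion_p_value(conv_a, conv_b, n):
    """Two-sided p-values of a pooled two-proportion z-test (equal group sizes)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    # z = (rate_b - rate_a) / se; leave z at 0 where se is 0 to avoid dividing by zero
    z = np.divide(conv_b - conv_a, n * se, out=np.zeros_like(se), where=se > 0)
    # Two-sided normal tail: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

def rejection_rate(rate_a, rate_b, n, n_tests, alpha, rng):
    """Fraction of simulated A/B tests in which the null hypothesis is rejected."""
    conv_a = rng.binomial(n, rate_a, size=n_tests)  # conversions in variant A
    conv_b = rng.binomial(n, rate_b, size=n_tests)  # conversions in variant B
    return float(np.mean(two_proportion_p_value(conv_a, conv_b, n) < alpha))

rng = np.random.default_rng(42)
alpha = 0.05
n_tests = 5000
n_users = 2000  # users per variant (assumed)

# No real effect: every rejection is a false positive (Type I error)
type1_rate = rejection_rate(0.10, 0.10, n_users, n_tests, alpha, rng)

# Real effect (10% vs 12%): every non-rejection is a false negative (Type II error)
type2_rate = 1 - rejection_rate(0.10, 0.12, n_users, n_tests, alpha, rng)

print(f"Empirical Type I error rate:  {type1_rate:.3f}")
print(f"Empirical Type II error rate: {type2_rate:.3f}")
```

Under these assumptions the Type I rate should land close to alpha, while the Type II rate is far above the conventional 20% target, because 2,000 users per variant is too few to reliably detect a two-point lift.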

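There is a trade-off between the significance level (alpha), power (1 - beta), and the two error rates. With the sample size held fixed, lowering alpha reduces the chance of a Type I error but increases the risk of a Type II error. Power grows with sample size and with the size of the true effect, so collecting more data is usually the practical way to reduce Type II errors, since you rarely control how large the effect is. Strategies to minimize errors include: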

Choosing an appropriate significance level based on business risk;
Ensuring adequate sample size to detect meaningful effects;
Pre-registering hypotheses to avoid "p-hacking";
Running sensitivity analyses to understand the impact of different thresholds.
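
To make the sample-size and threshold points concrete, here is a minimal sketch based on the standard normal-approximation formula for a two-sided two-proportion z-test. The function names (`sample_size_per_group`, `approx_power`) and the example rates (10% baseline, 12% variant) are hypothetical choices for illustration, and the result is an approximation rather than an exact power analysis.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

def sample_size_per_group(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect p_base vs p_variant."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = STD_NORMAL.inv_cdf(power)           # quantile for the target power
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_variant - p_base) ** 2)

def approx_power(p_base, p_variant, n, alpha=0.05):
    """Approximate power with n users per variant (ignores the negligible opposite tail)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se = sqrt((p_base * (1 - p_base) + p_variant * (1 - p_variant)) / n)
    return STD_NORMAL.cdf(abs(p_variant - p_base) / se - z_alpha)

# Sensitivity analysis: how the threshold changes the required sample size,
# and the Type II error rate (beta) if only 2,000 users per variant are available
for alpha in (0.10, 0.05, 0.01):
    n_needed = sample_size_per_group(0.10, 0.12, alpha=alpha)
    beta = 1 - approx_power(0.10, 0.12, n=2000, alpha=alpha)
    print(f"alpha={alpha:.2f}: n per variant for 80% power = {n_needed}, "
          f"beta at n=2000 = {beta:.2f}")
```

A stricter alpha raises both the required sample size and, at a fixed sample size, the Type II error rate, which is the trade-off described above.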

Balancing these factors helps you make more reliable decisions from your A/B tests.
