Risk Minimization: Expected vs Empirical Risk
Understanding how machine learning models learn from data requires a grasp of the concepts of expected risk and empirical risk. In statistical learning theory, the expected risk is defined as the average loss a model incurs across all possible data points drawn from the true, but usually unknown, data distribution. Mathematically, this is written as:
$$R(f) = \mathbb{E}_{(x, y) \sim P}\big[L(y, f(x))\big]$$
where f is your model, L is the loss function, (x, y) represents a data point and its label, and P is the true data distribution. This formulation captures the ideal scenario: evaluating your model over every possible input it might encounter in the real world.
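Because P is unknown in practice, R(f) cannot be computed directly, but the definition becomes concrete on synthetic data where we choose P ourselves. The sketch below is purely illustrative: it assumes a squared-error loss, a hand-picked linear model f, and an invented Gaussian distribution for P, and approximates the expectation by Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A fixed illustrative model: f(x) = 2x.
    return 2.0 * x

def loss(y, y_pred):
    # Squared-error loss: L(y, f(x)) = (y - f(x))^2.
    return (y - y_pred) ** 2

def sample_P(n):
    # A synthetic "true" distribution P, invented for this example:
    # x ~ N(0, 1), y = 3x + small Gaussian noise.
    x = rng.normal(0.0, 1.0, size=n)
    y = 3.0 * x + rng.normal(0.0, 0.1, size=n)
    return x, y

# Monte Carlo estimate of the expected risk R(f): average the loss
# over a very large sample drawn from P.
x, y = sample_P(1_000_000)
expected_risk = np.mean(loss(y, f(x)))
print(f"Approximate expected risk: {expected_risk:.4f}")
```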
In reality, you do not have access to the entire data distribution P. Instead, you only have a finite dataset—your training data. To address this, you use the empirical risk, which averages the loss over just the observed data points. This is given by:
$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$
where n is the number of samples in your dataset and (x_i, y_i) are the observed pairs. Empirical risk serves as a practical stand-in for expected risk, allowing you to optimize your model using the data at hand.
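This formula translates directly into code. Here is a minimal sketch, assuming a squared-error loss and a small hypothetical dataset (both invented for illustration):

```python
import numpy as np

def empirical_risk(model, loss, X, y):
    """Average loss of `model` over the observed pairs (x_i, y_i)."""
    predictions = model(X)
    return np.mean(loss(y, predictions))

# Hypothetical finite dataset of n = 5 observed pairs.
X = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 2.1, 2.9, 4.2, 4.8])

squared_loss = lambda y_true, y_pred: (y_true - y_pred) ** 2
model = lambda x: 2.0 * x  # the same illustrative model as above

print(f"Empirical risk: {empirical_risk(model, squared_loss, X, y):.4f}")
```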
Empirical risk is the practical approximation of expected risk because the true data distribution is unknown and only a finite dataset is available for training.
Minimizing empirical risk is at the heart of most machine learning algorithms. By finding model parameters that reduce the average loss over the training data, you hope to also reduce the expected risk on unseen data. However, relying solely on empirical risk can lead to overfitting: the model may fit the training data very closely, capturing noise or peculiarities specific to that dataset rather than general patterns. When this happens, the model's performance on new, unseen data (its true expected risk) may be poor, even though the empirical risk is very low. This highlights a fundamental challenge in machine learning—striking a balance between fitting your data well and ensuring your model generalizes beyond it.
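The sketch below illustrates this gap. It fits least-squares polynomials of increasing degree (standing in for empirical risk minimization) to a small synthetic training set, and uses a large held-out set as a proxy for the expected risk; all data and model choices are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic task: y = sin(3x) + noise; all choices here are illustrative.
def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(15)     # small training set
x_test, y_test = make_data(10_000)   # large held-out set: proxy for expected risk

for degree in (1, 3, 10):
    # Empirical risk minimization: least-squares polynomial fit on training data.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_risk = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_risk = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree:2d}: empirical risk {train_risk:.4f}, "
          f"held-out risk {test_risk:.4f}")
```

As the degree grows, the empirical risk on the training set shrinks toward zero while the held-out risk worsens: the signature of overfitting.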