L1 vs L2 Regularization: Intuition and Effects
L1 and L2 regularization are two powerful techniques to control model complexity and prevent overfitting, especially in linear models. Both methods add a penalty term to the loss function, but they use different mathematical formulations that lead to distinct behaviors.
The L1 penalty, used in Lasso regression, adds the sum of the absolute values of the coefficients to the loss. For coefficients $w$, the Lasso penalty is $\lambda \sum_j |w_j|$, where $\lambda$ is a non-negative regularization strength.
This formulation can shrink some coefficients entirely to zero, effectively performing feature selection.
The loss function for Lasso regression is:

$$\text{Loss} = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} |w_j|$$
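To make the objective concrete, here is a minimal NumPy sketch (illustrative data and weight values, not part of the lesson's own code) that evaluates this penalized loss for a given weight vector. Note that scikit-learn's Lasso scales the squared-error term by 1/(2n), so its internal objective differs from the plain sum used here by a constant factor.

import numpy as np

def lasso_loss(X, y, w, lam):
    """Sum of squared residuals plus an L1 penalty on the weights."""
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))

# Tiny illustration: three features, the second one irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = X @ np.array([2.0, 0.0, -1.0])

dense_w = np.array([1.9, 0.3, -0.9])   # uses every feature
sparse_w = np.array([2.0, 0.0, -1.0])  # drops the irrelevant feature
print(lasso_loss(X, y, dense_w, lam=1.0))
print(lasso_loss(X, y, sparse_w, lam=1.0))  # smaller: exact fit and a smaller L1 penalty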
The L2 penalty, used in Ridge regression, adds the squared magnitude of the coefficients to the loss function. Mathematically, the Ridge penalty is $\lambda \sum_j w_j^2$. This encourages all coefficients to be small, but rarely drives them exactly to zero.
The loss function for Ridge regression becomes:

$$\text{Loss} = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} w_j^2$$
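Because the squared penalty is smooth, Ridge regression also has a closed-form solution, $w = (X^\top X + \lambda I)^{-1} X^\top y$. The sketch below is a minimal check of that formula against scikit-learn's Ridge; it uses synthetic data and assumes no intercept (fit_intercept=False) so that the two objectives match exactly.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 1.0

# Closed-form Ridge solution: solve (X^T X + lambda I) w = X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimizes ||y - Xw||^2 + alpha * ||w||^2 when fit_intercept=False
ridge = Ridge(alpha=lam, fit_intercept=False)
ridge.fit(X, y)

print(np.allclose(w_closed, ridge.coef_))  # expected: True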
The key difference is that L2 regularization tends to distribute shrinkage more evenly across all coefficients, while L1 regularization can produce sparse solutions by setting some coefficients exactly to zero.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# Generate synthetic data with 10 features, only 3 informative
X, y, coef = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=3,
    noise=10,
    coef=True,
    random_state=42
)

# Sweep the regularization strength over a log-spaced grid
alphas = np.logspace(-2, 2, 100)
coefs_lasso = []
coefs_ridge = []

# Fit Lasso and Ridge at each alpha and record the coefficient vectors
for a in alphas:
    lasso = Lasso(alpha=a, max_iter=10000)
    lasso.fit(X, y)
    coefs_lasso.append(lasso.coef_)

    ridge = Ridge(alpha=a)
    ridge.fit(X, y)
    coefs_ridge.append(ridge.coef_)

# Plot the coefficient paths side by side
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.plot(alphas, coefs_lasso)
plt.xscale("log")
plt.xlabel("alpha (L1 penalty strength)")
plt.ylabel("Coefficient value")
plt.title("Lasso Paths (L1)")
plt.axhline(0, color="black", linestyle="--", linewidth=1)
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(alphas, coefs_ridge)
plt.xscale("log")
plt.xlabel("alpha (L2 penalty strength)")
plt.ylabel("Coefficient value")
plt.title("Ridge Paths (L2)")
plt.axhline(0, color="black", linestyle="--", linewidth=1)
plt.grid(True)

plt.tight_layout()
plt.show()
Looking at the coefficient paths above, you can see the practical effects of L1 and L2 regularization. As the penalty strength alpha increases, Ridge regression (L2) smoothly shrinks all coefficients toward zero, but rarely makes any coefficient exactly zero. In contrast, Lasso regression (L1) not only shrinks coefficients but also forces many of them to become exactly zero as alpha increases, resulting in a sparse solution. This sparsity means Lasso can automatically select important features by excluding uninformative ones, while Ridge tends to keep all features but with smaller weights. Understanding these differences helps you choose the right regularization method for your modeling goals—whether you want to keep all predictors with reduced influence or prefer a model that highlights only the most relevant features.
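One way to check this sparsity story on the synthetic data from the example above is to fit both models at a single, fairly large alpha and count how many coefficients land exactly at zero. A minimal sketch (the alpha value is just an illustrative choice):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Same synthetic setup as above: 10 features, only 3 informative
X, y, coef = make_regression(
    n_samples=100, n_features=10, n_informative=3,
    noise=10, coef=True, random_state=42
)

alpha = 10.0  # illustrative penalty strength

lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
ridge = Ridge(alpha=alpha).fit(X, y)

print("Informative features by construction:", np.sum(coef != 0))
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))  # expect only a few to survive
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))  # expect all 10, just shrunk

If Lasso keeps only a handful of coefficients while Ridge keeps all ten, that is exactly the feature-selection behavior described above.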