Feature Selection and Regularization Techniques

L1 vs L2 Regularization: Intuition and Effects

L1 and L2 regularization are two powerful techniques to control model complexity and prevent overfitting, especially in linear models. Both methods add a penalty term to the loss function, but they use different mathematical formulations that lead to distinct behaviors.

The L1 penalty, used in Lasso regression, adds the absolute values of the coefficients to the loss: the Lasso penalty is \lambda \sum_i |w_i|. This formulation can shrink some coefficients entirely to zero, effectively performing feature selection. The loss function for Lasso regression is:

\text{Loss} = \text{Residual Sum of Squares} + \lambda \sum_i |w_i|

The L2 penalty, used in Ridge regression, adds the squared magnitudes of the coefficients to the loss function. Mathematically, for parameters w, the Ridge penalty is \lambda \sum_i w_i^2, where \lambda is a non-negative regularization strength. This encourages all coefficients to be small, but rarely drives them exactly to zero. The loss function for Ridge regression becomes:

\text{Loss} = \text{Residual Sum of Squares} + \lambda \sum_i w_i^2
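To make the two penalty terms concrete, here is a minimal sketch that computes both regularized losses by hand; the residuals, weight vector w, and strength lam are illustrative values, not data from this chapter:

import numpy as np

# Toy residuals and coefficient vector (illustrative values only)
residuals = np.array([0.5, -1.2, 0.3, 0.8])
w = np.array([2.0, 0.0, -0.5])
lam = 0.1  # regularization strength (lambda)

rss = np.sum(residuals ** 2)                 # Residual Sum of Squares
lasso_loss = rss + lam * np.sum(np.abs(w))   # L1 penalty: lambda * sum |w_i|
ridge_loss = rss + lam * np.sum(w ** 2)      # L2 penalty: lambda * sum w_i^2

print(f"RSS: {rss:.3f}")
print(f"Lasso loss: {lasso_loss:.3f}")
print(f"Ridge loss: {ridge_loss:.3f}")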

The key difference is that L2 regularization tends to distribute shrinkage more evenly across all coefficients, while L1 regularization can produce sparse solutions by setting some coefficients exactly to zero.
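A quick way to see this difference numerically is to fit both models at a single, illustrative penalty strength and count how many coefficients land exactly at zero. The sketch below assumes the same kind of synthetic data as the coefficient-path example that follows; typically several Lasso coefficients are exactly zero, while the Ridge coefficients are merely small:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=42)

alpha = 1.0  # illustrative penalty strength, not a tuned value
lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
ridge = Ridge(alpha=alpha).fit(X, y)

print("Lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))
print("Ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))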

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# Generate synthetic data with 10 features, only 3 informative
X, y, coef = make_regression(
    n_samples=100, n_features=10, n_informative=3,
    noise=10, coef=True, random_state=42
)

alphas = np.logspace(-2, 2, 100)
coefs_lasso = []
coefs_ridge = []

for a in alphas:
    lasso = Lasso(alpha=a, max_iter=10000)
    lasso.fit(X, y)
    coefs_lasso.append(lasso.coef_)

    ridge = Ridge(alpha=a)
    ridge.fit(X, y)
    coefs_ridge.append(ridge.coef_)

plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.plot(alphas, coefs_lasso)
plt.xscale("log")
plt.xlabel("alpha (L1 penalty strength)")
plt.ylabel("Coefficient value")
plt.title("Lasso Paths (L1)")
plt.axhline(0, color="black", linestyle="--", linewidth=1)
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(alphas, coefs_ridge)
plt.xscale("log")
plt.xlabel("alpha (L2 penalty strength)")
plt.ylabel("Coefficient value")
plt.title("Ridge Paths (L2)")
plt.axhline(0, color="black", linestyle="--", linewidth=1)
plt.grid(True)

plt.tight_layout()
plt.show()

Looking at the coefficient paths above, you can see the practical effects of L1 and L2 regularization. As the penalty strength alpha increases, Ridge regression (L2) smoothly shrinks all coefficients toward zero, but rarely makes any coefficient exactly zero. In contrast, Lasso regression (L1) not only shrinks coefficients but also forces many of them to become exactly zero as alpha increases, resulting in a sparse solution. This sparsity means Lasso can automatically select important features by excluding uninformative ones, while Ridge tends to keep all features but with smaller weights. Understanding these differences helps you choose the right regularization method for your modeling goals—whether you want to keep all predictors with reduced influence or prefer a model that highlights only the most relevant features.
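In practice you would usually let cross-validation choose alpha rather than reading it off a plot. A minimal sketch using scikit-learn's LassoCV and RidgeCV is shown below; the alpha grid and cv=5 are illustrative choices, not values prescribed by this lesson:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Same style of synthetic data as above
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=42)

alphas = np.logspace(-2, 2, 100)

# Cross-validated selection of the regularization strength
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X, y)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("Best alpha (Lasso):", lasso_cv.alpha_)
print("Best alpha (Ridge):", ridge_cv.alpha_)
print("Non-zero Lasso coefficients:", np.sum(lasso_cv.coef_ != 0))

Both estimators refit on the full data at the selected alpha, so the final coefficients (and, for Lasso, the resulting sparsity pattern) are available directly from the fitted objects.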



