L1 vs L2 Regularization: Intuition and Effects
L1 and L2 regularization are two powerful techniques to control model complexity and prevent overfitting, especially in linear models. Both methods add a penalty term to the loss function, but they use different mathematical formulations that lead to distinct behaviors.
The L1 penalty, used in Lasso regression, adds the sum of the absolute values of the coefficients to the loss. For coefficients $w$, the Lasso penalty is $\lambda \sum_j |w_j|$, where $\lambda$ is a non-negative regularization strength.
This formulation can shrink some coefficients entirely to zero, effectively performing feature selection.
The loss function for Lasso regression is:

$$\text{Loss} = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} |w_j|$$
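To make the objective concrete, here is a minimal NumPy sketch (illustrative data and weight values, not part of the lesson's own code) that evaluates this penalized loss for a given weight vector. Note that scikit-learn's Lasso scales the squared-error term by 1/(2n), so its internal objective differs from the plain sum used here by a constant factor.

import numpy as np

def lasso_loss(X, y, w, lam):
    """Sum of squared residuals plus an L1 penalty on the weights."""
    residuals = y - X @ w
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))

# Tiny illustration: three features, the second one irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = X @ np.array([2.0, 0.0, -1.0])

dense_w = np.array([1.9, 0.3, -0.9])   # uses every feature
sparse_w = np.array([2.0, 0.0, -1.0])  # drops the irrelevant feature
print(lasso_loss(X, y, dense_w, lam=1.0))
print(lasso_loss(X, y, sparse_w, lam=1.0))  # smaller: exact fit and a smaller L1 penalty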
The L2 penalty, used in Ridge regression, adds the squared magnitude of the coefficients to the loss function. Mathematically, the Ridge penalty is $\lambda \sum_j w_j^2$. This encourages all coefficients to be small, but rarely drives them exactly to zero.
The loss function for Ridge regression becomes:

$$\text{Loss} = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} w_j^2$$
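Because the squared penalty is smooth, Ridge regression also has a closed-form solution, $w = (X^\top X + \lambda I)^{-1} X^\top y$. The sketch below is a minimal check of that formula against scikit-learn's Ridge; it uses synthetic data and assumes no intercept (fit_intercept=False) so that the two objectives match exactly.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 1.0

# Closed-form Ridge solution: solve (X^T X + lambda I) w = X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# scikit-learn minimizes ||y - Xw||^2 + alpha * ||w||^2 when fit_intercept=False
ridge = Ridge(alpha=lam, fit_intercept=False)
ridge.fit(X, y)

print(np.allclose(w_closed, ridge.coef_))  # expected: True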
The key difference is that L2 regularization tends to distribute shrinkage more evenly across all coefficients, while L1 regularization can produce sparse solutions by setting some coefficients exactly to zero.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# Generate synthetic data with 10 features, only 3 informative
X, y, coef = make_regression(
    n_samples=100,
    n_features=10,
    n_informative=3,
    noise=10,
    coef=True,
    random_state=42
)

# Sweep the regularization strength over a log-spaced grid
alphas = np.logspace(-2, 2, 100)
coefs_lasso = []
coefs_ridge = []

# Fit Lasso and Ridge at each alpha and record the coefficient vectors
for a in alphas:
    lasso = Lasso(alpha=a, max_iter=10000)
    lasso.fit(X, y)
    coefs_lasso.append(lasso.coef_)

    ridge = Ridge(alpha=a)
    ridge.fit(X, y)
    coefs_ridge.append(ridge.coef_)

# Plot the coefficient paths side by side
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
plt.plot(alphas, coefs_lasso)
plt.xscale("log")
plt.xlabel("alpha (L1 penalty strength)")
plt.ylabel("Coefficient value")
plt.title("Lasso Paths (L1)")
plt.axhline(0, color="black", linestyle="--", linewidth=1)
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(alphas, coefs_ridge)
plt.xscale("log")
plt.xlabel("alpha (L2 penalty strength)")
plt.ylabel("Coefficient value")
plt.title("Ridge Paths (L2)")
plt.axhline(0, color="black", linestyle="--", linewidth=1)
plt.grid(True)

plt.tight_layout()
plt.show()
Looking at the coefficient paths above, you can see the practical effects of L1 and L2 regularization. As the penalty strength alpha increases, Ridge regression (L2) smoothly shrinks all coefficients toward zero, but rarely makes any coefficient exactly zero. In contrast, Lasso regression (L1) not only shrinks coefficients but also forces many of them to become exactly zero as alpha increases, resulting in a sparse solution. This sparsity means Lasso can automatically select important features by excluding uninformative ones, while Ridge tends to keep all features but with smaller weights. Understanding these differences helps you choose the right regularization method for your modeling goals—whether you want to keep all predictors with reduced influence or prefer a model that highlights only the most relevant features.
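One way to check this sparsity story on the synthetic data from the example above is to fit both models at a single, fairly large alpha and count how many coefficients land exactly at zero. A minimal sketch (the alpha value is just an illustrative choice):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Same synthetic setup as above: 10 features, only 3 informative
X, y, coef = make_regression(
    n_samples=100, n_features=10, n_informative=3,
    noise=10, coef=True, random_state=42
)

alpha = 10.0  # illustrative penalty strength

lasso = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
ridge = Ridge(alpha=alpha).fit(X, y)

print("Informative features by construction:", np.sum(coef != 0))
print("Lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))  # expect only a few to survive
print("Ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))  # expect all 10, just shrunk

If Lasso keeps only a handful of coefficients while Ridge keeps all ten, that is exactly the feature-selection behavior described above.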