Feature Selection and Regularization Techniques

Overfitting and Model Complexity

Understanding how your model performs on new, unseen data is a core challenge in supervised learning. Two concepts that often come up are overfitting and underfitting. Overfitting happens when your model learns not only the underlying pattern in the training data but also the noise—meaning it performs very well on the training set but poorly on new data. Underfitting is the opposite: your model is too simple to capture the underlying structure, resulting in poor performance on both training and test data.

This leads to the bias–variance tradeoff. Bias refers to errors introduced by approximating a real-world problem with a simplified model. Variance is the error introduced by sensitivity to small fluctuations in the training set. A model with high bias pays little attention to the training data and oversimplifies the underlying relationship (underfitting). A model with high variance pays too much attention to the training data and does not generalize well (overfitting). Finding the right balance between bias and variance is crucial for building models that perform well on unseen data.
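To make variance concrete before the worked example below, here is a minimal sketch (using the same noisy-sine setup as that example; the number of repetitions and the query point are illustrative choices, not part of the original lesson). It refits a simple and a complex model on many fresh training samples and measures how much their predictions at one fixed input fluctuate:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
x_query = np.array([[0.25]])  # true value here is sin(2*pi*0.25) = 1.0

predictions = {1: [], 15: []}
for _ in range(200):
    # Draw a fresh noisy sample of the same underlying sine curve
    y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.15, size=20)
    for degree in (1, 15):
        poly = PolynomialFeatures(degree=degree)
        model = LinearRegression().fit(poly.fit_transform(X), y)
        predictions[degree].append(model.predict(poly.transform(x_query))[0])

for degree, preds in predictions.items():
    preds = np.array(preds)
    print(f"degree {degree:2d}: mean prediction = {preds.mean():.3f} (truth 1.0), "
          f"std across training sets = {preds.std():.3f}")

The degree 1 model is stable but systematically off at this point (high bias, low variance), while the degree 15 model is closer on average but swings widely from one training set to the next (low bias, high variance).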

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic data: a noisy sine curve, so a straight line genuinely underfits
np.random.seed(0)
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X) + np.random.normal(0, 0.15, size=X.shape)

# Reshape X for sklearn
X = X.reshape(-1, 1)

# Fit linear regression (degree 1)
poly1 = PolynomialFeatures(degree=1)
X_poly1 = poly1.fit_transform(X)
model1 = LinearRegression().fit(X_poly1, y)
y_pred1 = model1.predict(X_poly1)

# Fit polynomial regression (degree 15 - very complex)
poly15 = PolynomialFeatures(degree=15)
X_poly15 = poly15.fit_transform(X)
model15 = LinearRegression().fit(X_poly15, y)
y_pred15 = model15.predict(X_poly15)

# Plot results
plt.figure(figsize=(10, 5))
plt.scatter(X, y, color='black', label='Data')
plt.plot(X, y_pred1, color='blue', label='Degree 1 (Underfit)')
plt.plot(X, y_pred15, color='red', linestyle='--', label='Degree 15 (Overfit)')
plt.legend()
plt.title('Polynomial Regression: Underfitting vs Overfitting')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

When you increase the complexity of your model, such as by raising the polynomial degree in regression, you give the model more flexibility to fit the training data. In the code above, the degree 1 polynomial (a straight line) cannot capture the pattern in the data well, resulting in underfitting. The degree 15 polynomial, on the other hand, fits the training data almost perfectly—including its noise—leading to overfitting. This model will likely perform poorly on new data because it has learned patterns that do not generalize. The key is to choose a model that is complex enough to capture the underlying trend, but not so complex that it memorizes noise.
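Overfitting is easy to miss if you only look at training error. As a minimal follow-on sketch (the 50/50 split and the particular set of degrees are illustrative choices, not part of the example above), holding out half the data exposes it:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + np.random.normal(0, 0.15, size=40)

# Hold out half the data to estimate performance on unseen points
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_mse = mean_squared_error(y_train, model.predict(poly.transform(X_train)))
    test_mse = mean_squared_error(y_test, model.predict(poly.transform(X_test)))
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")

The characteristic pattern is training error that keeps falling as the degree grows, while test error falls and then rises again; a large gap between the two is the practical signature of overfitting.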

This is why controlling model complexity is so important for generalization. You want your model to perform well on both the training data and unseen data. As you saw in the previous example, too simple a model leads to high bias and underfitting, while too complex a model leads to high variance and overfitting.

Definition

Regularization is a set of techniques used to control model complexity by adding a penalty to large parameter values in a model. By discouraging overly complex models, regularization helps prevent overfitting and improves the model's ability to generalize to new data.
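As a minimal sketch of this idea, assuming scikit-learn's Ridge estimator (L2 regularization, where alpha sets the penalty strength), you can fit the same degree-15 model with and without a penalty and compare the size of the learned coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + np.random.normal(0, 0.15, size=20)

# Same degree-15 features in both pipelines; the only difference is the L2 penalty
plain = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(),
                      LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(),
                      Ridge(alpha=1.0)).fit(X, y)

print("max |coefficient|, no penalty:   ",
      np.abs(plain.named_steps['linearregression'].coef_).max())
print("max |coefficient|, Ridge alpha=1:",
      np.abs(ridge.named_steps['ridge'].coef_).max())

Shrinking the coefficients forces the degree-15 curve to be smoother and less sensitive to noise in the training set; choosing alpha is how you tune the bias–variance balance.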


