Linear Regression with Python


Overfitting

Consider the two regression lines below. Which one is better?

The metrics suggest the second model is better, so we use it to predict new instances, X_new = [0.2, 0.5, 2.7]. But once we compare the predictions with the actual values, the first model turns out to perform better.

This happens because the second model overfits — it is too complex and matches the training data too closely, failing to generalize to new instances.
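To see this in code, here is a minimal sketch on made-up synthetic data (all names and numbers below are illustrative, not from the lesson): an overly flexible degree-15 polynomial achieves a very low error on its training points, yet its predictions for new inputs can stray far from the true trend.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Illustrative synthetic data: a noisy linear trend y = 2x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(30, 1))
y = 2 * X.ravel() + rng.normal(scale=0.5, size=30)

# An overly flexible model: degree-15 polynomial regression
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X, y)

# Very low error on the training data...
print(mean_squared_error(y, model.predict(X)))

# ...but predictions for new inputs can deviate badly from the true trend 2x
X_new = np.array([[0.2], [0.5], [2.7]])
print(model.predict(X_new))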

Underfitting

Underfitting occurs when a model is too simple to fit even the training data, which also leads to poor predictions on unseen data.
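For contrast, here is a sketch of underfitting, again on made-up data: a straight line fitted to clearly curved data has a large error even on the very data it was trained on.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative curved data: y depends on x quadratically
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=50)

# A straight line is too simple for this relationship
model = LinearRegression().fit(X, y)

# The error is large even on the training data itself
print(mean_squared_error(y, model.predict(X)))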

In simple cases like these, we can try to tell visually whether a model underfits or overfits.

Since we cannot visualize high-dimensional models, we need another way to detect overfitting or underfitting.

Train-Test Split

To estimate performance on unseen data, we split the dataset into a training set and a test set with known targets.

We train on the training set and compute metrics on both the training and test sets to compare performance.

The split must be random. Typically, 20–30% goes to the test set, and 70–80% is used for training. Scikit-learn provides an easy way to do this.

For example, to split the dataset into 70% training / 30% test sets, you can use the following code:

from sklearn.model_selection import train_test_split # import the function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
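Putting the pieces together, here is a sketch of the full check (the data here is synthetic and for illustration only; in practice you would use your own X and y):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative synthetic data; replace with your own X and y
rng = np.random.default_rng(42)
X = rng.uniform(0, 3, size=(100, 1))
y = 2 * X.ravel() + rng.normal(scale=0.5, size=100)

# Hold out 30% of the samples for testing
# (random_state makes the random split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train only on the training set
model = LinearRegression().fit(X_train, y_train)

# Compare the error on seen vs. unseen data
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f'Train MSE: {train_mse:.3f}, Test MSE: {test_mse:.3f}')

As a rule of thumb: a test MSE much higher than the training MSE signals overfitting; high errors on both sets suggest underfitting; similar, low errors on both indicate a model that generalizes well.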

Based on the MSE scores below, determine whether each model overfits or underfits the training set (the dataset is the same for all three):

Model 1: training MSE = 0.2, test MSE = 0.215
Model 2: training MSE = 0.14, test MSE = 0.42
Model 3: training MSE = 0.5, test MSE = 0.47

