Overfitting | Choosing The Best Model
Linear Regression with Python

Overfitting

Consider the two regression lines below. Which one is better?

Well, the metrics tell us that the second is better, so let's use it to predict new values. We need predictions for X_new = [0.2, 0.5, 2.7]. But when we obtained the actual target values for the X_new we just predicted, it turned out that the first model's predictions were much more accurate.

That is because the second model overfits the training set. Overfitting occurs when a model is so complex that it fits the training data almost perfectly, yet fails to predict unseen instances well.
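To see this in numbers, here is a minimal sketch with made-up data (the dataset and the degree-9 polynomial are illustrative assumptions, not the models from the plots above): a polynomial with as many parameters as there are points can drive the training error near zero, which is exactly the "fits the training data perfectly" symptom.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 1-D dataset: 10 noisy points around a straight line
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 3, 10)).reshape(-1, 1)
y = 2 * X.ravel() + rng.normal(0, 0.5, 10)

# A simple model: plain linear regression
simple = LinearRegression().fit(X, y)
simple_mse = mean_squared_error(y, simple.predict(X))

# An overly complex model: degree-9 polynomial (10 parameters for 10 points)
complex_model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression()).fit(X, y)
complex_mse = mean_squared_error(y, complex_model.predict(X))

# The complex model's training MSE is far lower -- it memorizes the noise
print(simple_mse, complex_mse)
```

A low training error alone says nothing about how the model handles new points; the train-test split below is what reveals the difference.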

Underfitting

There is also the opposite concept, underfitting. It occurs when a model is too simple to fit even the training data well. In that case, its predictions for unseen instances are poor too.

So, for a simple model like this one, we can try to tell visually whether it underfits or overfits.

We already know that we cannot visualize Multiple Linear Regression with many features. Is there another way to tell whether the model overfits or underfits? It turns out there is.

Train-test split

We want to know how the model will perform on unseen instances. For that, we need unseen instances with known target values, and true target values are only available in the training set. The way to go is to split the training set into two parts: a training set and a test set.

Now we can build the model using the training set, calculate the metrics on the training set (seen instances), and then calculate the metrics on the test set (unseen instances).

It is essential to split the set randomly. Usually, around 20-30% of the data is used for the test set, and the remaining 70-80% stays as the training set. Scikit-learn provides a simple function, train_test_split(), for splitting a set randomly:

For example, to split the training set to 70% training/30% test, you can use the following code:
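The code referenced here did not survive extraction, so below is a minimal sketch using scikit-learn's train_test_split (the data X, y is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 noisy samples of a linear relationship
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 100)

# 70% training / 30% test; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit on the training set only, then evaluate on both sets
model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(train_mse, test_mse)
```

Comparable training and test MSE suggests the model generalizes; a test MSE much higher than the training MSE is the signature of overfitting.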


Based on the MSE scores below, determine whether each model overfits or underfits the training set (the dataset is the same for all).

Model 1: training set MSE = 0.2, test set MSE = 0.215
Model 2: training set MSE = 0.14, test set MSE = 0.42
Model 3: training set MSE = 0.5, test set MSE = 0.47


Section 4. Chapter 2