Cross-Validation

The train-test split has two drawbacks:

  1. Less training data, which may reduce model quality;
  2. Dependence on the random split, causing unstable results (see the sketch after this list).

To overcome these drawbacks, we use cross-validation.
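To see the second drawback in action, here is a minimal sketch; it uses scikit-learn's built-in iris dataset purely for illustration (an assumption, not this chapter's penguins data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in toy dataset, used here only to illustrate split instability
X_demo, y_demo = load_iris(return_X_y=True)

# The same model scored on different random splits gives different accuracies
for seed in [0, 1, 2]:
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=seed)
    acc = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"random_state={seed}: accuracy={acc:.3f}")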

First, divide the entire dataset into 5 equal parts, known as folds.

Next, use one fold as the test set and combine the remaining folds to form the training set.

As in any evaluation process, the training set is used to train the model, while the test set is used to measure its performance.

The process is repeated so that each fold serves as the test set once, while the remaining folds form the training set.
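To make the rotation concrete, here is a minimal sketch using scikit-learn's KFold on a made-up array of 10 samples; the printed indices show which samples land in each test fold. (KFold is not required for the approach below, where cross_val_score handles the splitting internally; the toy data and fold count are assumptions for illustration.)

import numpy as np
from sklearn.model_selection import KFold

# Made-up data: 10 samples with 2 features each, purely for illustration
X_toy = np.arange(20).reshape(10, 2)

# 5 folds, so each split holds out 2 samples as the test set
kf = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    print(f"Split {i}: train={train_idx}, test={test_idx}")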

Cross-validation produces multiple accuracy scores, one per split. Their mean represents the model's average performance. In Python, this is computed with cross_val_score().

Note

You can choose any number of folds. For example, using 10 folds means training on 9 parts and testing on 1. This is set via the cv parameter of cross_val_score(), which defaults to 5 folds (as in the code below).

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')

# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']

# Print the cross-val scores and the mean for KNeighborsClassifier with 5 neighbors
scores = cross_val_score(KNeighborsClassifier(), X, y)
print(scores)
print(scores.mean())
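Building on the block above, a brief sketch of the cv parameter mentioned in the note (reusing X, y, and the imports; the choice of 10 folds is arbitrary):

# Same evaluation with 10 folds instead of the default 5
scores_10 = cross_val_score(KNeighborsClassifier(), X, y, cv=10)
print(scores_10)
print(scores_10.mean())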

Cross-validation is more reliable but slower, since the model is trained and evaluated once per fold (n times for n folds). It is widely used in hyperparameter tuning, where cross-validation is repeated for each hyperparameter value, for example, testing multiple k values in k-NN. This helps choose the option that consistently performs best.
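As a minimal sketch of that tuning loop, reusing X, y, and the imports from the block above (the candidate k values are an arbitrary illustration):

# Cross-validate each candidate k and compare mean accuracies
for k in [1, 3, 5, 7, 9]:
    k_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y)
    print(f"k={k}: mean accuracy={k_scores.mean():.3f}")

scikit-learn's GridSearchCV automates exactly this kind of search.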


Section 4. Chapter 4

