Cross-Validation | Modeling
ML Introduction with scikit-learn

Cross-Validation

In the previous chapter, we evaluated the model using a train-test split. This approach has two downsides:

  1. We use only part of the dataset for training.
    Generally, the more data the model has to learn from, the better it tends to perform.
  2. The result can depend strongly on the split.
    As you saw in the previous chapter, the dataset is split randomly, so running the code several times can give noticeably different results.
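The second downside can be seen with a small sketch. It is only an illustration under assumptions: it uses the iris dataset bundled with scikit-learn as a stand-in for the course dataset, and the test size of 0.33 and the seeds 0 to 4 are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris ships with scikit-learn
X, y = load_iris(return_X_y=True)

split_scores = []
for seed in range(5):
    # A different random_state produces a different random split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed
    )
    model = KNeighborsClassifier().fit(X_train, y_train)
    # Accuracy on the held-out part for this particular split
    split_scores.append(model.score(X_test, y_test))

print(split_scores)
```

The same model and the same data are scored once per split, so any differences between the printed values come from the split alone.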

A different approach to evaluating a model, called cross-validation, helps with both problems.
Let's see how it works.

First, we split the whole dataset into 5 equal parts, called folds.

Then we take one fold as the test set and the remaining four folds as the training set.

As always, we train the model on the training set and evaluate it on the test set.

We then repeat the process so that each fold serves as the test set exactly once.

As a result, we get 5 accuracy scores, one for each split.
We can then take the mean of those 5 scores as a measure of the model's average performance.
To calculate the cross-validation score in Python, we can use the cross_val_score() function from the sklearn.model_selection module.
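To make the fold-by-fold procedure concrete, here is a minimal sketch that performs it by hand with KFold. It is an illustration rather than what cross_val_score() does internally: the iris dataset is a stand-in (assumption), and the shuffling and random_state=0 are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Split the row indices into 5 folds; shuffle because iris is sorted by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train on four folds, evaluate on the remaining one
    model = KNeighborsClassifier()
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
print(fold_scores)         # one accuracy score per fold
print(fold_scores.mean())  # average performance across the folds
```

For classifiers, cross_val_score() uses stratified folds by default, so its scores may differ slightly from this plain KFold version.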

Note

Although the example is shown with 5 folds, you can use a different number of folds for cross-validation. For example, with 10 folds, each round uses 9 folds for training and 1 for testing. The number of folds is controlled by the cv argument of cross_val_score().
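As a quick sketch of the cv argument, the snippet below asks for 10 folds. The iris dataset is again a stand-in (assumption), and the variable name scores_10 is chosen only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris ships with scikit-learn
X, y = load_iris(return_X_y=True)

# cv=10 requests 10 folds, so the model is trained and evaluated 10 times
scores_10 = cross_val_score(KNeighborsClassifier(), X, y, cv=10)

print(len(scores_10))    # number of scores equals the number of folds
print(scores_10.mean())
```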

Here is an example that calls cross_val_score() with its default settings:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Print the cross-val scores and the mean for KNeighborsClassifier with 5 neighbors
scores = cross_val_score(KNeighborsClassifier(), X, y)
print(scores)
print(scores.mean())

Cross-validation gives more stable and reliable results than a single train-test split, but it is significantly slower: the model has to be trained and evaluated 5 times (or n times if you use n folds), whereas a train-test split does this once.
As you will soon see, cross-validation is commonly used to find the best hyperparameters (e.g., the best number of neighbors).
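As a rough preview of that idea, the sketch below compares a few values for the number of neighbors by their mean cross-validation score. It is a hand-rolled loop, not necessarily the approach the following chapters use. The iris dataset is a stand-in (assumption), and the list candidate_ks is a hypothetical set of values picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset (assumption): iris ships with scikit-learn
X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9]  # hypothetical candidate values
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this number of neighbors
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Pick the candidate with the highest mean cross-validation score
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_k)
```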

Section 4. Chapter 4