Machine Learning Foundations with Scikit-Learn

Cross-Validation


The train-test split has two drawbacks:

  1. Less training data, which may reduce model quality;
  2. Dependence on the random split, causing unstable results.

To overcome this, we use cross-validation.
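The second drawback is easy to demonstrate. In the sketch below (which uses scikit-learn's built-in Iris dataset rather than the course's penguins file, so it runs without a download), the same model is evaluated on three different random splits; the reported accuracy may change with each seed:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The test accuracy depends on which rows land in the test set,
# so different random_state values can give different scores
for seed in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = KNeighborsClassifier().fit(X_train, y_train)
    print(f"seed={seed}: accuracy={model.score(X_test, y_test):.3f}")
```

Because a single split can be lucky or unlucky, none of these individual numbers is a trustworthy estimate on its own.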

First, divide the entire dataset into 5 equal parts, known as folds.

Next, use one fold as the test set and combine the remaining folds to form the training set.

As in any evaluation process, the training set is used to train the model, while the test set is used to measure its performance.

The process is repeated so that each fold serves as the test set once, while the remaining folds form the training set.
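The fold rotation described above can be sketched with scikit-learn's KFold splitter. Using a toy array of 5 samples and 5 folds, each sample serves as the test set exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(5, 2)  # 5 toy samples, 2 features each

kf = KFold(n_splits=5)
# Each iteration yields index arrays: one fold held out for testing,
# the remaining folds combined for training
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"Split {i}: train={train_idx}, test={test_idx}")
```

Note that cross_val_score() performs this splitting internally; you only need KFold directly when you want explicit control over the splits.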

Cross-validation produces multiple accuracy scores—one per split. Their mean represents the model’s average performance. In Python, this is computed with cross_val_score().

Note

You can choose any number of folds. For example, using 10 folds means training on 9 parts and testing on 1. This is set via the cv parameter in cross_val_score().

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Print the cross-val scores and the mean for KNeighborsClassifier with 5 neighbors
scores = cross_val_score(KNeighborsClassifier(), X, y)
print(scores)
print(scores.mean())
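To see the cv parameter from the note in action, here is a minimal sketch (again on the built-in Iris dataset, so it runs offline): passing cv=10 yields ten scores, one per fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# cv=10: each of the 10 folds serves as the test set once
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)
print(len(scores))       # one score per fold
print(scores.mean())     # average performance across folds
```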

Cross-validation is more reliable but slower, since the model is trained and evaluated once per fold. It is widely used in hyperparameter tuning, where cross-validation is repeated for each candidate hyperparameter value, such as multiple k values in k-NN. This helps choose the option that consistently performs best.
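A minimal tuning sketch, assuming the built-in Iris dataset: run 5-fold cross-validation for several candidate values of k and keep the one with the best mean accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross-validate each candidate k and record its mean accuracy
results = {}
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    results[k] = scores.mean()
    print(f"k={k}: mean accuracy={results[k]:.3f}")

best_k = max(results, key=results.get)
print(f"Best k: {best_k}")
```

In practice this search is often delegated to GridSearchCV, which wraps exactly this loop-plus-cross-validation pattern.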


Why may cross-validation be preferred to train-test split for evaluating the performance of a machine learning model?

Select the correct answer


Section 1. Chapter 26
