Machine Learning Foundations with Scikit-Learn

Modeling Summary


You now know how to build a model, use pipelines, and tune hyperparameters. You also learned two evaluation methods: train-test split and cross-validation. The next step is to combine evaluation and tuning using GridSearchCV or RandomizedSearchCV.

Note

Since our dataset is tiny, we will use GridSearchCV, but everything said below also applies to RandomizedSearchCV.

Since cross-validation is more stable than a single train-test split, the goal is to achieve the highest cross-validation score. GridSearchCV searches across hyperparameters and finds those that maximize this score. The best score is stored in .best_score_.
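As a minimal sketch of this idea, here is GridSearchCV tuning a KNN pipeline; the iris dataset and the hyperparameter grid are illustrative assumptions, not the course's penguins setup:

```python
# Sketch: GridSearchCV searches the grid with cross-validation and
# keeps the hyperparameters that maximize the mean CV score.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # illustrative dataset

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # hyperparameters with the highest CV score
print(grid.best_score_)   # the best mean cross-validation score
```

Note that `.best_score_` is the mean score across the CV folds for the winning configuration, not a score on unseen data.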

Note

Hyperparameters that work best for one dataset may not generalize when new data arrives. Thus, .best_score_ may be higher than the model’s performance on entirely unseen data.

A common workflow: split into training and test sets; run cross-validation on the training set to tune the model; then evaluate the optimized model on the test set to measure real-world performance.

To summarize:

  1. Preprocess the data;
  2. Split into training and test sets;
  3. Use cross-validation on the training set to find the best configuration;
  4. Evaluate on the test set.
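The four steps above can be sketched as follows; the dataset, estimator, and grid are illustrative assumptions:

```python
# Sketch of the workflow: split, tune with CV on the training set,
# then evaluate the tuned model on the held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # illustrative dataset

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 1. + 3. Preprocessing lives inside the pipeline, so the grid search
# cross-validates the whole chain on the training set only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

# 4. Evaluate the refit best model on the test set
test_score = grid.score(X_test, y_test)
print(test_score)
```

Because `GridSearchCV` refits the best configuration on the full training set by default, `grid.score` measures how the tuned model generalizes to data it never saw during tuning.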
Study More

The third step usually involves testing multiple algorithms and tuning their hyperparameters to identify the best option. For simplicity, only a single algorithm was used in this course.

Cross-validation is not always the best option. For large datasets, computing CV scores becomes expensive, while a train-test split becomes more stable thanks to the large test set.

Large datasets are often split into training, validation, and test sets. Hyperparameters are chosen based on validation set performance. Finally, the selected model is evaluated on the test set to verify how well it generalizes.
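A three-way split can be built with two calls to `train_test_split`; the 60/20/20 proportions and the synthetic data below are illustrative assumptions:

```python
# Sketch: carving a dataset into training, validation, and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)  # illustrative data
y = np.arange(1000) % 2

# First carve off the test set (20%)...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into training (60%) and validation (20%):
# 0.25 of the remaining 80% equals 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Hyperparameters are then compared by their scores on the validation set, and only the final chosen model is scored once on the test set.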

The penguins dataset is small, with only 342 instances. Because of this limited size, the cross-validation score will be used for evaluation in the next chapter.


Why is cross-validation particularly valuable for hyperparameter tuning in smaller datasets, as opposed to larger ones where train-test splits might be preferred?



Section 1. Chapter 31
