ML Introduction with scikit-learn

The Flaw of GridSearchCV

Before exploring the flaw of GridSearchCV, it is important to note that KNeighborsClassifier has several hyperparameters beyond n_neighbors. Two of them are weights and p.

Weights

KNeighborsClassifier predicts by finding the k nearest neighbors and assigning the most frequent class among them, regardless of how close each neighbor is.

An alternative is to weight neighbors by their distance, giving more influence to closer points. This is done with weights='distance'.

By default, the classifier uses weights='uniform', where all neighbors contribute equally.
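
For illustration, here is a minimal sketch comparing the two weighting schemes (the iris dataset is used purely as a stand-in; any X and y would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# weights='uniform' (default): every neighbor gets one equal vote
uniform_knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')

# weights='distance': closer neighbors get a larger say in the vote
distance_knn = KNeighborsClassifier(n_neighbors=5, weights='distance')

print(cross_val_score(uniform_knn, X, y).mean())
print(cross_val_score(distance_knn, X, y).mean())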

P

The p hyperparameter defines how distances are calculated:

  • p=1: Manhattan distance (sum of absolute differences between coordinates);
  • p=2: Euclidean distance (the straight-line distance, familiar from geometry).

The p parameter can take any positive integer. Each value defines a different distance, but values other than p=1 and p=2 are harder to visualize.
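
To make the two metrics concrete, here is a small sketch using scipy (the points (0, 0) and (3, 4) are arbitrary examples):

from scipy.spatial.distance import minkowski

a = [0, 0]
b = [3, 4]

# p=1, Manhattan: |3 - 0| + |4 - 0| = 7
print(minkowski(a, b, p=1))  # 7.0

# p=2, Euclidean: sqrt(3**2 + 4**2) = 5
print(minkowski(a, b, p=2))  # 5.0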

Note

Do not worry if the details of weights or p are unclear. They are introduced simply to show that there is more than one hyperparameter that can influence the model’s predictions. Treat them as examples of hyperparameters that can be tuned.

In the previous chapter, GridSearchCV was used to tune only n_neighbors. To search for the best combination of n_neighbors, weights, and p, the param_grid can be defined as:

param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'weights': ['distance', 'uniform'],
    'p': [1, 2]
}
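
Running the search could then look like the following sketch (the iris dataset is an assumption for illustration; in practice X and y are your own training data):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'weights': ['distance', 'uniform'],
    'p': [1, 2]
}

# Cross-validates every combination in param_grid and keeps the best one
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)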

GridSearchCV tries every possible combination to find the best one, so with this grid it evaluates 4 × 2 × 2 = 16 combinations.

Adding more values to each hyperparameter quickly inflates the search space. For example:

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 12, 15, 17, 20, 25],
    'weights': ['distance', 'uniform'],
    'p': [1, 2, 3, 4, 5]
}

This grid contains 10 × 2 × 5 = 100 hyperparameter combinations. With 5-fold cross-validation, the model is trained and evaluated 100 × 5 = 500 times.
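
The count can be verified with ParameterGrid, which enumerates combinations the same way GridSearchCV does:

from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 12, 15, 17, 20, 25],
    'weights': ['distance', 'uniform'],
    'p': [1, 2, 3, 4, 5]
}

# Every (n_neighbors, weights, p) combination: 10 * 2 * 5 = 100
print(len(ParameterGrid(param_grid)))  # 100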

For small datasets, this is manageable, but with larger datasets and more complex models, the process becomes very slow.

To handle such cases, RandomizedSearchCV is often preferred. It explores only a subset of all possible combinations, significantly reducing computation time while still providing strong results.
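
As a sketch of the same search with RandomizedSearchCV (n_iter=20 is an arbitrary budget, and the iris dataset is again only a stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 12, 15, 17, 20, 25],
    'weights': ['distance', 'uniform'],
    'p': [1, 2, 3, 4, 5]
}

# Samples only n_iter of the 100 combinations instead of trying all of them
random_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_grid,
    n_iter=20,        # 20 sampled combinations -> 20 * 5 = 100 fits, not 500
    random_state=42,  # fixed seed so the sampled subset is reproducible
)
random_search.fit(X, y)

print(random_search.best_params_)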

Question

The main problem of GridSearchCV is that it tries all possible combinations of the values specified in param_grid, which may take a lot of time. Is this statement correct?
