The Flaw of GridSearchCV
Before exploring GridSearchCV, it is important to note that KNeighborsClassifier has several hyperparameters beyond n_neighbors. Two of them are weights and p.
Weights
KNeighborsClassifier predicts by finding the k nearest neighbors and assigning the most frequent class among them, regardless of how close each neighbor is.
An alternative is to weight neighbors by their distance, giving more influence to closer points. This is done with weights='distance'.
By default, the classifier uses weights='uniform', where all neighbors contribute equally.
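The effect of the two weighting schemes can be sketched in plain Python. This is a minimal illustration of the voting step only, and the neighbor distances and labels below are made up for the example:

```python
from collections import Counter

def knn_vote(neighbors, weights="uniform"):
    """neighbors: list of (distance, label) pairs for the k nearest points.
    Returns the predicted label under the given weighting scheme."""
    scores = Counter()
    for dist, label in neighbors:
        if weights == "uniform":
            scores[label] += 1            # every neighbor counts equally
        else:  # weights == "distance"
            scores[label] += 1.0 / dist   # closer neighbors count more
    return scores.most_common(1)[0][0]

# Three nearest neighbors of some query point (hypothetical data):
neighbors = [(0.5, "A"), (2.0, "B"), (2.5, "B")]

print(knn_vote(neighbors, weights="uniform"))   # "B": two votes beat one
print(knn_vote(neighbors, weights="distance"))  # "A": 1/0.5 = 2.0 outweighs 0.5 + 0.4
```

Note how the same three neighbors produce different predictions: the single very close "A" point wins once distance weighting is applied.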
P
The p hyperparameter defines how distances are calculated:

p=1: Manhattan distance (the sum of absolute differences between coordinates);
p=2: Euclidean distance (the straight-line distance familiar from geometry).

The p parameter can take any positive integer, so many other distances are possible, but they are harder to visualize than p=1 or p=2.
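Both distances are special cases of the Minkowski distance, which a short sketch can compute directly (the example points are hypothetical):

```python
def minkowski(a, b, p):
    """Minkowski distance between points a and b for a given p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, p=1))  # 7.0 -- Manhattan: |3| + |4|
print(minkowski(a, b, p=2))  # 5.0 -- Euclidean: sqrt(9 + 16)
```

Larger values of p follow the same formula; they simply change how much large coordinate differences dominate the total.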
Do not worry if the details of weights or p are unclear. They are introduced simply to show that there is more than one hyperparameter that can influence the model's predictions. Treat them as examples of hyperparameters that can be tuned.
In the previous chapter, GridSearchCV was used to tune only n_neighbors. To search for the best combination of n_neighbors, weights, and p, the param_grid can be defined as:
param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'weights': ['distance', 'uniform'],
    'p': [1, 2]
}
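A quick way to see how many candidates this grid produces is to enumerate its Cartesian product in plain Python (this only counts candidates; it does not run GridSearchCV itself):

```python
from itertools import product

param_grid = {
    'n_neighbors': [1, 3, 5, 7],
    'weights': ['distance', 'uniform'],
    'p': [1, 2],
}

# GridSearchCV enumerates the Cartesian product of all value lists:
combinations = list(product(*param_grid.values()))
print(len(combinations))  # 4 * 2 * 2 = 16 candidates
print(combinations[0])    # (1, 'distance', 1)
```

Every one of these 16 candidates is cross-validated, which is what makes the search exhaustive.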
GridSearchCV tries every possible combination to find the best one, so here it evaluates 4 × 2 × 2 = 16 candidates. Expanding the value lists significantly increases the search space. For example:
param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 12, 15, 17, 20, 25],
    'weights': ['distance', 'uniform'],
    'p': [1, 2, 3, 4, 5]
}
With 100 hyperparameter combinations and 5-fold cross-validation, the model is trained and evaluated 500 times.
For small datasets, this is manageable, but with larger datasets and more complex models, the process becomes very slow.
To handle such cases, RandomizedSearchCV is often preferred. It explores only a subset of all possible combinations, significantly reducing computation time while still providing strong results.
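The sampling idea can be sketched in plain Python: build the full 100-combination grid, then draw a random subset, which is roughly what RandomizedSearchCV's n_iter parameter controls (this sketch does not call scikit-learn):

```python
import random
from itertools import product

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 12, 15, 17, 20, 25],
    'weights': ['distance', 'uniform'],
    'p': [1, 2, 3, 4, 5],
}

all_candidates = list(product(*param_grid.values()))
print(len(all_candidates))  # 100 combinations in the full grid

# RandomizedSearchCV with n_iter=10 would evaluate only a random sample:
random.seed(0)  # fixed seed so the sketch is reproducible
sample = random.sample(all_candidates, k=10)
print(len(sample))  # only 10 of the 100 are cross-validated
```

With 5-fold cross-validation, this cuts the number of model fits from 500 down to 50, at the cost of possibly missing the single best combination.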