KNeighborsClassifier
The model chosen as the final estimator in the pipeline was KNeighborsClassifier. This chapter briefly explains how the algorithm operates.
How models work is not the main topic of this course, so it is fine if something seems unclear. The algorithm is covered in more detail in other courses, such as Linear Regression with Python or Classification with Python.
k-Nearest Neighbors
k-NN predicts the class of a new instance by looking at its k most similar training samples.
KNeighborsClassifier implements this in Scikit-learn.
- For a new point, find the k nearest neighbors using feature similarity.
- The most common class among them becomes the prediction.
k is a hyperparameter (default = 5). Different values change the model's behavior, so tuning k is important.
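To make the voting idea concrete, here is a minimal from-scratch sketch of the prediction step. The toy data and the predict_knn helper are hypothetical, for illustration only; scikit-learn's implementation is more sophisticated (it uses optimized neighbor-search structures, for example).

```python
import numpy as np
from collections import Counter

def predict_knn(X_train, y_train, new_point, k=5):
    # Distance from the new point to every stored training sample
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Positions of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # The most common class among those neighbors becomes the prediction
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two features, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9]])
y_train = np.array(['red', 'red', 'green', 'green', 'red'])
print(predict_knn(X_train, y_train, np.array([1.1, 1.5]), k=3))  # 'red'
```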
KNeighborsClassifier during .fit()
Unlike many algorithms, KNeighborsClassifier simply stores the training data.
Still, calling .fit(X, y) is required so the model knows which dataset to reference during prediction.
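As a quick illustration with hypothetical toy data: fitting stores the samples and records the classes, but no parameters are learned up front.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data, purely for illustration
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = ['red', 'red', 'green', 'green']

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)         # stores the data; no weights are computed
print(knn.classes_)               # ['green' 'red'] - classes seen during .fit()
print(knn.predict([[1.2, 1.5]]))  # neighbors are searched only now, at predict time
```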
KNeighborsClassifier during .predict()
During prediction, the classifier searches for each instance's k closest neighbors among the stored training samples.
In the gifs above, only two features, 'body_mass_g' and 'culmen_depth_mm', are used, because visualizing higher-dimensional plots is challenging. Including additional features will likely help the model separate the green and red data points better, enabling the KNeighborsClassifier to make more accurate predictions.
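If you want to inspect which training samples act as neighbors, a fitted classifier exposes the .kneighbors() method. A short sketch using the chapter's dataset (the query here is the first training instance, so its nearest neighbor is itself, at distance 0):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
X, y = df.drop('species', axis=1), df['species']

knn = KNeighborsClassifier().fit(X, y)
# Distances to, and row positions of, the 5 nearest training samples
# for the first instance
distances, indices = knn.kneighbors(X.iloc[[0]])
print(indices[0])                 # positions of the 5 closest samples
print(y.iloc[indices[0]].values)  # their classes - the majority wins the vote
```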
KNeighborsClassifier Coding Example
You can create a classifier, train it, and check its accuracy using .score().
The n_neighbors argument controls k; try both 5 and 1.
```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X, y)               # Trained 5 neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # Trained 1 neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X, y))
print('1 Neighbor score:', knn1.score(X, y))
```
Using k=1 may yield perfect accuracy, but this is misleading because evaluation was performed on the training set.
To measure true performance, always test the model on unseen data.
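A sketch of that evaluation using train_test_split (the 0.25 split and random_state=42 are arbitrary choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
X, y = df.drop('species', axis=1), df['species']

# Hold out a test set so the score reflects performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print('Train score:', knn1.score(X_train, y_train))  # 1.0 - each point is its own nearest neighbor
print('Test score:', knn1.score(X_test, y_test))     # a more honest estimate
```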