KNeighborsClassifier
The model chosen as the final estimator in the pipeline was KNeighborsClassifier. This chapter briefly explains how the algorithm operates.
How models work is not the main topic of this course, so it is fine if something seems unclear. The algorithm is covered in more detail in other courses, such as Linear Regression with Python or Classification with Python.
k-Nearest Neighbors
k-NN predicts the class of a new instance by looking at its k most similar training samples.
KNeighborsClassifier implements this in Scikit-learn.
- For a new point, find the k nearest neighbors using feature similarity.
- The most common class among them becomes the prediction.
k is a hyperparameter (default = 5). Different values change the model's behavior, so tuning k is important.
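To make the voting idea concrete, here is a minimal from-scratch sketch of the prediction step. The toy data and the predict_knn helper are hypothetical, for illustration only; scikit-learn's implementation is more sophisticated (it uses optimized neighbor-search structures, for example).

```python
import numpy as np
from collections import Counter

def predict_knn(X_train, y_train, new_point, k=5):
    # Distance from the new point to every stored training sample
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Positions of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # The most common class among those neighbors becomes the prediction
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two features, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9]])
y_train = np.array(['red', 'red', 'green', 'green', 'red'])
print(predict_knn(X_train, y_train, np.array([1.1, 1.5]), k=3))  # 'red'
```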
KNeighborsClassifier during .fit()
Unlike many algorithms, KNeighborsClassifier simply stores the training data.
Still, calling .fit(X, y) is required so the model knows which dataset to reference during prediction.
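As a quick illustration with hypothetical toy data: fitting stores the samples and records the classes, but no parameters are learned up front.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data, purely for illustration
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = ['red', 'red', 'green', 'green']

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)         # stores the data; no weights are computed
print(knn.classes_)               # ['green' 'red'] - classes seen during .fit()
print(knn.predict([[1.2, 1.5]]))  # neighbors are searched only now, at predict time
```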
KNeighborsClassifier during .predict()
During prediction, the classifier searches for each instance's k closest neighbors among the stored training samples.
In the gifs above, only two features, 'body_mass_g' and 'culmen_depth_mm', are used, because visualizing higher-dimensional plots is challenging. Including additional features will likely help the model separate the green and red data points better, enabling the KNeighborsClassifier to make more accurate predictions.
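If you want to inspect which training samples act as neighbors, a fitted classifier exposes the .kneighbors() method. A short sketch using the chapter's dataset (the query here is the first training instance, so its nearest neighbor is itself, at distance 0):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
X, y = df.drop('species', axis=1), df['species']

knn = KNeighborsClassifier().fit(X, y)
# Distances to, and row positions of, the 5 nearest training samples
# for the first instance
distances, indices = knn.kneighbors(X.iloc[[0]])
print(indices[0])                 # positions of the 5 closest samples
print(y.iloc[indices[0]].values)  # their classes - the majority wins the vote
```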
KNeighborsClassifier Coding Example
You can create a classifier, train it, and check its accuracy using .score().
The n_neighbors argument controls k; try both 5 and 1.
```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
# Assign X, y variables (X is already preprocessed and y is already encoded)
X, y = df.drop('species', axis=1), df['species']
# Initialize and train a model
knn5 = KNeighborsClassifier().fit(X, y)               # Trained 5 neighbors model
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # Trained 1 neighbor model
# Print the scores of both models
print('5 Neighbors score:', knn5.score(X, y))
print('1 Neighbor score:', knn1.score(X, y))
```
Using k=1 may yield perfect accuracy, but this is misleading because evaluation was performed on the training set.
To measure true performance, always test the model on unseen data.
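A sketch of that evaluation using train_test_split (the 0.25 split and random_state=42 are arbitrary choices):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_pipelined.csv')
X, y = df.drop('species', axis=1), df['species']

# Hold out a test set so the score reflects performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

knn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print('Train score:', knn1.score(X_train, y_train))  # 1.0 - each point is its own nearest neighbor
print('Test score:', knn1.score(X_test, y_test))     # a more honest estimate
```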