The Flaw of GridSearchCV

Before we discuss the flaw of GridSearchCV, it should be mentioned that KNeighborsClassifier has more than one hyperparameter to tweak.
So far, we have only used n_neighbors. Let's briefly discuss two other hyperparameters: weights and p.

weights

As you probably remember, KNeighborsClassifier works by finding the k nearest neighbors.
It then assigns the most frequent class among those neighbors, regardless of how close each one is.
Another approach is to also take each neighbor's distance into account, so that the classes of closer neighbors carry more weight.
This is done by setting weights='distance'.
By default, the first approach is used, which corresponds to weights='uniform'.
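
For example, switching to distance-weighted voting only requires changing this one argument. Here is a minimal sketch with a made-up toy dataset (the data values are purely illustrative):

    from sklearn.neighbors import KNeighborsClassifier

    # Tiny illustrative dataset: two features, two classes
    X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
    y = [0, 0, 0, 1, 1, 1]

    # weights='distance': closer neighbors get a larger say in the vote
    knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
    knn.fit(X, y)
    print(knn.predict([[1.5, 1.5]]))  # -> [0]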

p

There are also different ways to calculate the distance, and the p hyperparameter controls which one is used.
Let's illustrate how the distance is calculated for p=1 and p=2 (see the sketch below).
p=1 gives the Manhattan distance.
p=2 gives the Euclidean distance that you learned in school.
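
Here is a quick sketch of how these two distances are computed for a pair of points (the points themselves are made up for illustration):

    import numpy as np

    a = np.array([1, 2])
    b = np.array([4, 6])

    # p=1: Manhattan distance, the sum of absolute coordinate differences
    print(np.sum(np.abs(a - b)))          # |1-4| + |2-6| = 7

    # p=2: Euclidean distance, the straight-line distance between the points
    print(np.sqrt(np.sum((a - b) ** 2)))  # sqrt(3**2 + 4**2) = 5.0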

The p parameter can take any positive integer, so there are many other distances, but they are harder to visualize than p=1 or p=2.

Note

Don't worry if you don't fully understand what weights and p are for or how they work.
They are explained briefly just to show that there is more than one hyperparameter that can change the model's predictions.
You can think of them simply as hyperparameters we can tune.

In the last chapter, we used GridSearchCV to find the best value of n_neighbors.
What if we wanted to find the best combination of n_neighbors, weights, and p? Well, the param_grid would look like this:
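
The exact grid from this course page was not preserved, so the values below are illustrative:

    param_grid = {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance'],
        'p': [1, 2]
    }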

GridSearchCV tries all the possible combinations to find the best one.
So it will try all of the following:
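
For the illustrative grid above, that is 3 * 2 * 2 = 12 combinations, which we can enumerate explicitly:

    from itertools import product

    # Every combination GridSearchCV would evaluate for the grid above
    for n, w, p in product([3, 5, 7], ['uniform', 'distance'], [1, 2]):
        print({'n_neighbors': n, 'weights': w, 'p': p})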

That's already a lot of work. But what if we want to try more values? For example:
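
Again, the exact values are not preserved here; this illustrative grid is sized so that the totals below work out:

    param_grid = {
        'n_neighbors': list(range(1, 26)),   # 25 values
        'weights': ['uniform', 'distance'],  # 2 values
        'p': [1, 2]                          # 2 values
    }
    # 25 * 2 * 2 = 100 combinations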

Now there are 100 combinations. And remember that the model needs to be trained and evaluated 5 times to get its cross-validation score,
so this process is done 500 times in total.
That is not a problem for our tiny dataset, but datasets are usually much larger, and training may take a lot of time.
In that case, running the process 500 times is painfully slow. That's why RandomizedSearchCV is used more often for larger datasets.
The next chapter will explain what it is and give you some practice!

The main problem of GridSearchCV is that it tries all possible combinations (of what's specified in param_grid), which may take a lot of time. Is this statement correct?

Select the correct answer
