ML Introduction with scikit-learn

Getting Familiar with Dataset

Let's start preprocessing by exploring the dataset. Throughout the course, we will use the penguin dataset. The task is to predict the species of a penguin.

There are three possible species, often referred to as classes in machine learning: Adelie, Chinstrap, and Gentoo.

The features are: 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.

The data is contained in the penguins.csv file. We will load this file from a link using the pd.read_csv() function and look at the contents:


import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

print(df.head(10))

Looking at this dataset, we can already find some issues we need to resolve. Those are:

Missing data;
Categorical variables;
Different scales.

Missing Data

Most ML algorithms can't handle missing values automatically, so we need to remove them (or replace them with some values, which is called imputing) before feeding the training set to a model.

pandas fills empty cells of the table with NaN. Most ML models will raise an error if at least one NaN exists in the data.
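As a quick check, missing values can be counted before deciding whether to remove or impute them. The sketch below uses a small made-up table (the `sample` DataFrame and its values are assumptions for illustration, not rows from penguins.csv); the same calls should work on the `df` loaded earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the penguin data (values are made up)
sample = pd.DataFrame({
    'culmen_depth_mm': [18.7, 17.4, np.nan, 19.3],
    'body_mass_g': [3750.0, 3800.0, np.nan, 3450.0],
    'sex': ['MALE', 'FEMALE', np.nan, 'FEMALE'],
})

# Count NaN values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count rows that contain at least one NaN
rows_with_missing = sample.isna().any(axis=1).sum()
print(rows_with_missing)
```

Counting per column shows which features are affected, while counting per row shows how much data would be lost if incomplete rows were simply dropped.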

Categorical Data

The dataset contains categorical columns, such as 'island' and 'sex', which, as we already know, most machine learning models can't handle directly.

So we need to encode categorical data as numbers.
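One way to see which columns will need encoding is to select the non-numeric ones and list their distinct values. This is a minimal sketch on a made-up table (the `sample` DataFrame and its values are assumptions for illustration); the same calls should apply to the real `df`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the penguin data (values are made up)
sample = pd.DataFrame({
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe'],
    'body_mass_g': [3750.0, 5000.0, 3500.0, 4800.0],
    'sex': ['MALE', 'FEMALE', 'FEMALE', 'MALE'],
})

# Columns whose values are not numeric are candidates for encoding
categorical_columns = sample.select_dtypes(exclude='number').columns.tolist()
print(categorical_columns)

# Show how often each category occurs in every categorical column
for column in categorical_columns:
    print(sample[column].value_counts())
```

Excluding numeric dtypes, rather than including a specific text dtype, keeps the selection independent of how a given pandas version stores strings.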

Different Scales

'culmen_depth_mm' values range from 13.1 to 21.5, while 'body_mass_g' values range from 2700 to 6300. Because of that, some ML models may consider the 'body_mass_g' feature much more important than 'culmen_depth_mm'.

Scaling solves this problem. It will be covered in later chapters.
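To make the difference in scales visible, the minimum and maximum of each numeric column can be compared. The sketch below builds a tiny table from the ranges quoted above (the middle row and the `sample` name are assumptions for illustration); on the real `df`, the same aggregation should report the actual ranges.

```python
import pandas as pd

# Hypothetical table: extremes taken from the text, middle row made up
sample = pd.DataFrame({
    'culmen_depth_mm': [13.1, 17.4, 21.5],
    'body_mass_g': [2700.0, 4200.0, 6300.0],
})

# Minimum and maximum of every numeric column
ranges = sample.agg(['min', 'max'])
print(ranges)

# Width of each column's range; the two differ by orders of magnitude
spread = ranges.loc['max'] - ranges.loc['min']
print(spread)
```

A spread of a few millimetres next to a spread of thousands of grams is what can lead some models to weight 'body_mass_g' far more heavily than 'culmen_depth_mm'.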
