ML Introduction with scikit-learn
Getting Familiar with Dataset
Let's begin our preprocessing journey by exploring the dataset. Throughout the course, we will use the Penguin dataset. The task is to predict the species of a penguin, and there are three possible species.
The features are 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.
The data is contained in a .csv file, penguins.csv.
We will load this file from a link using the pd.read_csv() function. Let's load it and look at the contents:
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
print(df.head(10))
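Once the data is loaded, the target is usually separated from the features. The following is a minimal sketch of that step. It uses a small hand-made DataFrame instead of the real file so that it runs without a network connection. The row values are made up for illustration, and the target column name 'species' is an assumption about the file's layout.

```python
import pandas as pd

# Toy stand-in for penguins.csv. The values are made up for illustration,
# and the target column name 'species' is an assumption.
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap', 'Adelie'],
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Dream'],
    'culmen_depth_mm': [18.7, 14.3, 17.9, None],
    'flipper_length_mm': [181.0, 215.0, 195.0, 190.0],
    'body_mass_g': [3750.0, 5000.0, 3650.0, 3400.0],
    'sex': ['MALE', 'FEMALE', None, 'FEMALE'],
})

y = df['species']               # target: what we want to predict
X = df.drop(columns='species')  # features: everything else

print(X.dtypes)          # numeric columns are float64, text columns are object
print(y.value_counts())  # number of rows per species
```

Printing the dtypes is a quick way to see which columns are numeric and which hold text.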
Looking at this training set, we can already spot some issues that need to be resolved:
- Missing data;
- Categorical variables;
- Different scales.
Missing data
Most ML algorithms can't handle missing values automatically, so we need to remove them (or replace them with some values, which is called imputing) before feeding the training set to a model.
pandas fills empty cells of the table with NaN.
Most ML models will raise an error if at least one NaN exists in the data.
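As a rough illustration of the two options mentioned above (removing or imputing), here is a pandas-only sketch. The DataFrame is a made-up toy example, not rows from penguins.csv, and imputing with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Toy data with two missing cells (values are made up for illustration).
df = pd.DataFrame({
    'culmen_depth_mm': [18.7, np.nan, 17.9, 14.3],
    'body_mass_g': [3750.0, 3400.0, np.nan, 5000.0],
})

# Count the NaN values in each column.
nan_counts = df.isna().sum()
print(nan_counts)

# Option 1: remove every row that contains at least one NaN.
dropped = df.dropna()

# Option 2: impute, here by replacing each NaN with its column mean.
imputed = df.fillna(df.mean())

print(dropped)
print(imputed)
```

Dropping rows is simpler but discards data. Imputing keeps every row at the cost of introducing estimated values.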
Categorical data
The data contains categorical features, such as 'island' and 'sex', which, as we already know, most Machine Learning models can't handle directly.
So we need to encode the categorical data as numbers.
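One common way to turn categories into numbers is one-hot encoding, where each category becomes its own 0/1 column. The sketch below uses pandas' get_dummies on a made-up toy DataFrame. It is only an illustration of the idea, not necessarily the encoder this course uses later.

```python
import pandas as pd

# Toy categorical data (values are made up for illustration).
df = pd.DataFrame({
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe'],
    'sex': ['MALE', 'FEMALE', 'FEMALE', 'MALE'],
})

# One-hot encode: each category becomes a separate 0/1 column.
encoded = pd.get_dummies(df, columns=['island', 'sex'], dtype=int)
print(encoded)
```

Each row now has exactly one 1 among the island columns and exactly one 1 among the sex columns.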
Different scales
'culmen_depth_mm' values range from 13.1 to 21.5 while 'body_mass_g' values range from 2700 to 6300.
Because of that, some ML models, such as those based on distances between data points, may treat the 'body_mass_g' feature as much more important than 'culmen_depth_mm' simply because its numbers are larger.
Scaling solves this problem. It will be covered in later chapters.
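To give a feel for what scaling does, here is a minimal sketch of standardization (subtract the column mean, divide by the column standard deviation) written with plain pandas. The toy values are made up, except that the minimum and maximum match the ranges quoted above. scikit-learn's StandardScaler applies the same idea.

```python
import pandas as pd

# Toy data: min and max match the ranges quoted in the text,
# the middle values are made up for illustration.
df = pd.DataFrame({
    'culmen_depth_mm': [13.1, 17.0, 18.5, 21.5],
    'body_mass_g': [2700.0, 3800.0, 4500.0, 6300.0],
})

# Standardize each column: zero mean, unit (population) standard deviation.
scaled = (df - df.mean()) / df.std(ddof=0)

print(df.max() - df.min())          # original ranges differ by orders of magnitude
print(scaled.max() - scaled.min())  # after scaling, the ranges are comparable
```

After this transformation both features are on a similar scale, so neither dominates only because of its units.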