Getting Familiar with Dataset | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Getting Familiar with the Dataset

Let's begin our preprocessing journey by exploring the dataset. Throughout the course, we will use the Penguin dataset. The task is to predict the species of a penguin.

There are three possible options: Adelie, Chinstrap, and Gentoo.

And the features are: 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.
The data is contained in a .csv file, penguins.csv.

We will load this file from a link using the pd.read_csv() function. Let's load it and look at the contents:
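A minimal sketch of this step is shown below. The course link is not reproduced here, so the snippet reads a few made-up rows from an in-memory string; with the real data you would pass the file's URL or path to pd.read_csv() instead. The 'species' column name and all values are assumptions for illustration, not real measurements.

```python
import io

import pandas as pd

# Stand-in for penguins.csv: a few made-up rows, NOT real measurements.
# With the real file, pass its URL or path to pd.read_csv() instead.
csv_text = """species,island,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,18.7,181,3750,MALE
Adelie,Torgersen,,,,
Gentoo,Biscoe,13.2,211,4500,FEMALE
Chinstrap,Dream,17.9,192,3500,FEMALE
"""

# read_csv accepts a path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Show the first rows; empty cells are displayed as NaN.
print(df.head())
```

The same call works unchanged with a URL string in place of the io.StringIO object.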

Looking at this training set, we can already spot several issues that need to be resolved:

  • Missing data;
  • Categorical variables;
  • Different scales.

Missing data

Most ML algorithms can't handle missing values automatically, so before feeding the training set to a model we need to either remove them or replace them with substitute values (this is called imputing).
pandas represents empty cells of the table as NaN.
Most ML models will raise an error if the data contains even one NaN.
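As a rough illustration of both options, here is a small pandas sketch. The rows are made up for the example, and mean imputation is just one possible choice, not necessarily the one used later in the course.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from penguins.csv).
df = pd.DataFrame({
    "culmen_depth_mm": [18.7, np.nan, 13.2],
    "body_mass_g": [3750.0, np.nan, 4500.0],
    "sex": ["MALE", None, "FEMALE"],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2 (imputing): replace NaN in a numeric column with the column mean.
imputed = df.copy()
imputed["body_mass_g"] = imputed["body_mass_g"].fillna(imputed["body_mass_g"].mean())
print(imputed)
```

Dropping rows is simplest but discards data, while imputing keeps every row at the cost of inserting estimated values.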

Categorical data

The dataset contains categorical features, which, as we already know, most Machine Learning models can't handle directly.

So we need to encode the categorical data as numbers.
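One way to picture encoding is the pandas sketch below. It uses pd.get_dummies() for one-hot encoding and a plain mapping for a two-category column. The rows are made up, and the course may well use different tools (such as scikit-learn encoders) for this step.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# One-hot encoding: one 0/1 column per category of 'island'.
island_encoded = pd.get_dummies(df["island"], prefix="island", dtype=int)

# A two-category column can be mapped to a single 0/1 column.
sex_encoded = df["sex"].map({"FEMALE": 0, "MALE": 1})

# Combine the encoded columns into one fully numeric table.
encoded = pd.concat([island_encoded, sex_encoded], axis=1)
print(encoded)
```

After this step every column is numeric, so the table can be passed to a model that expects numbers only.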

Different scales

'culmen_depth_mm' values range from 13.1 to 21.5, while 'body_mass_g' values range from 2700 to 6300.
Because of that, some ML models may treat the 'body_mass_g' feature as much more important than 'culmen_depth_mm', simply because its numbers are larger.
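To see why, consider a model that compares samples by Euclidean distance (k-nearest neighbors is one example). The sketch below uses two hypothetical penguins whose values are made up but lie inside the ranges quoted above: their culmen depths sit at opposite ends of the range, while their body masses differ by only 100 g.

```python
import numpy as np

# Two hypothetical penguins: [culmen_depth_mm, body_mass_g].
# The values are made up but lie inside the ranges quoted above.
a = np.array([13.1, 4000.0])
b = np.array([21.5, 4100.0])

# Per-feature contribution to the squared Euclidean distance.
squared_diff = (a - b) ** 2
distance = np.sqrt(squared_diff.sum())

# Share of the squared distance that comes from each feature.
share = squared_diff / squared_diff.sum()
print(distance, share)
```

Even though the culmen depths differ as much as they possibly can, more than 99% of the squared distance comes from 'body_mass_g'.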

Scaling solves this problem. It will be covered in later chapters.


Match the problem with a way to solve it.

Missing values – removing or imputing
Categorical data – encoding
Different scales – scaling
