Dealing with Missing Values | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Dealing with Missing Values

Only a few machine learning models can handle data with missing values, so we need to make sure our dataset does not contain any. If it does, we can:

  • Remove the rows containing missing values;
  • Fill the empty cells with substitute values, which is called imputing.

To check if your dataset has missing values, you can use the .info() method of a DataFrame.

Our data contains 344 entries, and the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' have fewer than 344 non-null values, so these columns contain missing values.
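As a rough illustration of this check, the sketch below builds a small made-up DataFrame (a hypothetical stand-in, not the course's penguins file) and inspects it with .info(). The isna().sum() call is an additional, commonly used way to get the missing count per column directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset (values are made up).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "body_mass_g": [3750.0, np.nan, 5000.0, np.nan],
    "sex": ["MALE", np.nan, "FEMALE", np.nan],
})

# Prints the number of entries and the non-null count of every column.
df.info()

# Counts the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Any column whose non-null count is lower than the number of entries has missing values.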

Note

Null is another name for missing values.

Let's look at the rows that contain missing values.
We can print them with df[df.isna().any(axis=1)], which keeps every row that has at least one missing value.
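Here is a minimal sketch of that filter on a made-up DataFrame (the data is hypothetical and only mimics the penguins columns). df.isna() marks each missing cell with True, and .any(axis=1) reduces that to one True/False flag per row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 4800.0],
    "sex": ["MALE", np.nan, "FEMALE", np.nan],
})

# True for every row that has at least one missing value.
has_nan = df.isna().any(axis=1)

# Boolean indexing keeps only the flagged rows.
rows_with_nan = df[has_nan]
print(rows_with_nan)
```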

Removing rows

The first and the last of these rows contain only the target ('species') and the 'island' values. We can safely remove them since they hold too little information.
To do that, we will keep in df only the rows with fewer than two NaN values.
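One way to express this filter, sketched on made-up data (the DataFrame below is hypothetical, not the course's file), is to count the NaN values in each row and keep the rows where that count is below two.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is nearly empty, row 2 lacks only 'sex'.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe"],
    "body_mass_g": [np.nan, 3750.0, 5000.0],
    "sex": [np.nan, "MALE", np.nan],
})

# Number of missing values in each row.
nan_count = df.isna().sum(axis=1)

# Keep only the rows with fewer than two NaN values.
df = df[nan_count < 2]
print(df)
```

Rows with a single missing value survive this step, so they can still be imputed afterwards.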

Impute

In contrast, all the other rows contain much more useful information and have NaNs only in the 'sex' column, so instead of removing them completely, we can impute values for the NaN cells. This is often done with the SimpleImputer transformer.
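As a preview, here is a minimal sketch of imputing the 'sex' column with SimpleImputer. The data is made up, and the choice of strategy="most_frequent" is an assumption made here for illustration (it is a common option for categorical columns), not necessarily the setting the course uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in 'sex'.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["MALE", "MALE", "FEMALE", np.nan],
})

# Assumed strategy: replace NaN with the most frequent value of the column.
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects 2D input and returns a 2D array, so flatten it back.
df["sex"] = imputer.fit_transform(df[["sex"]]).ravel()
print(df)
```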

The next chapter will provide a more detailed explanation of SimpleImputer, and you will have the opportunity to use it yourself!
