Learn Dealing with Missing Values | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Dealing with Missing Values

Most machine learning models cannot handle missing values, so the dataset must be checked for gaps before training. If missing values are present, they can be addressed in two ways:

  • Removing rows that contain missing values;
  • Filling empty cells with substitute values, a process known as imputing.
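As a minimal sketch of these two options, the toy DataFrame below (made-up values, not the penguins data) is first filtered with dropna() and then filled with fillna(). The column values and the choice of the column mean as the substitute are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value
toy = pd.DataFrame({
    'body_mass_g': [3750.0, np.nan, 3250.0, 4050.0],
    'flipper_length_mm': [181.0, 186.0, np.nan, 193.0],
})

# Option 1: remove every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: impute, here by filling each column with its own mean
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Removing rows is simpler but discards data, while imputing keeps every row at the cost of introducing estimated values.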

Identifying Missing Values

To output general information about the dataset and check for missing values, you can use the .info() method of a DataFrame.

    import pandas as pd

    # Load the penguins dataset
    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # .info() prints its summary directly and returns None, so print() is not needed
    df.info()

The dataset has 344 entries, but the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' each contain fewer than 344 non-null values, indicating the presence of missing data.
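The same comparison can be made programmatically. The sketch below assumes a small made-up DataFrame in place of the penguins data and uses .count(), which returns the number of non-null values per column, to list only the columns that fall short of the total row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'sex': ['MALE', np.nan, 'FEMALE'],
})

# Non-null values per column, the same figure that .info() reports
non_null = toy.count()

# Keep only the columns with fewer non-null values than rows
incomplete = non_null[non_null < len(toy)]

print(incomplete)
```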

Note

Null is another name for a missing value. pandas usually displays it as NaN.

To identify the number of missing values in each column, apply the .isna() method and then use .sum().

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    print(df.isna().sum())
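Raw counts are easier to judge relative to the dataset size. As a hedged sketch on made-up data (not the penguins file), replacing .sum() with .mean() gives the fraction of missing values per column, because the True/False mask returned by .isna() is averaged as 1/0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value
toy = pd.DataFrame({
    'culmen_depth_mm': [18.7, np.nan, 18.0, 19.3],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

# Fraction of missing values in each column (between 0.0 and 1.0)
missing_share = toy.isna().mean()

print(missing_share)
```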

Rows containing missing values can be displayed with: df[df.isna().any(axis=1)]

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    print(df[df.isna().any(axis=1)])

Removing Rows

The first and last rows contain only the target ('species') and 'island' values, providing too little information to be useful. These rows can be removed by keeping only the rows with fewer than two NaN values and assigning the result back to df.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Count NaN values per row and keep only the rows with fewer than two
    df = df[df.isna().sum(axis=1) < 2]
    print(df.head(8))
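An alternative way to express the same filter is the thresh parameter of dropna(), which sets the minimum number of non-null values a row needs in order to be kept. The sketch below uses a made-up three-column DataFrame, not the penguins data, and checks that both approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 has two NaN values, row 1 has one, row 2 has none
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'body_mass_g': [np.nan, 3800.0, 5000.0],
    'sex': [np.nan, np.nan, 'MALE'],
})

# Boolean-mask approach: keep rows with fewer than two NaN values
by_mask = toy[toy.isna().sum(axis=1) < 2]

# dropna approach: "fewer than two NaN values" means
# "at least (number of columns - 1) non-null values"
by_thresh = toy.dropna(thresh=toy.shape[1] - 1)

print(by_mask.equals(by_thresh))
```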

In contrast, the remaining rows contain useful information, with NaN values appearing only in the 'sex' column. Instead of removing these rows, the missing values can be imputed. A common approach is to use the SimpleImputer transformer, which will be covered in the next chapter.
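As a brief preview only, the sketch below shows one possible way to fill a categorical column such as 'sex' with SimpleImputer. The toy data and the 'most_frequent' strategy are assumptions made for illustration; they are not necessarily the settings used in the next chapter.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in 'sex'
toy = pd.DataFrame({'sex': ['MALE', 'FEMALE', np.nan, 'MALE']})

# Replace each NaN with the most frequent value in the column
imputer = SimpleImputer(strategy='most_frequent')

# fit_transform expects a 2D input and returns a 2D array,
# so select the column with double brackets and flatten the result
toy['sex'] = imputer.fit_transform(toy[['sex']]).ravel()

print(toy)
```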


Section 2. Chapter 3
