Dealing with Missing Values | Preprocessing Data with Scikit-learn

Only a few machine learning models can handle missing values, so the dataset should be checked for gaps before training. If missing values are present, they can be addressed in two ways:

Removing rows that contain missing values;
Filling empty cells with substitute values, a process known as imputing.
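
As a minimal sketch of the two options, the toy DataFrame below is cleaned first by dropping rows and then by filling the gap. Its column names and values are made up for illustration and are not taken from the penguins dataset, and the column median is only one possible substitute value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'length_mm': [39.1, np.nan, 40.3, 36.7],
    'mass_g': [3750.0, 3800.0, 3250.0, 3450.0],
})

# Option 1: remove every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: impute, here by filling the gap with the column median
filled = toy.fillna({'length_mm': toy['length_mm'].median()})

print(dropped.shape)                 # (3, 2)
print(filled['length_mm'].tolist())  # [39.1, 39.1, 40.3, 36.7]
```

Dropping loses the otherwise valid 'mass_g' value in the second row, while filling keeps all four rows at the cost of introducing an estimated value.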

Identifying Missing Values

To output general information about the dataset and check for missing values, you can use the .info() method of a DataFrame.


import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

# .info() prints its summary directly and returns None, so print() is not needed
df.info()

The dataset has 344 entries, but the columns 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' each contain fewer than 344 non-null values, indicating the presence of missing data.

Note

Null is another name for missing values.

To identify the number of missing values in each column, apply the .isna() method and then use .sum().


import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

print(df.isna().sum())
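
Raw counts are easier to judge relative to the dataset size. The sketch below uses the same .isna() mask but takes the mean instead of the sum, which gives the share of missing values per column. The toy data is hypothetical and stands in for the penguins file so the snippet runs without a download.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'body_mass_g': [3750.0, np.nan, 3250.0, 3450.0],
    'sex': ['MALE', np.nan, np.nan, 'FEMALE'],
})

# .isna() yields True/False per cell; the mean of booleans is the share of True
missing_share = toy.isna().mean()

print(missing_share)
```

Here one of four 'body_mass_g' values is missing (a share of 0.25) and two of four 'sex' values are missing (0.5).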

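Rows containing missing values can be displayed with df[df.isna().any(axis=1)]. Here .any(axis=1) marks every row that has at least one missing value, and the resulting Boolean mask selects those rows.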


import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

print(df[df.isna().any(axis=1)])

Removing Rows

The first and last rows contain only the target ('species') and 'island' values, providing too little information to be useful. These rows can be removed by keeping only those with fewer than two NaN values and reassigning them to df.


import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

# Count the NaN values in each row and keep only rows with fewer than two
df = df[df.isna().sum(axis=1) < 2]
print(df.head(8))
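
The same filter can also be written with dropna() and its thresh parameter, which sets the minimum number of non-missing values a row must have to be kept. The sketch below compares both forms on hypothetical toy data (not the penguins file) and is an alternative phrasing, not the approach used in this chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo'],
    'body_mass_g': [3750.0, np.nan, 5000.0],
    'sex': ['MALE', np.nan, np.nan],
})

# Boolean-mask version: keep rows with fewer than two NaN values
by_mask = toy[toy.isna().sum(axis=1) < 2]

# dropna version: "fewer than 2 NaN" means "at least n_columns - 1 values present"
by_thresh = toy.dropna(thresh=toy.shape[1] - 1)

print(by_mask.equals(by_thresh))  # True
```

Both versions drop the second row (two NaN values) and keep the third (one NaN value).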

In contrast, the remaining rows contain useful information, with NaN values appearing only in the 'sex' column. Instead of removing these rows, the missing values can be imputed. A common approach is to use the SimpleImputer transformer, which will be covered in the next chapter.
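
As a preview, here is a minimal sketch of imputing a categorical column with SimpleImputer. It assumes the 'most_frequent' strategy and uses hypothetical toy data; the next chapter may configure the imputer differently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({'sex': ['MALE', 'FEMALE', np.nan, 'MALE']})

# 'most_frequent' replaces each missing entry with the column's most common value
imputer = SimpleImputer(strategy='most_frequent')

# fit_transform expects 2-D input and returns a 2-D array,
# hence the double brackets and the call to ravel()
toy['sex'] = imputer.fit_transform(toy[['sex']]).ravel()

print(toy['sex'].tolist())  # ['MALE', 'FEMALE', 'MALE', 'MALE']
```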
