Machine Learning Foundations with Scikit-Learn

Dealing with Missing Values


Only a limited number of machine learning models can handle missing values, so the dataset must be checked to ensure no gaps remain. If missing values are present, they can be addressed in two ways:

  • Removing rows that contain missing values;
  • Filling empty cells with substitute values, a process known as imputing.
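As a rough illustration of the two options, the sketch below applies both to a small made-up DataFrame. The data is hypothetical, not the penguins dataset used later in this chapter, and filling with the mean or the most frequent value is only one possible choice of substitute.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 3250.0, 4200.0],
    "sex": ["MALE", "FEMALE", None, "MALE"],
})

# Option 1: remove every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: impute, here with the column mean (numeric)
# and the most frequent value (categorical)
imputed = toy.copy()
imputed["body_mass_g"] = imputed["body_mass_g"].fillna(imputed["body_mass_g"].mean())
imputed["sex"] = imputed["sex"].fillna(imputed["sex"].mode()[0])

print(len(dropped))                # rows left after removal
print(imputed.isna().sum().sum())  # missing cells left after imputing
```

Removing rows shrinks the dataset, while imputing keeps every row at the cost of introducing estimated values.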

Identifying Missing Values

To output general information about the dataset and check for missing values, you can use the .info() method of a DataFrame.

    import pandas as pd

    # Load the penguins dataset
    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # info() prints its summary directly and returns None, so print() is not needed
    df.info()

The dataset has 344 entries, but the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' each contain fewer than 344 non-null values, indicating the presence of missing data.

Note

Null is another name for missing values.

To identify the number of missing values in each column, apply the .isna() method and then use .sum().

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Count the missing values in each column
    print(df.isna().sum())
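Absolute counts can be hard to judge on their own, so it sometimes helps to look at the share of missing values per column as well. The sketch below does this on a small hypothetical DataFrame (not the penguins data) by taking the mean of the boolean mask returned by .isna().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, np.nan, 195.0, 193.0],
    "sex": ["MALE", None, None, "FEMALE"],
})

# isna() gives a True/False mask; sum() counts the True cells per column
missing_counts = toy.isna().sum()
# mean() of the same mask gives the fraction of missing cells per column
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```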

Rows containing missing values can be displayed with: df[df.isna().any(axis=1)]

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Show only the rows that contain at least one missing value
    print(df[df.isna().any(axis=1)])

Removing Rows

In the output above, the first and last rows contain only the target ('species') and 'island' values, which is too little information to be useful. Such rows can be removed by keeping only the rows with fewer than two NaN values and assigning the result back to df.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Keep only the rows with fewer than two missing values
    df = df[df.isna().sum(axis=1) < 2]
    print(df.head(8))
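The same filter can also be written with the thresh parameter of .dropna(), which keeps rows that have at least the given number of non-missing values. The sketch below compares both forms on a hypothetical three-row DataFrame (not the penguins data). "Fewer than two NaN values" corresponds to "at least (number of columns - 1) non-missing values".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [np.nan, 3800.0, 5000.0],
    "sex": [None, None, "MALE"],
})

# Boolean-mask form: keep rows with fewer than two missing values
by_mask = toy[toy.isna().sum(axis=1) < 2]

# dropna form: keep rows with at least (n_columns - 1) non-missing values
by_thresh = toy.dropna(thresh=toy.shape[1] - 1)

print(by_mask.equals(by_thresh))
```

The mask form states the condition in terms of missing values, which matches the wording above; the thresh form is shorter but expresses the condition in terms of values that are present.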

In contrast, the remaining rows contain useful information, with NaN values appearing only in the 'sex' column. Instead of removing these rows, the missing values can be imputed. A common approach is to use the SimpleImputer transformer, which will be covered in the next chapter.
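As a brief preview, here is a minimal sketch of imputing a categorical column with SimpleImputer. The toy column and the choice of strategy='most_frequent' are assumptions for illustration; they are not necessarily the settings used in the next chapter.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column (an assumption for illustration only)
toy = pd.DataFrame({
    "sex": pd.Series(["MALE", "FEMALE", np.nan, "MALE"], dtype=object),
})

# Replace each missing value with the most frequent value in the column
imputer = SimpleImputer(strategy="most_frequent")
# fit_transform expects 2D input and returns a 2D array, hence [["sex"]] and ravel()
toy["sex"] = imputer.fit_transform(toy[["sex"]]).ravel()

print(toy["sex"].tolist())
```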
