Dealing with Missing Values
Only a limited number of machine learning models can handle missing values, so the dataset must be checked to ensure no gaps remain. If missing values are present, they can be addressed in two ways:
- Removing rows that contain missing values;
- Filling empty cells with substitute values, a process known as imputing.
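As a minimal sketch of these two options, the snippet below builds a tiny made-up DataFrame and applies each strategy with the standard pandas calls dropna() and fillna(). The column names and values are invented for illustration and are not taken from the penguins dataset; filling with the column mean is just one possible choice of substitute value.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "length": [39.1, np.nan, 40.3],
    "mass": [3750.0, 3800.0, np.nan],
})

# Option 1: remove every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute, here by filling each gap with its column mean
imputed = toy.fillna(toy.mean())

print(dropped)
print(imputed)
```

Dropping is simple but discards two of the three rows here, while imputing keeps every row at the cost of introducing estimated values.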
Identifying Missing Values
To output general information about the dataset and check for missing values, use the .info() method of a DataFrame.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # info() prints its summary directly and returns None, so print() is not needed
    df.info()
The dataset has 344 entries, but the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' each contain fewer than 344 non-null values, indicating the presence of missing data.
Null is another name for missing values.
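The same comparison of non-null counts against the total number of rows can also be done programmatically. The sketch below uses a small made-up DataFrame (its values are hypothetical, not the real penguins data) and the pandas .count() method, which returns the number of non-null values per column.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 3800.0],
    "sex": ["MALE", None, None],
})

# Number of non-null values in each column
non_null = toy.count()

# Columns whose non-null count is below the number of rows have gaps
incomplete = non_null[non_null < len(toy)].index.tolist()
print(incomplete)
```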
To identify the number of missing values in each column, apply the .isna() method and then use .sum().

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    print(df.isna().sum())
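It can also help to see missing values as a share of each column rather than as a raw count. The following sketch, on a hypothetical four-row DataFrame, relies on the fact that .isna() returns booleans, so .mean() gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "flipper_length_mm": [181.0, np.nan, 195.0, 193.0],
    "sex": ["MALE", None, None, "FEMALE"],
})

# Count of missing values per column
missing_count = toy.isna().sum()

# Fraction of missing values per column (True counts as 1, False as 0)
missing_share = toy.isna().mean()

print(missing_count)
print(missing_share)
```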
Rows containing missing values can be displayed with:
df[df.isna().any(axis=1)]

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    print(df[df.isna().any(axis=1)])
Removing Rows
The first and last rows contain only the target ('species') and 'island' values, providing too little information to be useful. These rows can be removed by keeping only the rows with fewer than two NaN values and reassigning the result to df.

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    # Keep only the rows that have fewer than two missing values
    df = df[df.isna().sum(axis=1) < 2]
    print(df.head(8))
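An alternative way to express the same filter, shown here as a sketch on made-up data rather than as the lesson's own method, is the thresh parameter of dropna(). It keeps rows that have at least the given number of non-null values, so requiring "number of columns minus one" non-null values is equivalent to allowing fewer than two NaN values.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "culmen_depth_mm": [np.nan, 17.4, 14.5],
    "body_mass_g": [np.nan, 3800.0, 5000.0],
    "sex": [None, None, "MALE"],
})

# Boolean-mask version: keep rows with fewer than two NaN values
mask_version = toy[toy.isna().sum(axis=1) < 2]

# dropna version: keep rows with at least (n_columns - 1) non-null values
thresh_version = toy.dropna(thresh=toy.shape[1] - 1)

print(mask_version.equals(thresh_version))
```

Both versions drop only the first toy row, which has three missing values, and keep the row whose only gap is in 'sex'.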
In contrast, the remaining rows contain useful information, with NaN values appearing only in the 'sex' column. Instead of removing these rows, the missing values can be imputed. A common approach is to use the SimpleImputer transformer, which will be covered in the next chapter.
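As a brief preview only, the sketch below shows what imputing a categorical column might look like with scikit-learn's SimpleImputer. The toy values and the choice of the "most_frequent" strategy are assumptions made for illustration; the lesson itself does not specify which strategy it will use.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "sex": pd.Series(["MALE", np.nan, "FEMALE", "MALE"], dtype=object),
})

# Assumed strategy: replace each gap with the most frequent value in the column
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects 2-D input and returns a 2-D array, hence [["sex"]] and ravel()
toy["sex"] = imputer.fit_transform(toy[["sex"]]).ravel()

print(toy)
```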