ML Introduction with scikit-learn
Only a few machine learning models tolerate missing values, so we need to make sure our data does not contain any. If it does, we can:
- Remove the rows containing missing values
- Fill the empty cells with some values, which is also called imputing.
To check if your dataset has missing values, you can use the .info() method of a DataFrame.
Our data contains 344 entries, and the columns 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex' have fewer than 344 non-null values, so these columns contain missing values.
Null is another name for missing values.
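The check described above can be sketched on a small made-up frame. The four rows below are illustrative values that only borrow the course dataset's column names; they are not the real 344-entry data. Besides .info(), the sketch also counts missing values per column with isna().sum(), which is an extra convenience rather than something the lesson requires.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the course dataset (illustrative values only)
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "culmen_depth_mm": [np.nan, 18.7, 14.3, np.nan],
    "flipper_length_mm": [np.nan, 181.0, 215.0, np.nan],
    "body_mass_g": [np.nan, 3750.0, 4850.0, np.nan],
    "sex": [np.nan, "MALE", np.nan, np.nan],
})

# Prints the number of non-null values for every column
df.info()

# Counts the missing values in every column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

A column whose non-null count is lower than the number of entries contains missing values.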
Let's look at the rows containing any missing values by printing them.
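One possible way to select such rows is a boolean mask built with isna().any(axis=1). The sketch below uses a made-up four-row frame (an assumption, not the course data) in which the first and last rows hold only the 'species' and 'island' values.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the course dataset (illustrative values only)
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "culmen_depth_mm": [np.nan, 18.7, 14.3, np.nan],
    "flipper_length_mm": [np.nan, 181.0, 215.0, np.nan],
    "body_mass_g": [np.nan, 3750.0, 4850.0, np.nan],
    "sex": [np.nan, "MALE", np.nan, np.nan],
})

# True for every row that has at least one missing value
has_missing = df.isna().any(axis=1)

# Keep only the rows flagged by the mask and print them
rows_with_missing = df[has_missing]
print(rows_with_missing)
```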
The first and the last row only contain the target ('species') and the 'island' values. We can safely remove those rows since they hold too little information.
For that, we will assign to df only the rows with fewer than two missing values.
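A minimal sketch of this filtering step, assuming the per-row count of missing values is computed with isna().sum(axis=1). The frame is again a made-up stand-in for the course data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the course dataset (illustrative values only)
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
    "culmen_depth_mm": [np.nan, 18.7, 14.3, np.nan],
    "flipper_length_mm": [np.nan, 181.0, 215.0, np.nan],
    "body_mass_g": [np.nan, 3750.0, 4850.0, np.nan],
    "sex": [np.nan, "MALE", np.nan, np.nan],
})

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)

# Keep only the rows with fewer than two missing values
df = df[missing_per_row < 2]
print(df)
```

Rows that are missing only the 'sex' value survive the filter, while rows that hold little more than the target are dropped.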
In contrast, all other rows contain much more useful information and have NaNs only in the 'sex' column, so instead of removing them completely, we can just impute some values for the NaN cells. This is often done with SimpleImputer. The next chapter will provide a more detailed explanation of SimpleImputer, and you will have the opportunity to use it yourself!
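As a brief preview, here is a hedged sketch of imputing a categorical column with scikit-learn's SimpleImputer. The choice of strategy="most_frequent" is an assumption made for illustration, and the frame below is made up; the next chapter may use different settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with one missing value in the 'sex' column
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": pd.Series(["MALE", np.nan, "FEMALE", "MALE"], dtype=object),
})

# Assumed strategy: replace NaN with the most frequent value in the column
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects 2D input and returns a 2D array, so flatten it back
imputed = imputer.fit_transform(df[["sex"]])
df["sex"] = imputed.ravel()
print(df)
```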