Impute Missing Values | Preprocessing Data with Scikit-learn

Impute Missing Values

SimpleImputer replaces the missing values in each column with a single value chosen for that column.

We still need to decide which value to impute.
A popular choice is the mean for numerical columns and the mode (most frequent value) for categorical columns. The choice is controlled by the strategy parameter:

  • strategy='mean' – impute with mean along each column;
  • strategy='median' – impute with median along each column;
  • strategy='most_frequent' – impute with mode along each column;
  • strategy='constant' – impute with the constant value specified in the fill_value parameter.
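The four strategies can be compared on a small example. This is a minimal sketch: the 'weight' column and its values are made up for illustration and are not part of the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing value (row 3)
df = pd.DataFrame({'weight': [1.0, 2.0, 2.0, np.nan, 7.0]})

results = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(df[['weight']])
    # Record the value that replaced the NaN in row 3
    results[strategy] = filled[3, 0]

# 'constant' needs the value to use, passed through fill_value
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
results['constant'] = constant_imputer.fit_transform(df[['weight']])[3, 0]

print(results)
```

Here the observed values are 1, 2, 2 and 7, so the mean strategy fills in 3.0, while median and most_frequent both fill in 2.0.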

The missing_values parameter controls which values are treated as missing. The default is NaN, but some datasets mark missing entries with an empty string '' or another placeholder.
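As a rough illustration of a non-default placeholder, the sketch below treats the empty string as missing. The color values are invented for the example, and a NumPy object array is used only to keep the sketch independent of how pandas stores strings.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column where '' marks a missing entry
X = np.array([['red'], [''], ['red'], ['blue']], dtype=object)

# Tell the imputer that '' (not NaN) is the missing-value marker
imputer = SimpleImputer(missing_values='', strategy='most_frequent')
X_filled = imputer.fit_transform(X)

print(X_filled.ravel())
```

The empty string is replaced with 'red', the most frequent of the remaining values.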

Note

SimpleImputer and many other transformers expect two-dimensional input, so they accept a DataFrame but not a pandas Series. Selecting a single column with df['column'] returns a Series. To get a single-column DataFrame instead, use df[['column']] (note the double brackets).
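The difference between single and double brackets can be checked directly. This is a small sketch with a made-up 'age' column; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value
df = pd.DataFrame({'age': [20.0, np.nan, 40.0]})

series = df['age']     # single brackets: 1D Series
frame = df[['age']]    # double brackets: 2D single-column DataFrame
print(series.ndim, frame.ndim)

imputer = SimpleImputer(strategy='mean')

# Passing the 1D Series is rejected with a ValueError
try:
    imputer.fit_transform(series)
    series_accepted = True
except ValueError:
    series_accepted = False

# Passing the 2D DataFrame works
imputed = imputer.fit_transform(frame)
print(series_accepted, imputed.shape)
```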

By default, the .fit_transform() method of SimpleImputer returns a 2D NumPy array, while a single DataFrame column is one-dimensional. Call .ravel() on the result to flatten it into a 1D array before assigning it back to the column.
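The shapes involved can be inspected in a short sketch. The 'age' column and its values are hypothetical, and the default scikit-learn output configuration (NumPy arrays) is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value
df = pd.DataFrame({'age': [20.0, np.nan, 40.0]})

imputer = SimpleImputer(strategy='median')
imputed_2d = imputer.fit_transform(df[['age']])  # one column, three rows
imputed_1d = imputed_2d.ravel()                  # flattened to 1D

print(imputed_2d.shape, imputed_1d.shape)

# Assign the flattened result back to the column
df['age'] = imputed_1d
print(df)
```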

Task

Your task is to impute the NaN values of the 'sex' column using SimpleImputer. Since you are dealing with a categorical column, you will replace null values with the most frequent value.

  1. Import the SimpleImputer.
  2. Create a SimpleImputer object with the desired strategy.
  3. Fit and transform a column using the imputer object and assign it to the 'sex' column of df.

Solution
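One way the three steps might look is sketched below. The course dataset is not reproduced here, so a tiny stand-in DataFrame with a 'sex' column is assumed; its values are invented and the dtype is set to object explicitly so the column holds plain Python strings.

```python
# Step 1: import the SimpleImputer
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for the course DataFrame (hypothetical values)
df = pd.DataFrame({'sex': ['MALE', np.nan, 'FEMALE', 'MALE']}, dtype=object)

# Step 2: create an imputer that fills NaN with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')

# Step 3: fit and transform the column (double brackets give 2D input),
# then flatten the result before assigning it back
df['sex'] = imputer.fit_transform(df[['sex']]).ravel()

print(df['sex'].tolist())
```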

Great! We dealt with the missing values problem in our dataset. We removed the rows with more than one null and imputed the 'sex' column with the most frequent value – MALE.


Section 2. Chapter 4