Impute Missing Values | Preprocessing Data with Scikit-learn
ML Introduction with scikit-learn

Impute Missing Values

SimpleImputer replaces the missing values in each column with a single value chosen for that column.

The value to impute has to be chosen, though.
A popular choice is the mean for numerical columns and the mode (most frequent value) for categorical columns. The choice is controlled by the strategy parameter:

  • strategy='mean' – impute with the mean of each column;
  • strategy='median' – impute with the median of each column;
  • strategy='most_frequent' – impute with the mode of each column;
  • strategy='constant' – impute with the constant value specified in the fill_value parameter.
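
The four strategies above can be compared on a toy column. This is a minimal sketch: the array X and its values are made up for illustration and do not come from the course dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column data (invented for illustration); row 2 is missing.
X = np.array([[1.0], [2.0], [np.nan], [2.0], [10.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that was written into the missing slot.
    filled[strategy] = float(imputer.fit_transform(X)[2, 0])

# 'constant' additionally needs the value to insert.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
filled["constant"] = float(constant_imputer.fit_transform(X)[2, 0])

print(filled)
# {'mean': 3.75, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```

The single outlier (10.0) pulls the mean well above the median, which is one reason the median is sometimes preferred for skewed columns.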

The missing_values parameter controls which values are considered missing. By default it is NaN, but some datasets mark missing entries differently, for example with an empty string ''.
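
As a rough illustration of a non-default missing_values, the sketch below treats empty strings as missing. The color values are invented for the example, and a NumPy object array is used instead of a DataFrame only to keep the snippet small.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy categorical column (invented); the empty string marks a missing entry.
X = np.array([["red"], [""], ["blue"], ["red"]], dtype=object)

# Tell the imputer that '' (not NaN) is the missing-value marker.
imputer = SimpleImputer(missing_values="", strategy="most_frequent")
X_filled = imputer.fit_transform(X)

print(X_filled.ravel().tolist())
```

Here the empty entry is replaced by the most frequent remaining value in the column.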

Note

SimpleImputer and many other transformers expect two-dimensional input, so they accept a pandas DataFrame but not a Series. Selecting a single column with df['column'] returns a Series. To get a one-column DataFrame instead, use df[['column']] (note the double brackets).
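
The difference between single and double brackets can be sketched as follows. The DataFrame and its 'age' column are hypothetical stand-ins, not the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data for illustration only.
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0]})

series = df["age"]    # single brackets -> 1D Series
frame = df[["age"]]   # double brackets -> 2D one-column DataFrame
print(series.shape, frame.shape)  # (4,) (4, 1)

imputer = SimpleImputer(strategy="mean")

# Passing the 1D Series is rejected because the imputer needs 2D input.
try:
    imputer.fit_transform(series)
    series_accepted = True
except ValueError:
    series_accepted = False

# The one-column DataFrame works and yields a 2D array.
result = imputer.fit_transform(frame)
print(series_accepted, result.shape)  # False (4, 1)
```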

The .fit_transform() method of SimpleImputer returns a 2D array, but pandas expects a 1D array (or a Series) when assigning to a DataFrame column. To resolve this, call .ravel() on the result to flatten the 2D array into a 1D array before assigning it back to the column.
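
A minimal sketch of the flatten-and-assign step, again using a hypothetical 'age' column rather than the course data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data for illustration only.
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0]})

imputer = SimpleImputer(strategy="median")

# fit_transform returns a 2D array of shape (n_rows, 1).
imputed = imputer.fit_transform(df[["age"]])
print(imputed.shape)  # (4, 1)

# Flatten to 1D before writing the values back into the column.
df["age"] = imputed.ravel()
print(df["age"].tolist())  # [22.0, 35.0, 35.0, 41.0]
```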

Task

Your task is to impute the NaN values of the 'sex' column using SimpleImputer. Since you are dealing with a categorical column, you will replace null values with the most frequent value.

  1. Import the SimpleImputer.
  2. Create a SimpleImputer object with the desired strategy.
  3. Fit and transform a column using the imputer object and assign it to the 'sex' column of df.


Great! We dealt with the missing values problem in our dataset. We removed the rows with more than one null and imputed the 'sex' column with the most frequent value – MALE.

Section 2. Chapter 4