ML Introduction with scikit-learn
Impute Missing Values
SimpleImputer replaces missing values with a single value, chosen separately for each column.
But we need to choose the value to impute.
A popular choice is the mean for numerical columns and the mode (most frequent value) for categorical columns. The choice is controlled by the strategy parameter:
- strategy='mean': impute with the mean of each column;
- strategy='median': impute with the median of each column;
- strategy='most_frequent': impute with the mode of each column;
- strategy='constant': impute with a constant value specified in the fill_value parameter.
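To make the difference between the strategies concrete, here is a minimal sketch that imputes the same column with each of them. The DataFrame, its 'age' column, and the fill value 0 are made-up illustration data, not part of the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric column with a single missing value
df = pd.DataFrame({'age': [10.0, np.nan, 20.0, 20.0, 70.0]})

results = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(df[['age']])  # 2D array of shape (5, 1)
    results[strategy] = filled[1, 0]             # value placed in the missing row

# strategy='constant' takes its value from fill_value (0 is an arbitrary choice here)
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)
results['constant'] = constant_imputer.fit_transform(df[['age']])[1, 0]

print(results)
```

The known values are 10, 20, 20 and 70, so the mean strategy fills in 30 while the median and most_frequent strategies both fill in 20.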
The missing_values parameter controls which values are treated as missing. By default it is NaN, but in some datasets missing entries are marked by an empty string '' or another placeholder.
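As an illustration of missing_values, the sketch below treats an empty string as the missing marker. The city names are invented, and a NumPy object array is used instead of a DataFrame purely to keep the example small.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: an empty string marks the missing entry
cities = np.array([['Paris'], [''], ['Paris'], ['Rome']], dtype=object)

# Tell the imputer that '' (not NaN) means "missing"
imputer = SimpleImputer(missing_values='', strategy='most_frequent')
filled = imputer.fit_transform(cities)

print(filled.ravel().tolist())
```

Only entries equal to the value passed as missing_values are replaced; here the empty string becomes 'Paris', the most frequent of the remaining values.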
Note
SimpleImputer and many other transformers expect 2D input, so they do not accept a pandas Series, only a DataFrame. But selecting a single column with df['column'] returns a Series. To get a one-column DataFrame instead, use df[['column']] (notice the double brackets).
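The difference between single and double brackets can be sketched as follows. The DataFrame and its 'age' column are hypothetical stand-ins for your own data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data
df = pd.DataFrame({'age': [10.0, np.nan, 20.0]})

series = df['age']    # single brackets: a 1D Series
frame = df[['age']]   # double brackets: a 2D DataFrame with one column

print(type(series).__name__, series.ndim)  # Series 1
print(type(frame).__name__, frame.shape)   # DataFrame (3, 1)

# The imputer is given the 2D selection
imputer = SimpleImputer(strategy='mean')
filled = imputer.fit_transform(frame)
```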
The .fit_transform() method of SimpleImputer returns a 2D array, while a DataFrame column is one-dimensional. Use the .ravel() method to flatten the 2D array into a 1D array before assigning it back to the column.
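A minimal sketch of flattening the result and writing it back, again on a hypothetical 'age' column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data
df = pd.DataFrame({'age': [10.0, np.nan, 20.0]})

imputer = SimpleImputer(strategy='median')
filled_2d = imputer.fit_transform(df[['age']])  # shape (3, 1)

# Flatten to shape (3,) and assign back to the original column
df['age'] = filled_2d.ravel()

print(df)
```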
Your task is to impute the NaN values of the 'sex' column using SimpleImputer. Since this is a categorical column, you will replace null values with the most frequent value.
- Import SimpleImputer.
- Create a SimpleImputer object with the desired strategy.
- Fit and transform the column using the imputer object and assign the result to the 'sex' column of df.
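The steps above could look roughly like the sketch below. It is one possible approach, not the course's reference solution, and the small df built here is a hypothetical stand-in for the course dataset (only the 'sex' column name and the MALE value come from the lesson).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the course DataFrame
df = pd.DataFrame({
    'sex': pd.Series(['MALE', 'FEMALE', np.nan, 'MALE'], dtype=object)
})

# Categorical column, so impute with the most frequent value
imputer = SimpleImputer(strategy='most_frequent')

# Double brackets give 2D input; .ravel() flattens the 2D output
df['sex'] = imputer.fit_transform(df[['sex']]).ravel()

print(df['sex'].tolist())
```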
Great! We dealt with the missing values problem in our dataset. We removed the rows with more than one null and imputed the 'sex' column with the most frequent value, MALE.