Data Preprocessing

Now that we have explored our data, we need to preprocess it so that its features can be fed into a machine learning algorithm.

Method descriptions

sklearn: This is the import name of the scikit-learn library, which provides a collection of machine learning algorithms and tools for data preprocessing, model selection, and evaluation. It is commonly used for tasks such as classification, regression, clustering, and dimensionality reduction;
preprocessing: This submodule within sklearn contains utilities for preparing data before feeding it into a machine learning model. It includes methods for scaling, normalization, and encoding categorical variables. Imputation of missing values is handled by the separate sklearn.impute submodule;
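LabelEncoder(): This class from the preprocessing submodule encodes a single categorical variable as numerical labels. It assigns each distinct category an integer from 0 to n_classes - 1. scikit-learn documents it as an encoder for target labels, but it is often applied to one feature column at a time, as in this lesson;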
.drop_duplicates(): This method removes duplicate rows from a DataFrame. By default, it treats rows as duplicates when they have identical values across all columns and keeps only the first occurrence;
.select_dtypes(): This method is used to select columns from a DataFrame based on their data types. It allows filtering columns by specifying the desired data types, such as integers (int64) or floating-point numbers (float64);
.fillna(): This method is used to fill missing values in a DataFrame with specified values. It is commonly used to impute missing data using statistical measures like mean, median, or mode;
.mean(): This method calculates the mean (average) of values along a specified axis in a DataFrame or Series;
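.mode(): This method calculates the mode (most frequent value) of values in a DataFrame or Series. Because several values can tie for most frequent, it returns a Series (or a DataFrame when called on a DataFrame) containing the mode(s) rather than a single value;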
.fit_transform(): This method is available on sklearn transformers and combines fitting and transforming in a single call. It first learns the parameters required for the transformation from the data and then applies the transformation to that same data. In the case of LabelEncoder, it learns the set of categories and then returns the corresponding numerical labels.
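
To make the descriptions above more concrete, here is a minimal sketch that selects columns by type, computes a mean and a mode, and label-encodes one column. The DataFrame, its column names (color, size, color_encoded), and its values are made up for illustration and are not the course dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: the column names and values are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 1.0, 4.0],
})

# select_dtypes filters columns by data type.
numeric_cols = list(df.select_dtypes(include="number").columns)
categorical_cols = list(df.select_dtypes(exclude="number").columns)

# mean() returns a scalar for a Series; mode() returns a Series because
# several values can tie, so we take its first entry.
size_mean = df["size"].mean()
color_mode = df["color"].mode().iloc[0]

# fit_transform learns the categories and encodes them in one call.
# The learned classes are sorted, so here blue -> 0, green -> 1, red -> 2.
encoder = LabelEncoder()
df["color_encoded"] = encoder.fit_transform(df["color"])

print(numeric_cols, categorical_cols)
print(size_mean, color_mode)
print(df)
```

The learned mapping stays available afterwards through encoder.classes_, which is useful when the integer labels need to be translated back to the original categories.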

Task

Remove duplicates (they do not provide any meaningful information for our analysis).
Replace null values with:
the mode for categorical columns;
the mean for numerical columns.

Solution
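
One possible way to approach the task is sketched below. It is not the official course solution: the DataFrame here is a small hypothetical stand-in for the real dataset, and the column names city and age are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course dataset (names and values are made up).
df = pd.DataFrame({
    "city": ["Paris", "Paris", None, "Rome", "Paris"],
    "age": [30.0, 30.0, 25.0, np.nan, 41.0],
})

# 1. Remove duplicate rows, keeping the first occurrence of each.
#    reset_index only renumbers the remaining rows from 0.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Split the columns into numerical and categorical (non-numerical) ones.
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# 3. Fill nulls: mode for categorical columns, mean for numerical columns.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())

print(df)
```

Duplicates are removed before imputation so that repeated rows do not weigh on the mean and mode used as fill values.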
