Python for Data Science: Job Change

Data Preprocessing

Now that we have explored our data, we need to preprocess it so that the features can be fed into a machine learning algorithm.

Methods description

  • sklearn: The import name of the scikit-learn package, which provides machine learning algorithms along with tools for data preprocessing, model selection, and evaluation. It is commonly used for tasks such as classification, regression, clustering, and dimensionality reduction;
  • preprocessing: This submodule of sklearn contains utilities for preparing data before feeding it into a machine learning model. It includes tools for scaling, normalization, and encoding categorical variables (imputers for missing values live in the separate sklearn.impute submodule);
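  • LabelEncoder(): This class from the preprocessing submodule encodes categorical values as integer labels. It assigns each distinct category an integer from 0 to n_classes - 1 and works on one column at a time. scikit-learn documents it as intended for target labels rather than input features;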
  • .drop_duplicates(): This method is used to remove duplicate rows from a DataFrame. It identifies rows with identical values across all columns and keeps only the first occurrence;
  • .select_dtypes(): This method is used to select columns from a DataFrame based on their data types. It allows filtering columns by specifying the desired data types, such as integers (int64) or floating-point numbers (float64);
  • .fillna(): This method is used to fill missing values in a DataFrame with specified values. It is commonly used to impute missing data using statistical measures like mean, median, or mode;
  • .mean(): This method calculates the mean (average) of values along a specified axis in a DataFrame or Series;
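  • .mode(): This method calculates the mode (most frequent value) of values in a DataFrame or Series. Because several values can tie for most frequent, it returns a Series when called on a Series and a DataFrame when called on a DataFrame, so the first mode is usually selected with .iloc[0];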
  • .fit_transform(): This method is available on sklearn transformers and combines fitting and transforming in one call. It first learns the parameters needed for the transformation from the data and then applies the transformation to that same data. In the case of LabelEncoder, it learns the set of categories and returns the corresponding integer labels.
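
As a minimal sketch of how .select_dtypes() and LabelEncoder().fit_transform() behave, consider the toy DataFrame below. The data and the column names (city, experience, city_encoded) are hypothetical and are not taken from the course dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["Paris", "Berlin", "Paris", "Rome"],
    "experience": [3.0, 5.0, 3.0, 10.0],
})

# select_dtypes picks columns by data type.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# fit_transform learns the distinct categories (stored sorted in classes_)
# and maps each value to the index of its category.
encoder = LabelEncoder()
encoded = encoder.fit_transform(df["city"])
df["city_encoded"] = encoded

print(numerical_cols)         # ['experience']
print(categorical_cols)       # ['city']
print(list(encoder.classes_)) # ['Berlin', 'Paris', 'Rome']
print(list(encoded))          # [1, 0, 1, 2]
```

Note that the integers follow the sorted order of the categories, so they carry no meaning beyond identifying each category.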

Task

  1. Remove duplicates (they do not provide any meaningful information for our analysis);
  2. Replace null values with:
  • Mode for categorical columns;
  • Mean for numerical columns.
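
One possible way to approach these steps is sketched below on a small stand-in DataFrame. The data and column names (gender, training_hours) are assumptions made for illustration; the real dataset will have different columns, and this is not necessarily the course's reference solution.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: the last row duplicates the first one,
# and each column contains one missing value.
df = pd.DataFrame({
    "gender": ["Male", "Female", None, "Male", "Male"],
    "training_hours": [10.0, 20.0, 30.0, np.nan, 10.0],
})

# Step 1: remove duplicate rows, keeping the first occurrence.
df = df.drop_duplicates().reset_index(drop=True)

# Split the columns by data type.
numerical_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns

# Step 2a: fill numerical gaps with the column mean.
for col in numerical_cols:
    df[col] = df[col].fillna(df[col].mean())

# Step 2b: fill categorical gaps with the column mode.
# mode() returns a Series because ties are possible; take the first entry.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

Duplicates are dropped before imputing so that repeated rows do not skew the mean or the mode used for filling.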

Section 1. Chapter 4