Realization | Feature Engineering
Data Preprocessing

Realization

Now that you have an idea of what feature engineering includes, let's move on to practical implementation and look at the full pipeline in action.

In this example, we will demonstrate the full data preprocessing pipeline in one program using the famous iris dataset. We will prepare the data, extract features, select the most relevant ones, create new features, standardize them, merge them, evaluate their quality, and integrate them into a machine learning model.

  1. Data preparation: we will use the iris dataset from the scikit-learn library, which is already preprocessed and cleaned.
  2. Feature extraction: we will use the following features from the dataset: Sepal length, Sepal width, Petal length, Petal width.
  3. Feature selection: we will use the SelectKBest method from scikit-learn to select the top 2 most relevant features based on their mutual information score.
  4. Feature creation: we will create a new feature called 'Sepal to Petal Ratio' by dividing the sepal length by the petal length.
  5. Standardization: we will use the StandardScaler method from scikit-learn to scale the selected features.
  6. Feature merging: we will merge the selected and newly created features into one array.
  7. Feature evaluation: we will evaluate the quality of the features by calculating their correlation coefficients.
    Features with high correlation are strongly linearly dependent and hence carry almost the same information about the dependent variable. Therefore, when two features are highly correlated, we can drop one of them.
  8. Integration and usage: finally, we will integrate the realized features into a machine-learning model for classification.
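The steps above can be sketched in a single script. This is a minimal illustration, and the final classifier (`LogisticRegression`) is an assumption, since the lesson does not name a specific model:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1-2. Data preparation and feature extraction: load the cleaned iris data
# (columns: sepal length, sepal width, petal length, petal width)
X, y = load_iris(return_X_y=True)

# 3. Feature selection: keep the 2 features with the highest mutual information
selector = SelectKBest(mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

# 4. Feature creation: 'Sepal to Petal Ratio' = sepal length / petal length
ratio = (X[:, 0] / X[:, 2]).reshape(-1, 1)

# 5. Standardization: scale the selected features to zero mean, unit variance
X_scaled = StandardScaler().fit_transform(X_selected)

# 6. Feature merging: combine selected and created features into one array
features = np.hstack([X_scaled, ratio])

# 7. Feature evaluation: pairwise correlation matrix of the merged features
print(np.corrcoef(features, rowvar=False).round(2))

# 8. Integration and usage: train a classifier on the engineered features
X_tr, X_te, y_tr, y_te = train_test_split(features, y, random_state=42)
model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
print(f"Test accuracy: {model.score(X_te, y_te):.2f}")
```

Note that only the selected features are standardized here, mirroring step 5; in practice you would often scale the created ratio feature as well before modeling.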

Note that there is a difference between feature selection and feature creation: feature selection is the process of choosing the subset of available features in a dataset that is most relevant or informative for a given machine learning task. Feature creation, on the other hand, involves generating new features from existing ones to capture more complex or abstract relationships between them.
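A toy contrast may make the distinction concrete (the array values below are hypothetical iris-like rows, not taken from the dataset):

```python
import numpy as np

# Two sample rows: sepal length, sepal width, petal length, petal width
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [6.2, 2.8, 4.8, 1.8]])

# Feature selection: pick a subset of the existing columns (petal measurements)
selected = X[:, [2, 3]]

# Feature creation: derive a brand-new column from existing ones
created = X[:, 0] / X[:, 2]  # sepal length / petal length
```

Selection never changes values, only which columns survive; creation produces values that were not in the original table.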

Section 5. Chapter 2