
Technique Idea

Feature engineering is a process used in machine learning that utilizes the available data to generate additional variables that do not exist in the original dataset. It can be applied to both supervised and unsupervised learning, with the objective of streamlining and accelerating data processing while improving the model's accuracy. The goal is to construct new features that enhance the model's predictive power.

Depending on the task, the exact data preparation steps may vary, but a full feature engineering pipeline usually looks like this:

  • collect and clean the raw data (remove outliers, handle missing values);
  • transform variables (normalize numerical data, encode categorical data as numbers);
  • construct new features from the existing ones;
  • select the features that actually improve the model.

Let's turn straight to a simple example to understand the essence of the approach. We can create new features from stock price data using the pandas and numpy libraries:
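A minimal sketch, assuming the raw data contains only a 'Date' and a 'Close' price column (both the values and the choice of features are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical daily stock prices: the two original variables
# are 'Date' and 'Close'
df = pd.DataFrame({
    'Date': pd.date_range('2023-01-02', periods=10, freq='D'),
    'Close': [100.0, 101.5, 99.8, 102.3, 103.1,
              102.7, 104.0, 105.2, 104.8, 106.1]
})

# 1. Daily return: percentage change of the closing price
df['Return'] = df['Close'].pct_change()

# 2. Log return, often preferred for statistical modeling
df['LogReturn'] = np.log(df['Close'] / df['Close'].shift(1))

# 3. Three-day moving average of the closing price
df['MA3'] = df['Close'].rolling(window=3).mean()

# 4. Three-day rolling volatility: standard deviation of returns
df['Volatility3'] = df['Return'].rolling(window=3).std()

# 5. Lag feature: the previous day's closing price
df['CloseLag1'] = df['Close'].shift(1)

print(df)
```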

In this example, we have created 5 new features based on the 2 original variables.

Here is another example of new data we can extract from what we already have. Suppose you have a dataset with online sales in your store. It might look like this:
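For illustration, assume it has a 'Date' and a 'Sales' column (the values are made up):

```
Date          Sales
2023-03-01      120
2023-03-02      115
2023-03-03      130
2023-03-04      210
```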

Based on the 'Date' variable, you can create a boolean 'Weekend' flag:
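A minimal sketch, continuing the hypothetical sales data above; pandas' `dt.dayofweek` returns 0 for Monday through 6 for Sunday, so values of 5 and 6 mark Saturday and Sunday:

```python
import pandas as pd

# Two weeks of hypothetical daily sales; weekend days sell more
sales = pd.DataFrame({
    'Date': pd.date_range('2023-03-01', periods=14, freq='D'),
    'Sales': [120, 115, 130, 210, 205, 125, 118,
              122, 119, 128, 215, 208, 121, 117]
})

# dt.dayofweek returns 0 (Monday) through 6 (Sunday),
# so >= 5 flags Saturday and Sunday
sales['Weekend'] = sales['Date'].dt.dayofweek >= 5

print(sales.head(7))
```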

Thanks to this, a sales prediction model from the ARIMA family that accepts exogenous regressors (such as SARIMAX) will work better, because it can use the 'Weekend' flag to capture the relationship between weekends and increased sales. Note that plain ARIMA models only the series itself and cannot consume extra columns.
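A minimal sketch with statsmodels' SARIMAX, reusing the hypothetical 'sales' DataFrame above; a real forecast would need a much longer series:

```python
import statsmodels.api as sm

# Fit an ARIMA(1, 0, 0) model that also uses the 'Weekend' flag
# as an exogenous regressor
model = sm.tsa.SARIMAX(
    sales['Sales'],
    exog=sales['Weekend'].astype(int),
    order=(1, 0, 0)
)
result = model.fit(disp=False)

# With weekend sales consistently higher, the 'Weekend'
# coefficient should come out positive
print(result.params)
```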

Various methods are used to implement feature engineering, including removing outliers and handling missing values, normalizing data, creating new features from existing ones, converting categorical data to numerical form, creating interaction features, and so on.
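A short sketch of a few of these methods on a made-up dataset (all column names are hypothetical):

```python
import pandas as pd

# Hypothetical dataset with one categorical and two numerical columns
df = pd.DataFrame({
    'City': ['London', 'Paris', 'London', 'Berlin'],
    'Area': [45.0, 60.0, 38.0, 75.0],
    'Rooms': [2, 3, 1, 4]
})

# Converting categorical data to numerical: one-hot encoding
df = pd.get_dummies(df, columns=['City'])

# Data normalization: min-max scaling of 'Area' to the [0, 1] range
df['AreaScaled'] = (df['Area'] - df['Area'].min()) / (df['Area'].max() - df['Area'].min())

# Interaction feature: the product of two existing variables
df['Area_x_Rooms'] = df['Area'] * df['Rooms']

print(df)
```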

How can you evaluate the effectiveness of new, or even existing, features?

  • One approach is to train the model with and without the new features and compare its performance, to see whether they actually improve accuracy;
  • Another is to use feature importance techniques, such as permutation importance or a model's built-in feature importance scores, to measure the contribution of each feature to the model's predictions (see the sketch below).
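A minimal sketch of the second approach with scikit-learn's permutation_importance, using synthetic data in place of a real feature-engineered dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic regression data: 5 features, only 3 of them informative
X, y = make_regression(n_samples=500, n_features=5,
                       n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Permutation importance: how much does shuffling each feature
# degrade the model's score on held-out data?
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=42)
for i, importance in enumerate(result.importances_mean):
    print(f'Feature {i}: {importance:.3f}')
```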
