Data Preprocessing

Course content:

1. Brief Introduction
2. Processing Quantitative Data
3. Processing Categorical Data
4. Time Series Data Processing
5. Feature Engineering
6. Moving on to Tasks

Technique Idea

Feature engineering is a process used in machine learning that takes the available data and generates additional variables that do not exist in the original dataset. The technique can be applied in both supervised and unsupervised learning, with the objective of streamlining and accelerating data processing while also improving model accuracy. The goal is to construct new features that enhance the model's predictive power.

Depending on the task, the exact data preparation steps may vary, but a full feature engineering pipeline usually covers the same stages: cleaning the data (handling missing values and outliers), normalizing it, creating new features from existing ones, and encoding categorical variables.

Let's turn straight to a simple example to understand the essence of the approach. We can create new features from stock price data using the pandas and numpy libraries:

import pandas as pd
import numpy as np

# Read the dataset
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/9c23bf60-276c-4989-a9d7-3091716b4507/datasets/stock_prices.csv')

# Remove missing values
df.dropna(inplace=True)

# Normalize the price column (z-score standardization)
df['price'] = (df['price'] - df['price'].mean()) / df['price'].std()

# Create new features from the price column
df['price_sq'] = df['price'] ** 2
df['price_diff'] = df['price'].diff()
df['price_pct_change'] = df['price'].pct_change()

# Create interaction features: per-company price statistics
df['price_company_mean'] = df.groupby('company')['price'].transform('mean')
df['price_company_std'] = df.groupby('company')['price'].transform('std')

# Convert categorical data to numerical data (one-hot encoding)
df = pd.get_dummies(df, columns=['company'])

print(df)

In this example, we created five new features (price_sq, price_diff, price_pct_change, price_company_mean, price_company_std) from just two original variables, price and company, and additionally one-hot encoded the company column.

Here is another example of what new data can be extracted from data we already have. Suppose you have a dataset of your store's online sales that records the date of each order in a 'Date' column.

Based on the 'Date' variable, you can create a boolean 'Weekend' flag that marks Saturdays and Sundays.
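
A minimal pandas sketch of this step, assuming the sales data has been loaded into a DataFrame (the 'Sales' values below are purely illustrative):

import pandas as pd

# Hypothetical sales table; only the 'Date' column is required for the flag
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-06-01', '2023-06-02', '2023-06-03', '2023-06-04']),
    'Sales': [120, 135, 210, 198],
})

# dayofweek: Monday = 0 ... Sunday = 6, so values 5 and 6 are weekend days
sales['Weekend'] = sales['Date'].dt.dayofweek >= 5

print(sales)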

Thanks to this, a forecasting model such as SARIMAX (an ARIMA model extended with exogenous regressors) can detect the relationship between a True value in the 'Weekend' column and increased sales, so its sales predictions will improve.
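
As a rough sketch of how the flag could feed such a forecast, assuming sales is a sufficiently long daily series with the 'Sales' and 'Weekend' columns from above (the (1, 1, 1) order is an arbitrary illustration, not a tuned choice), statsmodels' SARIMAX accepts the flag as an exogenous variable:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Target series and exogenous regressor built from the engineered flag
y = sales['Sales'].astype(float)
exog = sales[['Weekend']].astype(int)

# Fit an ARIMA(1, 1, 1) model with 'Weekend' as an exogenous variable
model = SARIMAX(y, exog=exog, order=(1, 1, 1))
result = model.fit(disp=False)

# Forecasting also requires future values of the exogenous flag
future_dates = pd.date_range(sales['Date'].iloc[-1] + pd.Timedelta(days=1), periods=7)
future_exog = pd.DataFrame({'Weekend': (future_dates.dayofweek >= 5).astype(int)})
print(result.forecast(steps=7, exog=future_exog))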

Various methods are used when implementing feature engineering, including removing outliers and missing values, normalizing the data, creating new features based on existing ones, converting categorical data to numerical data, creating interaction features, and so on.
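
Of these, outlier removal is the only step not shown in the stock example above. A minimal sketch using the common 1.5 × IQR rule on the 'price' column, continuing with the df from that example, could look like this:

# Interquartile range of the price column
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1

# Keep only rows whose price falls inside the 1.5 * IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df['price'] >= lower) & (df['price'] <= upper)]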

How can you evaluate the effectiveness of new (or even existing) features?

  • Train and evaluate the model with and without the new features and compare its performance, to see whether they actually improve the model's accuracy;
  • Another approach is to use feature importance techniques, such as permutation importance or a model's built-in feature importance scores, to measure how much each feature contributes to the model's predictions (both ideas are sketched right after this list).
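
A compact sketch of both checks with scikit-learn, using synthetic data and a random forest purely for illustration (the dataset, the engineered columns, and the model choice are assumptions, not part of the lesson's stock example):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic "original" features plus two hand-made engineered ones
X_base, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_extra = np.column_stack([X_base[:, 0] ** 2, X_base[:, 1] * X_base[:, 2]])
X_full = np.hstack([X_base, X_extra])

model = RandomForestRegressor(random_state=0)

# 1) Compare cross-validated quality with and without the new features
score_base = cross_val_score(model, X_base, y, cv=5).mean()
score_full = cross_val_score(model, X_full, y, cv=5).mean()
print(f'R^2 without new features: {score_base:.3f}, with new features: {score_full:.3f}')

# 2) Permutation importance of every feature on a held-out validation set
X_train, X_val, y_train, y_val = train_test_split(X_full, y, random_state=0)
model.fit(X_train, y_train)
perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(perm.importances_mean)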
