Data Preprocessing

Course content:

1. Brief Introduction
2. Processing Quantitative Data
3. Processing Categorical Data
4. Time Series Data Processing
5. Feature Engineering
6. Moving on to Tasks

Technique Idea

Feature engineering is a process used in machine learning that takes the available data and generates additional variables that do not exist in the original dataset. The technique can be applied in both supervised and unsupervised learning, with the objective of streamlining and accelerating data processing while also improving model accuracy. The goal is to construct new features that enhance the model's predictive power.

Depending on the task, the exact data preparation steps may vary, but a full feature engineering pipeline usually covers the same stages: cleaning the data (handling missing values and outliers), normalizing it, creating new features from existing ones, and encoding categorical variables.

Let's turn straight to a simple example to understand the essence of the approach. We can create new features from stock price data using the pandas and numpy libraries:

import pandas as pd
import numpy as np

# Read the dataset
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/9c23bf60-276c-4989-a9d7-3091716b4507/datasets/stock_prices.csv')

# Remove missing values
df.dropna(inplace=True)

# Normalize the price column (z-score standardization)
df['price'] = (df['price'] - df['price'].mean()) / df['price'].std()

# Create new features from the price column
df['price_sq'] = df['price'] ** 2
df['price_diff'] = df['price'].diff()
df['price_pct_change'] = df['price'].pct_change()

# Create interaction features: per-company price statistics
df['price_company_mean'] = df.groupby('company')['price'].transform('mean')
df['price_company_std'] = df.groupby('company')['price'].transform('std')

# Convert categorical data to numerical data (one-hot encoding)
df = pd.get_dummies(df, columns=['company'])

print(df)

In this example, we created five new features (price_sq, price_diff, price_pct_change, price_company_mean, price_company_std) from just two original variables, price and company, and additionally one-hot encoded the company column.

Here is another example of what new data can be extracted from data we already have. Suppose you have a dataset of your store's online sales that records the date of each order in a 'Date' column.

Based on the 'Date' variable, you can create a boolean 'Weekend' flag that marks Saturdays and Sundays.
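
A minimal pandas sketch of this step, assuming the sales data has been loaded into a DataFrame (the 'Sales' values below are purely illustrative):

import pandas as pd

# Hypothetical sales table; only the 'Date' column is required for the flag
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-06-01', '2023-06-02', '2023-06-03', '2023-06-04']),
    'Sales': [120, 135, 210, 198],
})

# dayofweek: Monday = 0 ... Sunday = 6, so values 5 and 6 are weekend days
sales['Weekend'] = sales['Date'].dt.dayofweek >= 5

print(sales)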

Thanks to this, a forecasting model such as SARIMAX (an ARIMA model extended with exogenous regressors) can detect the relationship between a True value in the 'Weekend' column and increased sales, so its sales predictions will improve.
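
As a rough sketch of how the flag could feed such a forecast, assuming sales is a sufficiently long daily series with the 'Sales' and 'Weekend' columns from above (the (1, 1, 1) order is an arbitrary illustration, not a tuned choice), statsmodels' SARIMAX accepts the flag as an exogenous variable:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Target series and exogenous regressor built from the engineered flag
y = sales['Sales'].astype(float)
exog = sales[['Weekend']].astype(int)

# Fit an ARIMA(1, 1, 1) model with 'Weekend' as an exogenous variable
model = SARIMAX(y, exog=exog, order=(1, 1, 1))
result = model.fit(disp=False)

# Forecasting also requires future values of the exogenous flag
future_dates = pd.date_range(sales['Date'].iloc[-1] + pd.Timedelta(days=1), periods=7)
future_exog = pd.DataFrame({'Weekend': (future_dates.dayofweek >= 5).astype(int)})
print(result.forecast(steps=7, exog=future_exog))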

Various methods are used when implementing feature engineering, including removing outliers and missing values, normalizing the data, creating new features based on existing ones, converting categorical data to numerical data, creating interaction features, and so on.
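
Of these, outlier removal is the only step not shown in the stock example above. A minimal sketch using the common 1.5 × IQR rule on the 'price' column, continuing with the df from that example, could look like this:

# Interquartile range of the price column
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1

# Keep only rows whose price falls inside the 1.5 * IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[(df['price'] >= lower) & (df['price'] <= upper)]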

How can you evaluate the effectiveness of new (or even existing) features?

  • Train and evaluate the model with and without the new features and compare its performance, to see whether they actually improve the model's accuracy;
  • Another approach is to use feature importance techniques, such as permutation importance or a model's built-in feature importance scores, to measure how much each feature contributes to the model's predictions (both ideas are sketched right after this list).
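
A compact sketch of both checks with scikit-learn, using synthetic data and a random forest purely for illustration (the dataset, the engineered columns, and the model choice are assumptions, not part of the lesson's stock example):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic "original" features plus two hand-made engineered ones
X_base, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_extra = np.column_stack([X_base[:, 0] ** 2, X_base[:, 1] * X_base[:, 2]])
X_full = np.hstack([X_base, X_extra])

model = RandomForestRegressor(random_state=0)

# 1) Compare cross-validated quality with and without the new features
score_base = cross_val_score(model, X_base, y, cv=5).mean()
score_full = cross_val_score(model, X_full, y, cv=5).mean()
print(f'R^2 without new features: {score_base:.3f}, with new features: {score_full:.3f}')

# 2) Permutation importance of every feature on a held-out validation set
X_train, X_val, y_train, y_val = train_test_split(X_full, y, random_state=0)
model.fit(X_train, y_train)
perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
print(perm.importances_mean)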
