Data Preprocessing
Technique Idea
Feature engineering is a process used in machine learning that leverages the available data to construct additional variables that are not present in the original dataset. It can be applied to both supervised and unsupervised learning, with the aim of simplifying and speeding up data processing while also improving model accuracy. The goal is to construct new features that enhance the model's predictive power.
Depending on the task, the data preparation steps may change, but a typical feature engineering pipeline includes cleaning the data (handling outliers and missing values), transforming and normalizing it, constructing new features from existing ones, and encoding categorical variables.
Let's turn right away to a simple example to understand the essence of the approach. We can create new features from stock price data using the pandas and numpy libraries:
import pandas as pd
import numpy as np

# Read the dataset
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/9c23bf60-276c-4989-a9d7-3091716b4507/datasets/stock_prices.csv')

# Remove missing values
df.dropna(inplace=True)

# Normalize the data
df['price'] = (df['price'] - df['price'].mean()) / df['price'].std()

# Create new features
df['price_sq'] = df['price'] ** 2
df['price_diff'] = df['price'].diff()
df['price_pct_change'] = df['price'].pct_change()

# Create interaction features
df['price_company_mean'] = df.groupby('company')['price'].transform('mean')
df['price_company_std'] = df.groupby('company')['price'].transform('std')

# Convert categorical data to numerical data
df = pd.get_dummies(df, columns=['company'])

print(df)
In this example, we created five new features from the two original variables ('price' and 'company'), and additionally one-hot encoded the 'company' column.
Here is another example of the new information we can extract from data we already have. Suppose you have a dataset of online sales from your store that contains a 'Date' column for each order. Based on the 'Date' variable, you can create a boolean 'Weekend' indicator, as in the sketch below.
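A minimal sketch of this step, assuming a hypothetical sales table whose column names are invented purely for illustration:

import pandas as pd

# Hypothetical sales data; the column names are assumptions for illustration
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-03-03', '2023-03-04', '2023-03-05', '2023-03-06']),
    'Sales': [120, 210, 190, 95],
})

# dayofweek is 5 for Saturday and 6 for Sunday
sales['Weekend'] = sales['Date'].dt.dayofweek >= 5

print(sales)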
Thanks to this, a sales forecasting model such as ARIMA with exogenous regressors (ARIMAX/SARIMAX) will perform better, because it can capture the relationship between a positive value in the 'Weekend' column and increased sales.
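A rough sketch of that idea, assuming the statsmodels library and toy data in place of a real sales table (the weekend spike of 80 units is made up for illustration):

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Assumed toy data: daily sales indexed by date, higher on weekends
index = pd.date_range('2023-03-01', periods=60, freq='D')
sales = pd.DataFrame(index=index)
sales['Weekend'] = (sales.index.dayofweek >= 5).astype(int)
sales['Sales'] = 100 + 80 * sales['Weekend'] + 0.5 * np.arange(60)

# Fit an ARIMA(1, 0, 1) model that uses 'Weekend' as an exogenous regressor
model = SARIMAX(sales['Sales'], exog=sales['Weekend'], order=(1, 0, 1))
result = model.fit(disp=False)

# To forecast, future values of the regressor must be supplied as well
future_index = pd.date_range(index[-1] + pd.Timedelta(days=1), periods=7, freq='D')
future_weekend = (future_index.dayofweek >= 5).astype(int)
print(result.forecast(steps=7, exog=future_weekend))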
Various methods are used when implementing feature engineering, including removing outliers and missing values, normalizing the data, creating new features from existing ones, converting categorical data to numerical data, creating interaction features, and so on.
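Most of these steps already appear in the stock price example above. Outlier removal was not shown there, so here is a small sketch of one common approach, the IQR rule, applied to a hypothetical numeric column:

import pandas as pd

# Hypothetical column with one obvious outlier (300)
df = pd.DataFrame({'price': [10, 12, 11, 13, 12, 300, 11, 9]})

# Keep only rows within 1.5 * IQR of the interquartile range
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
df_clean = df[df['price'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(df_clean)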
How can we evaluate the effectiveness of new, or even existing, features?
- The performance of the model can be compared with and without the new features to see whether they have improved the model's accuracy;
- Another approach is to use feature importance techniques, such as permutation importance or a model's built-in feature importance scores, to estimate each feature's contribution to the model's predictions (see the sketch after this list).
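A minimal sketch of the second approach, assuming scikit-learn and synthetic data in place of a real feature matrix:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real feature matrix and target
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Built-in importance scores of the fitted model
print(model.feature_importances_)

# Permutation importance: how much the test score drops when each feature is shuffled
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)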