Data Preprocessing
Implementation
Now you have an idea of what feature engineering includes. Let's move on to practical implementation and look at the full pipeline in action.
In this example, we will demonstrate the whole pipeline in one program for data preprocessing using the famous iris dataset. We will prepare the data, read the features, select the most relevant ones, create new features, standardize them, merge them, evaluate their quality, and integrate them for use in a machine learning model.
- Data preparation: we will use the iris dataset from the scikit-learn library, which is already preprocessed and cleaned.
- Feature reading: we will use the following features from the dataset: Sepal length, Sepal width, Petal length, Petal width.
- Feature selection: we will use the SelectKBest method from scikit-learn to select the top 2 most relevant features based on their mutual information score.
- Feature creation: we will create a new feature called 'Sepal to Petal Ratio' by dividing the sepal length by the petal length.
- Standardization: we will use the StandardScaler method from scikit-learn to scale the selected features.
- Feature merging: we will merge the selected and newly created features into one array.
- Feature evaluation: we will evaluate the quality of the features by calculating their correlation coefficients. Features with a high mutual correlation are strongly linearly dependent and therefore carry nearly the same information about the target variable; when two features are highly correlated, one of them can be dropped.
- Integration and usage: finally, we will integrate the engineered features into a machine learning model for classification.
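Before running the full program, it can help to see the selection step in isolation. The sketch below (a minimal illustration, not part of the walkthrough's code) uses SelectKBest with mutual information on the iris data and then calls `get_support()` to reveal which of the four original columns were kept:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

iris = load_iris()

# Rank the four features by mutual information with the class label
# and keep the two highest-scoring ones
kbest = SelectKBest(mutual_info_classif, k=2)
X_sel = kbest.fit_transform(iris.data, iris.target)

# get_support() returns a boolean mask over the original columns,
# so we can map the surviving columns back to their names
mask = kbest.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(selected)
print(X_sel.shape)
```

Because mutual information is estimated with a randomized nearest-neighbor method, the scores can vary slightly between runs, though on iris the petal measurements are consistently ranked highest.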
Note that there is a difference between feature selection and feature creation: feature selection is the process of choosing the subset of available features in a dataset that is most relevant or informative for a given machine learning task. Feature creation, on the other hand, involves generating new features from the existing ones in order to capture more complex or abstract relationships between them.
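To make the distinction concrete, here is a minimal sketch: selection keeps a subset of existing columns, while creation appends a column that did not exist in the dataset (the `petal_area` feature here is an illustrative choice, not part of the walkthrough above):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data  # shape (150, 4)

# Feature selection: keep a subset of the existing columns
# (here simply columns 2 and 3; a real selector would rank them first)
X_selected = X[:, [2, 3]]

# Feature creation: derive a new column from the existing ones
petal_area = (X[:, 2] * X[:, 3]).reshape(-1, 1)
X_created = np.hstack((X, petal_area))

print(X_selected.shape)  # (150, 2) -- fewer columns
print(X_created.shape)   # (150, 5) -- one extra column
```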
```python
# Import libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load dataset
iris = load_iris()

# Read features
X = iris.data
sepal_length = X[:, 0]
sepal_width = X[:, 1]
petal_length = X[:, 2]
petal_width = X[:, 3]

# Create new features
sepal_to_petal_ratio = sepal_length / petal_length
sepal_to_petal_ratio = np.reshape(sepal_to_petal_ratio, (-1, 1))
sepal_area = sepal_length * sepal_width
petal_area = petal_length * petal_width
ratio_sepal = sepal_length / sepal_width
ratio_petal = petal_length / petal_width

# Feature selection
kbest = SelectKBest(mutual_info_classif, k=2)
X_new = kbest.fit_transform(X, iris.target)

# Feature creation
X_new = np.hstack((X_new, sepal_to_petal_ratio))

# Scaling
scaler = StandardScaler()
X_new = scaler.fit_transform(X_new)

# Feature merging
X_new = np.hstack((X_new, sepal_area.reshape(-1, 1)))
X_new = np.hstack((X_new, petal_area.reshape(-1, 1)))
X_new = np.hstack((X_new, ratio_sepal.reshape(-1, 1)))
X_new = np.hstack((X_new, ratio_petal.reshape(-1, 1)))

# Feature evaluation
correlation_matrix = np.corrcoef(X_new.T)
print('Correlation Matrix:')
print(correlation_matrix)

X_new = np.array(X_new, np.float32)

# Integration and usage
# The engineered features can now be used in a machine learning model for classification
```
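The final step above only notes that the features "can now be used" in a model. A minimal sketch of that integration might look as follows, assuming a logistic regression classifier as a placeholder for whatever model the course uses next (for brevity, and matching the walkthrough, the selector and scaler are fitted on the full dataset rather than only on the training split):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Rebuild the engineered matrix: two selected features plus the ratio
sepal_to_petal_ratio = (X[:, 0] / X[:, 2]).reshape(-1, 1)
X_new = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)
X_new = np.hstack((X_new, sepal_to_petal_ratio))
X_new = StandardScaler().fit_transform(X_new)

# Hold out a test split and fit a simple classifier on the features
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y, test_size=0.3, random_state=42)
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)

acc = clf.score(X_test, y_test)
print(f'Test accuracy: {acc:.2f}')
```

Any scikit-learn estimator with a `fit`/`predict` interface could be substituted here; the point is that the engineered matrix `X_new` plugs directly into the usual training workflow.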