Aprenda Feature Transformation and Extraction

Deslize para mostrar o menu

Many real-world datasets contain features with skewed distributions, which can reduce the effectiveness of machine learning models. You can apply mathematical transformations to reduce skewness and improve data quality. Two common methods are:

Logarithmic transformation: reduces strong positive skewness by applying log(x);
Square root transformation: moderates smaller degrees of skewness using sqrt(x).

These methods help make feature distributions more normal-like and enhance model performance.


              123456789101112131415161718192021222324252627282930
            
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')
fare = df['fare']

# Apply log transformation (add 1 to handle zeros)
fare_log = np.log(fare + 1)

# Create side-by-side histogram comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original fare
axes[0].hist(fare, bins=50, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Fare ($)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Original Fare Distribution', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Log-transformed fare
axes[1].hist(fare_log, bins=50, color='lightcoral', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Log(Fare + 1)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Log-Transformed Fare Distribution', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()

Definition

Feature extraction is the process of creating new features from raw data to improve the performance of machine learning models.

It helps by making important information more explicit, reducing noise, and sometimes lowering the dimensionality of the data. Effective feature extraction can lead to better predictions and more interpretable models.


              1234567891011
            
import seaborn as sns
import pandas as pd

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Create a new feature: family_size = sibsp + parch + 1
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Show the first few rows with the new feature
print(df[['sibsp', 'parch', 'family_size']].head())

Tudo estava claro?

Obrigado pelo seu feedback!

Seção 1. Capítulo 7

Pergunte à IA

Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo

Seção 1. Capítulo 7