Feature Transformation and Extraction

Many real-world datasets contain features with skewed distributions, which can reduce the effectiveness of machine learning models. You can apply mathematical transformations to reduce skewness and improve data quality. Two common methods are:

  • Logarithmic transformation: reduces strong positive skewness by applying log(x);
  • Square root transformation: moderates smaller degrees of skewness using sqrt(x).

These methods help make feature distributions more normal-like and enhance model performance.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')
fare = df['fare']

# Apply log transformation (add 1 to handle zeros)
fare_log = np.log(fare + 1)

# Create side-by-side histogram comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original fare
axes[0].hist(fare, bins=50, color='skyblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Fare ($)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Original Fare Distribution', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Log-transformed fare
axes[1].hist(fare_log, bins=50, color='lightcoral', edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Log(Fare + 1)', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Log-Transformed Fare Distribution', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
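The plot above covers only the log transform. Below is a minimal sketch, not part of the original exercise, that applies both the log and the square-root transformations to the same fare column and compares skewness with pandas' built-in skew() method; the variable names fare_log and fare_sqrt are chosen here for illustration.

import numpy as np
import seaborn as sns

# Load the Titanic dataset and select the fare column
df = sns.load_dataset('titanic')
fare = df['fare']

# Log transformation (add 1 to handle zero fares) and square root transformation
fare_log = np.log(fare + 1)
fare_sqrt = np.sqrt(fare)

# Skewness closer to 0 means a more symmetric, normal-like distribution
print(f"Original skewness: {fare.skew():.2f}")
print(f"Log skewness:      {fare_log.skew():.2f}")
print(f"Sqrt skewness:     {fare_sqrt.skew():.2f}")

On this strongly right-skewed column, the log transform should bring the skewness closer to zero than the square root does, which matches the guidance in the list above.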
Definition

Feature extraction is the process of creating new features from raw data to improve the performance of machine learning models.

It helps by making important information more explicit, reducing noise, and sometimes lowering the dimensionality of the data. Effective feature extraction can lead to better predictions and more interpretable models.

import seaborn as sns
import pandas as pd

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Create a new feature: family_size = sibsp + parch + 1
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Show the first few rows with the new feature
print(df[['sibsp', 'parch', 'family_size']].head())
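As a further illustration, and going beyond the original lesson, the same idea can be pushed one step further by combining existing columns. The feature names fare_per_person and is_alone below are illustrative choices, not part of the exercise.

import seaborn as sns

# Load the Titanic dataset
df = sns.load_dataset('titanic')

# Family size, as in the example above
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Illustrative extra features derived from existing columns:
# fare split evenly across the passenger's family group
df['fare_per_person'] = df['fare'] / df['family_size']
# flag passengers travelling without any family members
df['is_alone'] = (df['family_size'] == 1).astype(int)

print(df[['fare', 'family_size', 'fare_per_person', 'is_alone']].head())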
Question

Which transformation would be most appropriate for a variable with strong positive skewness and only positive values?

