Feature Transformation and Extraction
Many real-world datasets contain features with skewed distributions, which can reduce the effectiveness of machine learning models. You can apply mathematical transformations to reduce skewness and improve data quality. Two common methods are:
- Logarithmic transformation: reduces strong positive skewness by applying
log(x); - Square root transformation: moderates smaller degrees of skewness using
sqrt(x).
These methods help make feature distributions more normal-like and enhance model performance.
123456789101112131415161718192021222324252627282930import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load the Titanic dataset df = sns.load_dataset('titanic') fare = df['fare'] # Apply log transformation (add 1 to handle zeros) fare_log = np.log(fare + 1) # Create side-by-side histogram comparison fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Original fare axes[0].hist(fare, bins=50, color='skyblue', edgecolor='black', alpha=0.7) axes[0].set_xlabel('Fare ($)', fontsize=12) axes[0].set_ylabel('Frequency', fontsize=12) axes[0].set_title('Original Fare Distribution', fontsize=14, fontweight='bold') axes[0].grid(True, alpha=0.3) # Log-transformed fare axes[1].hist(fare_log, bins=50, color='lightcoral', edgecolor='black', alpha=0.7) axes[1].set_xlabel('Log(Fare + 1)', fontsize=12) axes[1].set_ylabel('Frequency', fontsize=12) axes[1].set_title('Log-Transformed Fare Distribution', fontsize=14, fontweight='bold') axes[1].grid(True, alpha=0.3) plt.tight_layout()
Feature extraction is the process of creating new features from raw data to improve the performance of machine learning models.
It helps by making important information more explicit, reducing noise, and sometimes lowering the dimensionality of the data. Effective feature extraction can lead to better predictions and more interpretable models.
1234567891011import seaborn as sns import pandas as pd # Load the Titanic dataset df = sns.load_dataset('titanic') # Create a new feature: family_size = sibsp + parch + 1 df['family_size'] = df['sibsp'] + df['parch'] + 1 # Show the first few rows with the new feature print(df[['sibsp', 'parch', 'family_size']].head())
Obrigado pelo seu feedback!
Pergunte à IA
Pergunte à IA
Pergunte o que quiser ou experimente uma das perguntas sugeridas para iniciar nosso bate-papo
Incrível!
Completion taxa melhorada para 8.33
Feature Transformation and Extraction
Deslize para mostrar o menu
Many real-world datasets contain features with skewed distributions, which can reduce the effectiveness of machine learning models. You can apply mathematical transformations to reduce skewness and improve data quality. Two common methods are:
- Logarithmic transformation: reduces strong positive skewness by applying
log(x); - Square root transformation: moderates smaller degrees of skewness using
sqrt(x).
These methods help make feature distributions more normal-like and enhance model performance.
123456789101112131415161718192021222324252627282930import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load the Titanic dataset df = sns.load_dataset('titanic') fare = df['fare'] # Apply log transformation (add 1 to handle zeros) fare_log = np.log(fare + 1) # Create side-by-side histogram comparison fig, axes = plt.subplots(1, 2, figsize=(14, 5)) # Original fare axes[0].hist(fare, bins=50, color='skyblue', edgecolor='black', alpha=0.7) axes[0].set_xlabel('Fare ($)', fontsize=12) axes[0].set_ylabel('Frequency', fontsize=12) axes[0].set_title('Original Fare Distribution', fontsize=14, fontweight='bold') axes[0].grid(True, alpha=0.3) # Log-transformed fare axes[1].hist(fare_log, bins=50, color='lightcoral', edgecolor='black', alpha=0.7) axes[1].set_xlabel('Log(Fare + 1)', fontsize=12) axes[1].set_ylabel('Frequency', fontsize=12) axes[1].set_title('Log-Transformed Fare Distribution', fontsize=14, fontweight='bold') axes[1].grid(True, alpha=0.3) plt.tight_layout()
Feature extraction is the process of creating new features from raw data to improve the performance of machine learning models.
It helps by making important information more explicit, reducing noise, and sometimes lowering the dimensionality of the data. Effective feature extraction can lead to better predictions and more interpretable models.
1234567891011import seaborn as sns import pandas as pd # Load the Titanic dataset df = sns.load_dataset('titanic') # Create a new feature: family_size = sibsp + parch + 1 df['family_size'] = df['sibsp'] + df['parch'] + 1 # Show the first few rows with the new feature print(df[['sibsp', 'parch', 'family_size']].head())
Obrigado pelo seu feedback!