Scaling and Normalization
Numerical features in your data often have very different scales, which can hurt the performance of machine learning algorithms, especially those that rely on distance calculations or assume normally distributed inputs. Scaling puts all features on a comparable range so that no single feature dominates model training.
The two main scaling techniques are:
- Normalization: rescales features to a fixed range, usually between 0 and 1;
- Standardization: transforms features to have a mean of 0 and a standard deviation of 1.
Each method changes your data's range in a different way and is best suited to specific scenarios.
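Concretely, the two transformations reduce to simple formulas:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad\text{and}\qquad z = \frac{x - \mu}{\sigma},$$

where $x_{\min}$ and $x_{\max}$ are the feature's minimum and maximum values, and $\mu$ and $\sigma$ are its mean and standard deviation.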
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Load Titanic dataset from seaborn
import seaborn as sns
titanic = sns.load_dataset('titanic')

# Select numerical features for scaling
features = ['age', 'fare', 'sibsp', 'parch']
df = titanic[features].dropna()

# Standardization
scaler_standard = StandardScaler()
df_standardized = pd.DataFrame(
    scaler_standard.fit_transform(df),
    columns=df.columns
)

# Normalization
scaler_minmax = MinMaxScaler()
df_normalized = pd.DataFrame(
    scaler_minmax.fit_transform(df),
    columns=df.columns
)

print("Standardized Data (first 5 rows):")
print(df_standardized.head())
print("\nNormalized Data (first 5 rows):")
print(df_normalized.head())
```
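Note that the snippet above fits each scaler on the full dataset for simplicity. In a real workflow you would usually fit the scaler on the training split only and reuse its learned parameters on the test split, so information from the test data never leaks into the scaling. A minimal sketch, reusing `df` from above:

```python
from sklearn.model_selection import train_test_split

# Split first, then fit the scaler on the training portion only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_df)  # learns mean/std from train
test_scaled = scaler.transform(test_df)        # reuses train statistics
```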
Standardization is best when your data follows a Gaussian (normal) distribution, or when algorithms expect centered data, such as linear regression, logistic regression, or k-means clustering.
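Continuing from `df_standardized` above, you can verify this behavior directly: every column should come out with mean roughly 0 and standard deviation roughly 1 (StandardScaler divides by the population standard deviation, hence `ddof=0`):

```python
# Each standardized column should have mean ~0 and population std ~1
print(df_standardized.mean().round(6))
print(df_standardized.std(ddof=0).round(6))
```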
Normalization is preferred when you want all features to have the same scale, especially for algorithms that use distance metrics, like k-nearest neighbors or neural networks.
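Likewise, continuing from `df_normalized` above, each column's minimum and maximum should now be exactly 0 and 1:

```python
# MinMaxScaler maps each column onto the [0, 1] range
print(df_normalized.min())  # 0.0 for every column
print(df_normalized.max())  # 1.0 for every column
```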