Choosing and Evaluating Techniques: Feature Scaling and Normalization Deep Dive

Avoiding Data Leakage in Preprocessing

When you apply feature scaling or normalization to your data, it's crucial to avoid data leakage—a subtle but serious issue that can compromise the validity of your machine learning models. Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates that won't generalize to new, unseen data.

A common place for leakage to slip in is the preprocessing stage. Suppose you want to scale your features using a technique like z-score standardization. If you compute the mean and standard deviation on the entire dataset—including both training and test sets—before splitting, you inadvertently let information from the test set influence the scaling parameters. Your model has then effectively seen part of the test set before evaluation, which biases its performance estimate upward.
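To make the failure mode concrete, here is a minimal sketch (the synthetic data and variable names are illustrative, not part of the lesson's example) contrasting a leaky fit with a correct one:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

# LEAKY: the scaler sees every row, including future test rows
scaler_leaky = StandardScaler().fit(X)

# CORRECT: split first, then fit on the training portion only
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
scaler_clean = StandardScaler().fit(X_train)

# The learned parameters differ: the leaky mean was influenced by test rows
print("Leaky mean:", scaler_leaky.mean_)
print("Clean mean:", scaler_clean.mean_)
```

The difference between the two means is exactly the information that leaked: statistics of rows the model was never supposed to see.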

To illustrate, imagine you are working with a dataset of house prices. If you calculate the scaling parameters (such as the mean or maximum value) using all available data, including the test portion, your model will be tuned to patterns that it should not have access to during training. This can result in an inflated sense of model accuracy during testing.
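A brief sketch of that house-price scenario (the prices are made up for illustration): when MinMaxScaler is fitted on the training portion only, the learned maximum comes from training data, so a pricier test house simply scales above 1 instead of silently compressing the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical house prices; the most expensive house lands in the test split
train_prices = np.array([[150_000.0], [200_000.0], [250_000.0]])
test_prices = np.array([[400_000.0]])

# Min and max are learned from training data only
scaler = MinMaxScaler().fit(train_prices)

print(scaler.transform(test_prices))  # [[2.5]]: above 1, which is expected
```

Had the maximum been computed on all data including the test house, every training value would have been squeezed toward zero by a number the model should never have seen.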

The correct workflow is to split your data into training and test sets first. Then, you fit your scaler (such as a StandardScaler or MinMaxScaler) only on the training data. After fitting, you use the learned parameters to transform both the training and test sets. This ensures that the test data remains truly unseen by the model and its preprocessing steps, providing a realistic measure of model performance.

Definition

A train-test split is the process of dividing your dataset into two subsets: one for training the model and one for testing it. This split helps ensure that the model's evaluation reflects its ability to generalize to new, unseen data, and is a key step in preventing data leakage during preprocessing.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create a synthetic dataset
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0],
              [5.0, 600.0]])

# Split into training and test sets
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

# Initialize the scaler
scaler = StandardScaler()

# Fit scaler only on training data
scaler.fit(X_train)

# Transform both training and test sets using the fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training set after scaling:\n", X_train_scaled)
print("Test set after scaling:\n", X_test_scaled)

Which of the following practices best prevents data leakage when scaling features for a machine learning model?



Section 5. Chapter 2

