Choosing and Evaluating Techniques: Feature Scaling and Normalization Deep Dive

Avoiding Data Leakage in Preprocessing

When you apply feature scaling or normalization to your data, it's crucial to avoid data leakage—a subtle but serious issue that can compromise the validity of your machine learning models. Data leakage occurs when information from outside the training dataset is used to create the model, leading to overly optimistic performance estimates that won't generalize to new, unseen data.

A common place for leakage to slip in is the preprocessing stage. Suppose you want to scale your features using a technique like z-score standardization. If you compute the mean and standard deviation on the entire dataset—including both training and test sets—before splitting, you inadvertently let information from the test set influence the scaling parameters. Your model has then effectively seen part of the test set before evaluation, which biases its performance estimate upward.
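To make the failure mode concrete, here is a minimal sketch (the synthetic data and variable names are illustrative, not part of the lesson's example) contrasting a leaky fit with a correct one:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

# LEAKY: the scaler sees every row, including future test rows
scaler_leaky = StandardScaler().fit(X)

# CORRECT: split first, then fit on the training portion only
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
scaler_clean = StandardScaler().fit(X_train)

# The learned parameters differ: the leaky mean was influenced by test rows
print("Leaky mean:", scaler_leaky.mean_)
print("Clean mean:", scaler_clean.mean_)
```

The difference between the two means is exactly the information that leaked: statistics of rows the model was never supposed to see.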

To illustrate, imagine you are working with a dataset of house prices. If you calculate the scaling parameters (such as the mean or maximum value) using all available data, including the test portion, your model will be tuned to patterns that it should not have access to during training. This can result in an inflated sense of model accuracy during testing.
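A brief sketch of that house-price scenario (the prices are made up for illustration): when MinMaxScaler is fitted on the training portion only, the learned maximum comes from training data, so a pricier test house simply scales above 1 instead of silently compressing the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical house prices; the most expensive house lands in the test split
train_prices = np.array([[150_000.0], [200_000.0], [250_000.0]])
test_prices = np.array([[400_000.0]])

# Min and max are learned from training data only
scaler = MinMaxScaler().fit(train_prices)

print(scaler.transform(test_prices))  # [[2.5]]: above 1, which is expected
```

Had the maximum been computed on all data including the test house, every training value would have been squeezed toward zero by a number the model should never have seen.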

The correct workflow is to split your data into training and test sets first. Then, you fit your scaler (such as a StandardScaler or MinMaxScaler) only on the training data. After fitting, you use the learned parameters to transform both the training and test sets. This ensures that the test data remains truly unseen by the model and its preprocessing steps, providing a realistic measure of model performance.

Definition

A train-test split is the process of dividing your dataset into two subsets: one for training the model and one for testing it. This split helps ensure that the model's evaluation reflects its ability to generalize to new, unseen data, and is a key step in preventing data leakage during preprocessing.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create a synthetic dataset
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0],
              [5.0, 600.0]])

# Split into training and test sets
X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

# Initialize the scaler
scaler = StandardScaler()

# Fit scaler only on training data
scaler.fit(X_train)

# Transform both training and test sets using the fitted scaler
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Training set after scaling:\n", X_train_scaled)
print("Test set after scaling:\n", X_test_scaled)

Which of the following practices best prevents data leakage when scaling features for a machine learning model?



Section 5. Chapter 2

