Common scikit-learn Anti-Patterns
When working with scikit-learn, it is easy to fall into certain traps that can undermine your results or make your workflow difficult to maintain. Some of the most common anti-patterns include:
- Fitting on test data: this happens when you use the test set during training, either by accident or through improper workflow design. This leads to overly optimistic performance estimates and poor generalization;
- Leaking information: information from the test set or future data can accidentally make its way into the training process, often through preprocessing steps performed outside of proper workflows;
- Manual preprocessing outside pipelines: if you scale, encode, or otherwise transform your data manually before passing it to a model, you risk inconsistencies and leaks, especially during cross-validation or in production.
Each of these anti-patterns can lead to invalid results or models that do not perform as expected when deployed. Understanding how these mistakes occur is the first step to avoiding them.
```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# --- Anti-pattern: Manual preprocessing causing data leakage ---
# Fit scaler on the ENTIRE dataset, including the test rows
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data leakage!

# Train/test split AFTER scaling: the scaling statistics were
# influenced by the test data. (Reusing random_state=42 reproduces
# the earlier split, so y_train/y_test still line up row-for-row.)
X_train_scaled, X_test_scaled = train_test_split(X_scaled, random_state=42)
clf = LogisticRegression(max_iter=200)
clf.fit(X_train_scaled, y_train)
print("Anti-pattern test score:", clf.score(X_test_scaled, y_test))

# --- Refactored: Use a Pipeline to avoid leakage ---
# The scaler is now fitted only on X_train, inside pipe.fit()
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200))
])
pipe.fit(X_train, y_train)
print("Pipeline test score:", pipe.score(X_test, y_test))
```
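The same leak is easy to introduce during cross-validation, as mentioned in the bullet on manual preprocessing. The sketch below (using the same iris data, with 5-fold CV as an assumed setup) contrasts scaling the whole dataset before `cross_val_score` with passing a pipeline, which refits the scaler on each training fold only:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Anti-pattern: scaler fitted on ALL rows before cross-validation,
# so every validation fold has influenced the scaling statistics
X_leaky = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=200),
                               X_leaky, y, cv=5)

# Safe: inside cross_val_score, the pipeline refits the scaler
# on each training fold and only transforms the validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
safe_scores = cross_val_score(pipe, X, y, cv=5)

print("Leaky CV mean:", leaky_scores.mean())
print("Safe CV mean:", safe_scores.mean())
```

On a small, well-behaved dataset like iris the two means may be close; on real data with stronger feature shifts between folds, the leaky estimate tends to be optimistic.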
To avoid these anti-patterns, always use scikit-learn's Pipeline and related workflow tools. This ensures that all preprocessing steps are properly fitted only on the training data, and that no information from the test set leaks into the model during training. By integrating every transformation and estimator into a pipeline, you keep your workflow reproducible and safe from subtle mistakes. The refactored code above demonstrates how a pipeline encapsulates both scaling and modeling, applying them in the correct order and only on the appropriate data. Consistently using pipelines and careful splitting of data will help you maintain robust, reliable machine learning workflows.
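Pipelines also compose directly with scikit-learn's model-selection tools, so the per-fold refitting happens automatically even during hyperparameter search. A minimal sketch (the `C` grid is an illustrative choice, not from the original text) using `GridSearchCV`, where step parameters are addressed as `<step>__<param>`:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200)),
])

# Pipeline step parameters use the <step>__<param> naming convention
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)  # scaler is refit inside every CV fold

print("Best C:", grid.best_params_['clf__C'])
print("Test score:", grid.score(X_test, y_test))
```

Because the search receives the unscaled training data, every candidate is evaluated without leakage, and `grid.score` applies the best pipeline (scaler included) to the held-out test set.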