MLOps Fundamentals with Python

Model Training Pipelines


A well-structured model training pipeline is essential in MLOps, allowing you to build, test, and deploy machine learning models efficiently and reliably. The core principle of modular pipeline design is to divide the workflow into clear, reusable components. Typically, you separate the pipeline into distinct stages: data loading, preprocessing, model training, and evaluation. Each stage is handled by a dedicated function or module, making the pipeline easier to understand, debug, and extend. This approach ensures that changes in one part of the pipeline, such as switching to a new preprocessing method, do not disrupt the entire workflow. By keeping data loading, preprocessing, training, and evaluation logically distinct, you can quickly adapt your pipeline to new datasets, algorithms, or evaluation criteria.

A critical rule in MLOps pipelines is that data splitting must happen before preprocessing. Fitting the scaler on the entire dataset before splitting would expose the model to statistical information from the test set during training — a problem known as data leakage. To prevent this, always split the data first, then fit the scaler exclusively on the training set and use it to transform both the training and test sets.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

# Data loading function
def load_data():
    iris = load_iris(as_frame=True)
    X = iris.data
    y = iris.target
    return X, y

# Preprocessing function: fit scaler on training data only
def preprocess_data(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # Fitting only on training data
    X_test_scaled = scaler.transform(X_test)        # Applying the same scaler to test data
    return X_train_scaled, X_test_scaled, scaler

# Training function
def train_model(X_train, y_train):
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    return model

# Evaluation function: returns metrics for tracking across runs
def evaluate_model(model, X_train, y_train, X_test, y_test):
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Train accuracy: {train_accuracy:.2f}")
    print(f"Test accuracy: {test_accuracy:.2f}")
    return {"train_accuracy": train_accuracy, "test_accuracy": test_accuracy}

# Pipeline execution
X, y = load_data()
# Splitting before preprocessing to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled, X_test_scaled, scaler = preprocess_data(X_train, X_test)
model = train_model(X_train_scaled, y_train)
metrics = evaluate_model(model, X_train_scaled, y_train, X_test_scaled, y_test)

# Saving the model and scaler for production deployment
joblib.dump(model, "model.pkl")
joblib.dump(scaler, "scaler.pkl")
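To see the leakage problem concretely, compare a scaler fitted on the full dataset with one fitted on the training set only. This is a minimal illustrative sketch, not part of the pipeline above; the toy array and variable names (`leaky`, `clean`) are assumptions for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy one-feature dataset: the values 0..19
X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Wrong order: statistics are computed from the full dataset,
# so the test set's values influence the training features
leaky = StandardScaler().fit(X)

# Right order: statistics come from the training set only
clean = StandardScaler().fit(X_train)

print("Mean seen by leaky scaler:", leaky.mean_)
print("Mean seen by clean scaler:", clean.mean_)
```

The two means differ, which is exactly the point: the "leaky" scaler has absorbed information the model should never see during training.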

Automating your model training pipeline brings significant benefits for both scalability and maintainability. Automated pipelines reduce manual intervention, allowing you to train and evaluate models on new data or with new configurations quickly and consistently. This is crucial when scaling up to handle larger datasets or more frequent experiments. Modular pipelines also make maintenance easier: you can update or replace individual components, such as swapping out a preprocessing step, without rewriting the entire workflow. This flexibility supports rapid iteration and robust collaboration, both of which are key in production-grade machine learning systems.
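One way to sketch this component-swapping idea is with scikit-learn's `Pipeline`, which chains named stages so that any stage can be replaced without touching the rest. The step names `"scale"` and `"model"` below are illustrative choices, not fixed API names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each named step is an independent, replaceable component
pipe = Pipeline([
    ("scale", StandardScaler()),                   # preprocessing stage
    ("model", LogisticRegression(max_iter=200)),   # training stage
])
pipe.fit(X_train, y_train)
print(f"StandardScaler accuracy: {pipe.score(X_test, y_test):.2f}")

# Swap the preprocessing step without rewriting the rest of the workflow
pipe.set_params(scale=MinMaxScaler())
pipe.fit(X_train, y_train)
print(f"MinMaxScaler accuracy: {pipe.score(X_test, y_test):.2f}")
```

Because the pipeline is fitted as a whole, the scaler is always fitted on the training data only, so this structure also enforces the leakage rule described above.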


Why is a modular pipeline design beneficial in MLOps?

Select all correct answers.

Section 1, Chapter 5
