MLOps Fundamentals with Python

Model Training Pipelines


A well-structured model training pipeline is essential in MLOps, allowing you to build, test, and deploy machine learning models efficiently and reliably. The core principle of modular pipeline design is to divide the workflow into clear, reusable components. Typically, you separate the pipeline into distinct stages: data loading, preprocessing, model training, and evaluation. Each stage is handled by a dedicated function or module, making the pipeline easier to understand, debug, and extend. This approach ensures that changes in one part of the pipeline, such as switching to a new preprocessing method, do not disrupt the entire workflow. By keeping data loading, preprocessing, training, and evaluation logically distinct, you can quickly adapt your pipeline to new datasets, algorithms, or evaluation criteria.

A critical rule in MLOps pipelines is that data splitting must happen before preprocessing. Fitting the scaler on the entire dataset before splitting would expose the model to statistical information from the test set during training — a problem known as data leakage. To prevent this, always split the data first, then fit the scaler exclusively on the training set and use it to transform both the training and test sets.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

# Data loading function
def load_data():
    iris = load_iris(as_frame=True)
    X = iris.data
    y = iris.target
    return X, y

# Preprocessing function: fit scaler on training data only
def preprocess_data(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # Fitting only on training data
    X_test_scaled = scaler.transform(X_test)        # Applying the same scaler to test data
    return X_train_scaled, X_test_scaled, scaler

# Training function
def train_model(X_train, y_train):
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    return model

# Evaluation function: returns metrics for tracking across runs
def evaluate_model(model, X_train, y_train, X_test, y_test):
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Train accuracy: {train_accuracy:.2f}")
    print(f"Test accuracy: {test_accuracy:.2f}")
    return {"train_accuracy": train_accuracy, "test_accuracy": test_accuracy}

# Pipeline execution
X, y = load_data()
# Splitting before preprocessing to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled, X_test_scaled, scaler = preprocess_data(X_train, X_test)
model = train_model(X_train_scaled, y_train)
metrics = evaluate_model(model, X_train_scaled, y_train, X_test_scaled, y_test)

# Saving the model and scaler for production deployment
joblib.dump(model, "model.pkl")
joblib.dump(scaler, "scaler.pkl")
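To see the leakage problem concretely, compare a scaler fitted on the full dataset with one fitted on the training set only. This is a minimal illustrative sketch, not part of the pipeline above; the toy array and variable names (`leaky`, `clean`) are assumptions for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy one-feature dataset: the values 0..19
X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Wrong order: statistics are computed from the full dataset,
# so the test set's values influence the training features
leaky = StandardScaler().fit(X)

# Right order: statistics come from the training set only
clean = StandardScaler().fit(X_train)

print("Mean seen by leaky scaler:", leaky.mean_)
print("Mean seen by clean scaler:", clean.mean_)
```

The two means differ, which is exactly the point: the "leaky" scaler has absorbed information the model should never see during training.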

Automating your model training pipeline brings significant benefits for both scalability and maintainability. Automated pipelines reduce manual intervention, allowing you to train and evaluate models on new data or with new configurations quickly and consistently. This is crucial when scaling up to handle larger datasets or more frequent experiments. Modular pipelines also make maintenance easier: you can update or replace individual components, such as swapping out a preprocessing step, without rewriting the entire workflow. This flexibility supports rapid iteration and robust collaboration, both of which are key in production-grade machine learning systems.
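One way to sketch this component-swapping idea is with scikit-learn's `Pipeline`, which chains named stages so that any stage can be replaced without touching the rest. The step names `"scale"` and `"model"` below are illustrative choices, not fixed API names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each named step is an independent, replaceable component
pipe = Pipeline([
    ("scale", StandardScaler()),                   # preprocessing stage
    ("model", LogisticRegression(max_iter=200)),   # training stage
])
pipe.fit(X_train, y_train)
print(f"StandardScaler accuracy: {pipe.score(X_test, y_test):.2f}")

# Swap the preprocessing step without rewriting the rest of the workflow
pipe.set_params(scale=MinMaxScaler())
pipe.fit(X_train, y_train)
print(f"MinMaxScaler accuracy: {pipe.score(X_test, y_test):.2f}")
```

Because the pipeline is fitted as a whole, the scaler is always fitted on the training data only, so this structure also enforces the leakage rule described above.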


Why is a modular pipeline design beneficial in MLOps?

Select all correct answers.

Section 1, Chapter 5
