Learn Monitoring Model and Data Drift | Monitoring and Continuous Delivery
MLOps for Machine Learning Engineers

Monitoring Model and Data Drift

Machine learning models in production face a dynamic environment where both the data and the underlying business context can change over time. Two key phenomena to watch for are model drift and data drift.

Model drift refers to the decline in model performance as the relationship between input features and the target variable changes. There are two main types of model drift:

  • Concept drift: the statistical relationship between features and the target variable changes over time; this means the model's underlying assumptions no longer hold, so predictions become less accurate;
  • Performance drift: the model's accuracy or other evaluation metrics degrade, even if the feature-target relationship appears stable; this can result from changes in external factors or evolving business objectives.
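Performance drift is usually caught by tracking evaluation metrics on recently labeled production data and comparing them against a baseline. The sketch below is illustrative: the data is simulated, and `rolling_accuracy` is a hypothetical helper, not a library function.

```python
import numpy as np

def rolling_accuracy(y_true, y_pred, window=100):
    """Accuracy over a sliding window of the most recent predictions."""
    correct = (np.asarray(y_true) == np.asarray(y_pred)).astype(float)
    if len(correct) < window:
        return correct.mean()
    return correct[-window:].mean()

# Simulated labeled production stream where prediction quality degrades:
# the probability of a wrong prediction grows from 5% to 35% over time.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
flip = rng.random(500) < np.linspace(0.05, 0.35, 500)
y_pred = np.where(flip, 1 - y_true, y_true)

early = rolling_accuracy(y_true[:100], y_pred[:100])
late = rolling_accuracy(y_true, y_pred)
print(f"early accuracy: {early:.2f}, recent accuracy: {late:.2f}")
```

Comparing the rolling metric against an alert threshold (for example, a fixed drop from the validation-set score) turns this into an automated performance-drift check.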

Data drift, on the other hand, occurs when the distribution of input data itself shifts from what the model was originally trained on. Data drift can be categorized as:

  • Covariate drift: the distribution of input features changes, but the relationship between features and target remains the same;
  • Prior probability drift: the distribution of the target variable changes, such as a shift in the proportion of classes in classification problems;
  • Feature distribution drift: specific input features experience changes in their statistical properties, such as mean or variance, which may impact model predictions.
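A common way to quantify feature distribution drift is the Population Stability Index (PSI), which compares binned frequencies of a feature between a reference sample and recent data. The implementation below is a minimal sketch; the frequently cited cut-offs (below 0.1: stable, above 0.25: significant shift) are rules of thumb, not hard standards.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Clip to avoid division by zero / log(0) for empty bins
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)   # training snapshot
stable = rng.normal(loc=0.0, scale=1.0, size=1000)     # same distribution
shifted = rng.normal(loc=0.5, scale=1.2, size=1000)    # drifted feature
print(f"PSI (stable):  {psi(baseline, stable):.3f}")
print(f"PSI (shifted): {psi(baseline, shifted):.3f}")
```

Because each term of the sum is non-negative, PSI is zero only when the binned distributions match exactly and grows as they diverge.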

Monitoring for these changes is essential: if you do not detect drift, your model's predictions may become unreliable, leading to poor business outcomes or even critical failures in automated decision systems. Effective monitoring lets you catch these issues early and trigger retraining, model updates, or deeper investigations as needed.

Note
Definition

Model drift occurs when a model's performance degrades because the data distribution or the feature-target relationship has changed since training.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

# Simulated training data and recent production data
np.random.seed(42)
training_feature = np.random.normal(loc=0, scale=1, size=1000)
recent_feature = np.random.normal(loc=0.5, scale=1.2, size=1000)

# Plot distributions
plt.figure(figsize=(10, 5))
plt.hist(training_feature, bins=30, alpha=0.5, label="Training Data", density=True)
plt.hist(recent_feature, bins=30, alpha=0.5, label="Recent Data", density=True)
plt.legend()
plt.title("Feature Distribution: Training vs. Recent Data")
plt.xlabel("Feature Value")
plt.ylabel("Density")
plt.show()

# Use Kolmogorov-Smirnov test to compare distributions
statistic, p_value = ks_2samp(training_feature, recent_feature)
print(f"KS Statistic: {statistic:.3f}, p-value: {p_value:.3f}")
if p_value < 0.05:
    print("Significant data drift detected.")
else:
    print("No significant data drift detected.")

Which statement best describes the differences between concept drift, performance drift, covariate drift, and prior probability drift?

Select the correct answer


Section 5. Chapter 1

