Learn Model-Based Detection | Model-Based Monitoring

Model-based drift detection is a practical way to identify changes in data distributions using machine learning classifiers.

Combine the reference (historical) and current (incoming) datasets and assign labels: 0 for reference and 1 for current;
Train a classifier, such as logistic regression, to predict that label;
The classifier learns any systematic differences between the two groups;
If it separates the two groups well on held-out data, the distributions have likely shifted.

This model-based approach is especially valuable for complex or high-dimensional data, where traditional statistical tests may not be sensitive enough to detect subtle changes.


import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

# Generate synthetic reference data (normal distribution)
np.random.seed(42)
reference = np.random.normal(loc=0, scale=1, size=(500, 2))
reference_labels = np.zeros(reference.shape[0])

# Generate synthetic current data (drifted: shifted mean)
current = np.random.normal(loc=1.5, scale=1, size=(500, 2))
current_labels = np.ones(current.shape[0])

# Combine datasets
X = np.vstack([reference, current])
y = np.concatenate([reference_labels, current_labels])

# Split into train/test for the drift detector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit logistic regression classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred_proba)

print("Drift Detector Accuracy:", accuracy)
print("Drift Detector AUC:", auc)

The classifier's performance on the held-out test set, measured by metrics such as accuracy and AUC (Area Under the ROC Curve), indicates how much the two samples differ:

If the classifier achieves high accuracy or high AUC when distinguishing reference from current data, the two distributions differ enough for the model to separate them;
If the classifier performs close to random guessing (AUC near 0.5, or accuracy near 0.5 when the two samples are the same size, as they are here), there is little to no detectable drift.
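
To see what "close to random guessing" looks like in practice, it can help to run the same detector on two samples drawn from one distribution and compare the result with a drifted pair. The following is a minimal sketch rather than part of the lesson's own code: the helper name domain_auc, the choice of 5-fold cross-validation instead of a single train/test split, and the synthetic samples are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def domain_auc(reference, current, cv=5):
    """Mean cross-validated AUC of a classifier separating the two samples.

    Hypothetical helper, for illustration only.
    """
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    clf = LogisticRegression(max_iter=1000)
    # With a classifier and an integer cv, scikit-learn uses stratified folds,
    # so every fold contains rows from both samples.
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    return scores.mean()


rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
same_dist = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # no drift
shifted = rng.normal(loc=1.5, scale=1.0, size=(500, 2))    # mean shift

auc_no_drift = domain_auc(reference, same_dist)
auc_drift = domain_auc(reference, shifted)

print("AUC without drift:", round(auc_no_drift, 3))
print("AUC with drift:", round(auc_drift, 3))
```

The no-drift AUC should land near 0.5, though it will rarely be exactly 0.5 on a finite sample, which is one reason a decision threshold is usually set some distance above that value.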

This model-based approach provides a flexible and scalable way to monitor for distribution shifts, especially in complex or high-dimensional feature spaces where traditional statistical tests may fail to capture subtle changes.
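
The high-dimensional case can be illustrated with a small experiment. In the sketch below, every feature drifts by only a small amount, so ranking rows on any single feature separates the samples poorly, while a classifier that combines all features can do noticeably better. The sample size, the number of features, and the 0.15 shift are assumptions chosen for illustration, and the per-feature AUC is only a rough stand-in for a univariate statistical test.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 50

# Assumed scenario: each feature's mean moves by 0.15 standard deviations.
reference = rng.normal(loc=0.0, scale=1.0, size=(n_samples, n_features))
current = rng.normal(loc=0.15, scale=1.0, size=(n_samples, n_features))

X = np.vstack([reference, current])
y = np.concatenate([np.zeros(n_samples), np.ones(n_samples)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# AUC obtained by ranking test rows on one raw feature at a time
single_feature_aucs = [
    roc_auc_score(y_test, X_test[:, j]) for j in range(n_features)
]

# AUC of a classifier that combines all features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
combined_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print("Best single-feature AUC:", round(max(single_feature_aucs), 3))
print("Combined classifier AUC:", round(combined_auc, 3))
```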

Task

You're given two unlabeled datasets from different periods: a reference sample and a current sample. Treat “dataset origin” as a binary label (0 = reference, 1 = current). Train a simple classifier to predict origin; if the model separates them well, distribution shift is likely.

Steps:

Generate synthetic ref and new data (given).
Build domain labels y_domain (0 for ref, 1 for new) and stack into X.
Split into train/test (test_size=0.3, random_state=42).
Train LogisticRegression(max_iter=1000, random_state=0).
Get probabilities on test set; compute auc_score = roc_auc_score(...).
Set drift_detected = (auc_score >= 0.65) and print shapes, AUC, flag.
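
One possible way to work through these steps is sketched below. It is not the course's reference solution. The task supplies its own ref and new arrays, so the synthetic data generated here (500 rows, 2 features, mean shifted by 1.0) is an assumed stand-in; the remaining names and parameters follow the steps listed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumed stand-in for the data the task provides
rng = np.random.default_rng(42)
ref = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
new = rng.normal(loc=1.0, scale=1.0, size=(500, 2))

# Domain labels: 0 = reference, 1 = current
X = np.vstack([ref, new])
y_domain = np.concatenate([np.zeros(len(ref)), np.ones(len(new))])

# Hold out 30% of the rows to evaluate the detector
X_train, X_test, y_train, y_test = train_test_split(
    X, y_domain, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(X_train, y_train)

# Probability that each test row comes from the current sample
proba = clf.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, proba)
drift_detected = bool(auc_score >= 0.65)

print("Shapes:", X_train.shape, X_test.shape)
print("AUC:", round(auc_score, 3))
print("Drift detected:", drift_detected)
```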
