Model-Based Detection
Model-based drift detection is a practical way to identify changes in data distributions using machine learning classifiers.
- Train a classifier, such as logistic regression, to distinguish between reference (historical) data and current (incoming) data;
- Combine both datasets and assign labels: 0 for reference and 1 for current;
- The classifier learns to detect systematic differences between the two groups;
- If the classifier separates the datasets well, it indicates a shift or drift has occurred between the distributions.
This model-based approach is especially valuable for complex or high-dimensional data, where traditional statistical tests may not be sensitive enough to detect subtle changes.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, accuracy_score
    from sklearn.model_selection import train_test_split

    # Generate synthetic reference data (normal distribution)
    np.random.seed(42)
    reference = np.random.normal(loc=0, scale=1, size=(500, 2))
    reference_labels = np.zeros(reference.shape[0])

    # Generate synthetic current data (drifted: shifted mean)
    current = np.random.normal(loc=1.5, scale=1, size=(500, 2))
    current_labels = np.ones(current.shape[0])

    # Combine datasets
    X = np.vstack([reference, current])
    y = np.concatenate([reference_labels, current_labels])

    # Split into train/test for the drift detector
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Fit logistic regression classifier
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Predict and evaluate on the held-out split
    y_pred = clf.predict(X_test)
    y_pred_proba = clf.predict_proba(X_test)[:, 1]  # probability of the "current" class
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)

    print("Drift Detector Accuracy:", accuracy)
    print("Drift Detector AUC:", auc)
The classifier's performance, measured by metrics such as accuracy and AUC (Area Under the ROC Curve), shows how separable the two samples are, and therefore how much drift is present:
- If the classifier achieves high accuracy or high AUC when distinguishing between reference and current data, the two distributions are different enough for the model to separate them;
- If the classifier performs close to random guessing (accuracy or AUC near 0.5), it suggests little to no detectable drift.
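How close to 0.5 counts as "close to random guessing" depends on the sample size, so it can help to compare the observed AUC against a baseline obtained by shuffling the origin labels. The sketch below is one possible way to do this, using scikit-learn's `permutation_test_score`. The helper name `domain_auc_with_pvalue`, the sample sizes, and the choice of 5 folds and 50 permutations are assumptions made for illustration and are not part of the lesson.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score


def domain_auc_with_pvalue(reference, current, n_permutations=50):
    """Hypothetical helper: cross-validated domain-classifier AUC and a permutation p-value."""
    X = np.vstack([reference, current])
    # Origin labels: 0 = reference, 1 = current
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    score, _, p_value = permutation_test_score(
        LogisticRegression(max_iter=1000),
        X,
        y,
        scoring="roc_auc",
        cv=5,
        n_permutations=n_permutations,
        random_state=0,
    )
    return score, p_value


# Assumed synthetic data: one sample from the same distribution, one with a shifted mean
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
same_dist = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
drifted = rng.normal(loc=1.5, scale=1.0, size=(300, 2))

auc_same, p_same = domain_auc_with_pvalue(reference, same_dist)
auc_drift, p_drift = domain_auc_with_pvalue(reference, drifted)

print("No drift -> AUC:", round(auc_same, 3), "p-value:", round(p_same, 3))
print("Drift    -> AUC:", round(auc_drift, 3), "p-value:", round(p_drift, 3))
```

A small p-value means the observed AUC is unlikely to arise when the origin labels carry no information, which supports reading a high AUC as drift rather than noise.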
This model-based approach provides a flexible and scalable way to monitor for distribution shifts across many features at once.
You're given two unlabeled datasets from different periods: a reference sample and a current sample. Treat “dataset origin” as a binary label (0 = reference, 1 = current). Train a simple classifier to predict origin; if the model separates them well, distribution shift is likely.
Steps:
- Generate synthetic ref and new data (given).
- Build domain labels y_domain (0 for ref, 1 for new) and stack into X.
- Split into train/test (test_size=0.3, random_state=42).
- Train LogisticRegression(max_iter=1000, random_state=0).
- Get probabilities on test set; compute auc_score = roc_auc_score(...).
- Set drift_detected = (auc_score >= 0.65) and print shapes, AUC, flag.
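The steps above can be sketched as follows. The exercise supplies ref and new, but their generation is not shown here, so the synthetic arrays below (400 rows, 3 features, mean shifted by 0.8) are stand-in assumptions. The variable names, classifier settings, split parameters, and the 0.65 threshold come from the steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# ASSUMPTION: stand-in data; the exercise provides its own ref and new samples
rng = np.random.default_rng(0)
ref = rng.normal(loc=0.0, scale=1.0, size=(400, 3))
new = rng.normal(loc=0.8, scale=1.0, size=(400, 3))

# Domain labels: 0 for ref, 1 for new, stacked in the same order as X
y_domain = np.concatenate([np.zeros(len(ref), dtype=int), np.ones(len(new), dtype=int)])
X = np.vstack([ref, new])

# Hold out 30% of the rows to evaluate the domain classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y_domain, test_size=0.3, random_state=42
)

clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(X_train, y_train)

# Probability that each test row comes from the new sample
proba_new = clf.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, proba_new)
drift_detected = bool(auc_score >= 0.65)

print("Shapes:", X.shape, X_train.shape, X_test.shape)
print("AUC:", round(auc_score, 3))
print("Drift detected:", drift_detected)
```

The 0.65 cutoff is a rule of thumb from the task; a suitable threshold for real data depends on sample size and on how much drift the downstream model can tolerate.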