What Is Drift
In machine learning, drift refers to a change in the underlying data or relationships that a model relies on to make predictions. There are three main types of drift you should understand: data drift, feature drift, and concept drift.
Data drift is a broad term that describes any change in the statistical properties of the input data over time. This might mean the overall distribution of the dataset has shifted, which can affect model performance even if the relationships between features and targets remain the same.
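To make this concrete, here is a minimal sketch of comparing two snapshots of the same dataset, for example your training data and a recent batch of production data. The reference and current dictionaries and their synthetic features are assumptions made purely for illustration; the point is that dataset-level statistics changing over time is a signal of data drift.

import numpy as np

# Synthetic snapshots of the same dataset taken at two points in time.
# In practice these would be your training data and recent production data.
rng = np.random.default_rng(0)
reference = {
    "age": rng.normal(loc=40, scale=8, size=1000),
    "income": rng.normal(loc=55000, scale=9000, size=1000),
}
current = {
    "age": rng.normal(loc=46, scale=10, size=1000),      # center and spread have shifted
    "income": rng.normal(loc=55500, scale=9000, size=1000),
}

# Compare per-feature means and standard deviations between the two snapshots
for name in reference:
    print(
        f"{name}: mean {reference[name].mean():.1f} -> {current[name].mean():.1f}, "
        f"std {reference[name].std():.1f} -> {current[name].std():.1f}"
    )

In practice you would run the same comparison for every feature your model consumes and alert when a change exceeds a threshold you choose for your use case.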
Feature drift is a more specific case where the distribution of one or more individual features changes. For example, the average age of customers in your dataset might increase over time, or the range of values for a sensor reading might shift.
Concept drift occurs when the relationship between input features and the target variable changes. This means that even if the input data appears similar, the way it maps to the output has changed. For instance, if a model predicts whether an email is spam, but spammers start using new tactics, the features that once indicated spam may no longer be reliable.
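The following sketch shows concept drift with synthetic data rather than the spam example: a classifier is trained while the label follows one rule, and then the rule flips while the inputs keep the same distribution. The synthetic data, the flipped rule, and the use of scikit-learn's LogisticRegression are assumptions made for illustration, not part of this lesson's setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Period 1: the label mostly follows the rule "feature above 0 means class 1"
X_old = rng.normal(0, 1, size=(1000, 1))
y_old = (X_old[:, 0] + rng.normal(0, 0.3, size=1000) > 0).astype(int)

# Period 2: the inputs look the same, but the rule has flipped
X_new = rng.normal(0, 1, size=(1000, 1))
y_new = (X_new[:, 0] + rng.normal(0, 0.3, size=1000) < 0).astype(int)

model = LogisticRegression().fit(X_old, y_old)

print("Accuracy on period 1:", accuracy_score(y_old, model.predict(X_old)))
print("Accuracy on period 2:", accuracy_score(y_new, model.predict(X_new)))

Because the inputs in both periods come from the same distribution, monitoring the inputs alone would not flag a problem here; only tracking the model's accuracy against fresh labels reveals the concept drift.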
Understanding the differences between these types of drift is crucial for maintaining reliable machine learning pipelines. If you do not monitor for drift, your models can become less accurate, leading to poor decisions and outcomes.
Common causes of drift include:
- Temporal changes: data naturally evolves over time;
- Sampling bias: data collection methods or sources change, introducing new patterns;
- Behavioral shifts: users, customers, or systems change their behavior, leading to new data trends.
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic feature data for two time periods
np.random.seed(42)
feature_period1 = np.random.normal(loc=50, scale=5, size=1000)
feature_period2 = np.random.normal(loc=55, scale=7, size=1000)

plt.figure(figsize=(8, 5))
plt.hist(feature_period1, bins=30, alpha=0.6, label="Period 1", color="blue", density=True)
plt.hist(feature_period2, bins=30, alpha=0.6, label="Period 2", color="orange", density=True)
plt.title("Feature Distribution Over Time")
plt.xlabel("Feature Value")
plt.ylabel("Density")
plt.legend()
plt.show()
You can often spot feature drift by visually comparing feature distributions from different time periods, as in the plot above. If the shapes, centers, or spreads of the distributions change noticeably, this is a strong indicator of drift. For example, if the histogram for "Period 2" is shifted to the right and has a wider spread than "Period 1", it means the feature's average value and variability have both changed. Such changes can impact your model's predictions and may require retraining or adjustment.
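If you want a number to accompany the visual check, one common option is a two-sample statistical test that asks whether the two periods could plausibly come from the same distribution. The sketch below applies SciPy's ks_2samp (the two-sample Kolmogorov-Smirnov test) to the same synthetic data as above; the choice of test and the SciPy dependency are assumptions for illustration, not part of this lesson's code.

import numpy as np
from scipy.stats import ks_2samp

# Re-create the two synthetic periods from the example above
np.random.seed(42)
feature_period1 = np.random.normal(loc=50, scale=5, size=1000)
feature_period2 = np.random.normal(loc=55, scale=7, size=1000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# two samples were drawn from different distributions
result = ks_2samp(feature_period1, feature_period2)
print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3g}")

A small p-value is consistent with the distribution having changed, but with large samples even tiny, harmless shifts can be flagged, so choose thresholds with your application in mind.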