Kolmogorov–Smirnov (KS) Test | Statistical Drift Detection
Feature Drift and Data Drift Detection

Kolmogorov–Smirnov (KS) Test

The Kolmogorov–Smirnov (KS) test is a non-parametric statistical method for checking if two samples come from the same distribution.

In feature drift detection, use the KS test to compare the distributions of a continuous feature from two datasets:

  • The reference dataset (historical or baseline data);
  • The current dataset (new or recent data).

The KS test works by measuring the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples. A large distance suggests that the underlying distributions have changed, which may indicate drift.

How the KS Test Works

  • For each value in the combined dataset, calculate the proportion of samples in each group that are less than or equal to that value;
  • Find the largest difference between these proportions — this is the KS statistic;
  • Compare the KS statistic to a critical value, or use it to compute a p-value to determine if the difference is statistically significant.

If the KS statistic is large and the p-value is small, you have evidence of a change in the feature distribution.
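The steps above can be sketched directly: the KS statistic is just the largest gap between the two empirical CDFs evaluated over the pooled data. This is an illustrative implementation with made-up sample data, not the library's internal code:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum absolute difference
    between the empirical CDFs of the two samples."""
    # Evaluate both ECDFs at every value in the pooled data
    values = np.sort(np.concatenate([a, b]))
    ecdf_a = np.searchsorted(np.sort(a), values, side="right") / len(a)
    ecdf_b = np.searchsorted(np.sort(b), values, side="right") / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)   # reference-like sample
b = rng.normal(0.5, 1.0, 500)   # shifted sample
print("KS statistic:", ks_statistic(a, b))
```

In practice you would use `scipy.stats.ks_2samp`, as the example below does; this sketch only makes the ECDF-distance idea concrete.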

import numpy as np
from scipy.stats import ks_2samp

# Generate synthetic data for two periods
np.random.seed(42)
reference = np.random.normal(loc=0.0, scale=1.0, size=1000)
current = np.random.normal(loc=0.5, scale=1.0, size=1000)

# Apply the Kolmogorov–Smirnov test
ks_stat, p_value = ks_2samp(reference, current)

print("KS Statistic:", ks_stat)
print("P-value:", p_value)
Note

The KS test is well-suited for detecting drift in continuous features because it makes no assumptions about the underlying distributions. However, it is not appropriate for categorical or discrete features, and its sensitivity can be affected by sample size. Additionally, the test is most powerful when the two samples are independent and free from ties (duplicate values).
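The note on sample size can be seen directly: with enough data, even a tiny, practically irrelevant shift yields a "significant" p-value. The shift and sample sizes below are illustrative choices on synthetic data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
shift = 0.05  # a tiny shift, likely irrelevant in practice

# Small samples: the shift is usually invisible to the test
small = ks_2samp(rng.normal(0, 1, 200), rng.normal(shift, 1, 200))

# Large samples: the same shift becomes statistically significant
large = ks_2samp(rng.normal(0, 1, 200_000), rng.normal(shift, 1, 200_000))

print("n=200:     p =", small.pvalue)
print("n=200000:  p =", large.pvalue)
```

This is why, in monitoring, the size of the KS statistic matters alongside the p-value: statistical significance alone does not imply a practically meaningful change.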

When using the KS test for drift detection, you interpret the results by examining the p-value and the KS statistic. A small p-value (typically less than 0.05) indicates that the null hypothesis — that both samples come from the same distribution — can be rejected. This suggests that there is statistically significant drift in the feature distribution. The larger the KS statistic, the greater the difference between the two distributions. In practical monitoring, you would flag features with low p-values for further investigation or action.
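In a monitoring job, this interpretation might look like the following sketch. The feature dictionaries, function name, and threshold are illustrative assumptions, not part of any specific monitoring library:

```python
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.05  # illustrative significance threshold

def drifted_features(reference, current, alpha=ALPHA):
    """Flag continuous features whose KS p-value falls below alpha,
    returning their KS statistic and p-value for investigation."""
    flagged = {}
    for name in reference:
        stat, p = ks_2samp(reference[name], current[name])
        if p < alpha:
            flagged[name] = {"ks_stat": stat, "p_value": p}
    return flagged

# Synthetic example: "age" drifts, "income" does not
rng = np.random.default_rng(7)
ref = {"age": rng.normal(40, 10, 2000), "income": rng.normal(50, 5, 2000)}
cur = {"age": rng.normal(45, 10, 2000), "income": rng.normal(50, 5, 2000)}
print(drifted_features(ref, cur))
```

A real pipeline would typically also log the KS statistic for flagged features, since it indicates how large the distributional change is, not just whether it is significant.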

Task


You are given two numerical feature samples representing data distributions at different times: sample_ref (reference) and sample_new (current).

Your task is to perform a Kolmogorov–Smirnov (KS) test to check if the new data distribution significantly differs from the reference.

Steps:

  1. Import the required function from scipy.stats.
  2. Compute the KS statistic and p-value using the two samples.
  3. Compare the p-value to alpha = 0.05.
  4. If p_value < alpha, mark drift_detected = True; otherwise False.
  5. Print the KS statistic, p-value, and drift status.
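The steps above can be sketched as follows. The synthetic `sample_ref` and `sample_new` arrays here stand in for the arrays the exercise provides:

```python
import numpy as np
from scipy.stats import ks_2samp  # step 1: import from scipy.stats

# Stand-ins for the samples provided by the exercise
rng = np.random.default_rng(1)
sample_ref = rng.normal(0.0, 1.0, 1000)
sample_new = rng.normal(0.8, 1.0, 1000)

# Step 2: compute the KS statistic and p-value
ks_stat, p_value = ks_2samp(sample_ref, sample_new)

# Steps 3-4: compare against the significance level
alpha = 0.05
drift_detected = p_value < alpha

# Step 5: report the results
print("KS Statistic:", ks_stat)
print("P-value:", p_value)
print("Drift detected:", drift_detected)
```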

Solution


Section 2. Chapter 1
