Kolmogorov–Smirnov (KS) Test | Statistical Drift Detection
Feature Drift and Data Drift Detection

Kolmogorov–Smirnov (KS) Test

The Kolmogorov–Smirnov (KS) test is a non-parametric statistical method for checking if two samples come from the same distribution.

In feature drift detection, use the KS test to compare the distributions of a continuous feature from two datasets:

  • The reference dataset (historical or baseline data);
  • The current dataset (new or recent data).

The KS test works by measuring the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples. A large distance suggests that the underlying distributions have changed, which may indicate drift.

How the KS Test Works

  • For each value in the combined dataset, calculate the proportion of samples in each group that are less than or equal to that value;
  • Find the largest difference between these proportions — this is the KS statistic;
  • Compare the KS statistic to a critical value, or use it to compute a p-value to determine if the difference is statistically significant.

If the KS statistic is large and the p-value is small, you have evidence of a change in the feature distribution.
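The steps above can be sketched directly: the KS statistic is just the largest gap between the two empirical CDFs evaluated over the pooled data. This is an illustrative implementation with made-up sample data, not the library's internal code:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum absolute difference
    between the empirical CDFs of the two samples."""
    # Evaluate both ECDFs at every value in the pooled data
    values = np.sort(np.concatenate([a, b]))
    ecdf_a = np.searchsorted(np.sort(a), values, side="right") / len(a)
    ecdf_b = np.searchsorted(np.sort(b), values, side="right") / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)   # reference-like sample
b = rng.normal(0.5, 1.0, 500)   # shifted sample
print("KS statistic:", ks_statistic(a, b))
```

In practice you would use `scipy.stats.ks_2samp`, as the example below does; this sketch only makes the ECDF-distance idea concrete.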

import numpy as np
from scipy.stats import ks_2samp

# Generate synthetic data for two periods
np.random.seed(42)
reference = np.random.normal(loc=0.0, scale=1.0, size=1000)
current = np.random.normal(loc=0.5, scale=1.0, size=1000)

# Apply the Kolmogorov–Smirnov test
ks_stat, p_value = ks_2samp(reference, current)

print("KS Statistic:", ks_stat)
print("P-value:", p_value)
Note

The KS test is well-suited for detecting drift in continuous features because it makes no assumptions about the underlying distributions. However, it is not appropriate for categorical or discrete features, and its sensitivity can be affected by sample size. Additionally, the test is most powerful when the two samples are independent and free from ties (duplicate values).
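The note on sample size can be seen directly: with enough data, even a tiny, practically irrelevant shift yields a "significant" p-value. The shift and sample sizes below are illustrative choices on synthetic data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
shift = 0.05  # a tiny shift, likely irrelevant in practice

# Small samples: the shift is usually invisible to the test
small = ks_2samp(rng.normal(0, 1, 200), rng.normal(shift, 1, 200))

# Large samples: the same shift becomes statistically significant
large = ks_2samp(rng.normal(0, 1, 200_000), rng.normal(shift, 1, 200_000))

print("n=200:     p =", small.pvalue)
print("n=200000:  p =", large.pvalue)
```

This is why, in monitoring, the size of the KS statistic matters alongside the p-value: statistical significance alone does not imply a practically meaningful change.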

When using the KS test for drift detection, you interpret the results by examining the p-value and the KS statistic. A small p-value (typically less than 0.05) indicates that the null hypothesis — that both samples come from the same distribution — can be rejected. This suggests that there is statistically significant drift in the feature distribution. The larger the KS statistic, the greater the difference between the two distributions. In practical monitoring, you would flag features with low p-values for further investigation or action.
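In a monitoring job, this interpretation might look like the following sketch. The feature dictionaries, function name, and threshold are illustrative assumptions, not part of any specific monitoring library:

```python
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.05  # illustrative significance threshold

def drifted_features(reference, current, alpha=ALPHA):
    """Flag continuous features whose KS p-value falls below alpha,
    returning their KS statistic and p-value for investigation."""
    flagged = {}
    for name in reference:
        stat, p = ks_2samp(reference[name], current[name])
        if p < alpha:
            flagged[name] = {"ks_stat": stat, "p_value": p}
    return flagged

# Synthetic example: "age" drifts, "income" does not
rng = np.random.default_rng(7)
ref = {"age": rng.normal(40, 10, 2000), "income": rng.normal(50, 5, 2000)}
cur = {"age": rng.normal(45, 10, 2000), "income": rng.normal(50, 5, 2000)}
print(drifted_features(ref, cur))
```

A real pipeline would typically also log the KS statistic for flagged features, since it indicates how large the distributional change is, not just whether it is significant.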

Task


You are given two numerical feature samples representing data distributions at different times: sample_ref (reference) and sample_new (current).

Your task is to perform a Kolmogorov–Smirnov (KS) test to check if the new data distribution significantly differs from the reference.

Steps:

  1. Import the required function from scipy.stats.
  2. Compute the KS statistic and p-value using the two samples.
  3. Compare the p-value to alpha = 0.05.
  4. If p_value < alpha, mark drift_detected = True; otherwise False.
  5. Print the KS statistic, p-value, and drift status.
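The steps above can be sketched as follows. The synthetic `sample_ref` and `sample_new` arrays here stand in for the arrays the exercise provides:

```python
import numpy as np
from scipy.stats import ks_2samp  # step 1: import from scipy.stats

# Stand-ins for the samples provided by the exercise
rng = np.random.default_rng(1)
sample_ref = rng.normal(0.0, 1.0, 1000)
sample_new = rng.normal(0.8, 1.0, 1000)

# Step 2: compute the KS statistic and p-value
ks_stat, p_value = ks_2samp(sample_ref, sample_new)

# Steps 3-4: compare against the significance level
alpha = 0.05
drift_detected = p_value < alpha

# Step 5: report the results
print("KS Statistic:", ks_stat)
print("P-value:", p_value)
print("Drift detected:", drift_detected)
```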

Solution


Section 2. Chapter 1
