Kolmogorov–Smirnov (KS) Test
The Kolmogorov–Smirnov (KS) test is a non-parametric statistical method for checking if two samples come from the same distribution.
In feature drift detection, the KS test is used to compare the distributions of a continuous feature across two datasets:
- The reference dataset (historical or baseline data);
- The current dataset (new or recent data).
The KS test works by measuring the maximum distance between the empirical cumulative distribution functions (ECDFs) of the two samples. A large distance suggests that the underlying distributions have changed, which may indicate drift.
How the KS Test Works
- For each value in the combined dataset, calculate the proportion of samples in each group that are less than or equal to that value;
- Find the largest difference between these proportions — this is the KS statistic;
- Compare the KS statistic to a critical value, or use it to compute a p-value to determine if the difference is statistically significant.
If the KS statistic is large and the p-value is small, you have evidence of a change in the feature distribution.
```python
import numpy as np
from scipy.stats import ks_2samp

# Generate synthetic data for two periods
np.random.seed(42)
reference = np.random.normal(loc=0.0, scale=1.0, size=1000)
current = np.random.normal(loc=0.5, scale=1.0, size=1000)

# Apply the Kolmogorov–Smirnov test
ks_stat, p_value = ks_2samp(reference, current)

print("KS Statistic:", ks_stat)
print("P-value:", p_value)
```
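Under the hood, the steps listed above can be carried out by hand. A minimal sketch with two small made-up samples (the values of a and b are purely illustrative):

```python
import numpy as np

# Two tiny illustrative samples (made-up values)
a = np.sort(np.array([0.1, 0.4, 0.7, 1.2, 1.5]))
b = np.sort(np.array([0.3, 0.9, 1.1, 1.8, 2.0]))

# Pool every observed value and evaluate both ECDFs at each point;
# searchsorted with side="right" counts how many elements are <= x
pooled = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(a, pooled, side="right") / len(a)
ecdf_b = np.searchsorted(b, pooled, side="right") / len(b)

# The KS statistic is the largest vertical gap between the two ECDFs
ks_stat = np.max(np.abs(ecdf_a - ecdf_b))
print("KS statistic:", ks_stat)  # 0.4 for these samples
```

This matches what `scipy.stats.ks_2samp` computes for the statistic; the library additionally derives the p-value from it.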
The KS test is well suited for detecting drift in continuous features because it makes no assumptions about the underlying distributions. However, it is not appropriate for categorical or discrete features, and its sensitivity depends on sample size: with very large samples, even tiny, practically irrelevant differences can yield significant p-values. The test also assumes the two samples are independent, and ties (duplicate values) make the p-value only approximate.
When using the KS test for drift detection, you interpret the results by examining the p-value and the KS statistic. A small p-value (typically less than 0.05) indicates that the null hypothesis — that both samples come from the same distribution — can be rejected. This suggests that there is statistically significant drift in the feature distribution. The larger the KS statistic, the greater the difference between the two distributions. In practical monitoring, you would flag features with low p-values for further investigation or action.
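In a monitoring setting, this interpretation can be automated by looping over features and flagging those whose p-value falls below the threshold. A minimal sketch, where the feature names, sample sizes, and distribution parameters are all hypothetical:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical reference and current samples for two features
reference = {
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50_000, 8_000, 500),
}
current = {
    "age": rng.normal(40, 10, 500),            # same distribution
    "income": rng.normal(55_000, 8_000, 500),  # shifted mean -> drift
}

alpha = 0.05
for feature in reference:
    stat, p = ks_2samp(reference[feature], current[feature])
    status = "DRIFT" if p < alpha else "ok"
    print(f"{feature}: KS={stat:.3f}, p={p:.4f} -> {status}")
```

Features flagged as DRIFT would then be queued for further investigation, e.g. checking upstream data pipelines or retraining the model.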
Swipe to start coding
You are given two numerical feature samples representing data distributions at different times:
sample_ref (reference) and sample_new (current).
Your task is to perform a Kolmogorov–Smirnov (KS) test to check if the new data distribution significantly differs from the reference.
Steps:
- Import the required function from scipy.stats;
- Compute the KS statistic and p-value using the two samples;
- Compare the p-value to alpha = 0.05;
- If p_value < alpha, set drift_detected = True; otherwise False;
- Print the KS statistic, p-value, and drift status.
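One possible solution sketch for these steps. In the exercise, sample_ref and sample_new are provided; here they are filled with synthetic stand-ins so the snippet runs on its own:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for the provided samples (illustrative only)
rng = np.random.default_rng(7)
sample_ref = rng.normal(0.0, 1.0, 800)
sample_new = rng.normal(0.3, 1.0, 800)

# Compute the KS statistic and p-value from the two samples
ks_stat, p_value = ks_2samp(sample_ref, sample_new)

# Flag drift when the p-value falls below the significance level
alpha = 0.05
drift_detected = p_value < alpha

print("KS Statistic:", ks_stat)
print("P-value:", p_value)
print("Drift detected:", drift_detected)
```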