Learn Other Statistical Methods | Statistical Drift Detection

When you need to detect drift in datasets, choosing the right statistical method is crucial. The Chi-Square test is commonly used for categorical features, as it measures whether the distribution of observed categories has changed significantly between two samples. In contrast, the Anderson–Darling test is particularly sensitive to differences in the tails of distributions and is well-suited for continuous features, especially when you suspect that drift may occur in the extremes rather than the center of the distribution.

Note

Use the Chi-Square test for categorical features, such as colors or product types, where you want to compare the frequency of each category. Choose the Anderson–Darling test for continuous features when you need sensitivity to changes in the tails of the distribution, such as rare but significant events.


import numpy as np
from scipy.stats import chi2_contingency

# Synthetic categorical data: observed frequencies for two time periods
# Categories: ['Red', 'Blue', 'Green']
sample1 = [30, 50, 20]   # Reference period counts
sample2 = [25, 55, 20]   # Current period counts

# Build contingency table
contingency_table = np.array([sample1, sample2])

# Apply Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("Chi-Square statistic:", chi2)
print("p-value:", p)
if p < 0.05:
    print("Significant drift detected in categorical feature.")
else:
    print("No significant drift detected in categorical feature.")
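
For continuous features, the same workflow can be sketched with the k-sample Anderson–Darling test from SciPy (`scipy.stats.anderson_ksamp`). The helper name `anderson_darling_drift`, the synthetic normal samples, and the 0.05 threshold are illustrative assumptions, not part of the original lesson. Note that SciPy's default p-value for this test is approximate and is bounded to the range 0.001 to 0.25, so treat it as a rough signal rather than an exact probability.

```python
import warnings

import numpy as np
from scipy.stats import anderson_ksamp


def anderson_darling_drift(reference, current, alpha=0.05):
    """Hypothetical helper: compare two continuous samples with the
    k-sample Anderson-Darling test and flag drift when p < alpha."""
    with warnings.catch_warnings():
        # SciPy warns when the approximate p-value hits its lower or
        # upper bound; silence that here to keep the output readable.
        warnings.simplefilter("ignore")
        result = anderson_ksamp([reference, current])
    # The result unpacks as (statistic, critical_values, p-value).
    statistic, p_value = float(result[0]), float(result[2])
    return statistic, p_value, bool(p_value < alpha)


# Synthetic continuous data (assumed for illustration):
# the current period has a wider spread, so its tails are heavier.
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=2000)  # Reference period
current = rng.normal(loc=0.0, scale=2.0, size=2000)    # Current period

stat, p, drifted = anderson_darling_drift(reference, current)

print("Anderson-Darling statistic:", stat)
print("Approximate p-value:", p)
if drifted:
    print("Significant drift detected in continuous feature.")
else:
    print("No significant drift detected in continuous feature.")
```

Because the mean is unchanged in this synthetic example, the difference between the two samples sits mostly in the spread and the tails, which is the kind of change this test is meant to pick up.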

While both the Chi-Square and Anderson–Darling tests are valuable for drift detection, their applicability depends on your data's characteristics. The Chi-Square test is limited to categorical data and requires a sufficient sample size in each category to produce reliable results. It compares category frequencies only, so it cannot capture subtle changes in the shape of a distribution. The Anderson–Darling test, on the other hand, is designed for continuous data and gives more weight to the tails than the Kolmogorov–Smirnov test does, which makes it more sensitive to drift in the extremes. However, it is not suitable for categorical variables and may be overly sensitive to outliers in small samples.
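
The sample-size caveat for the Chi-Square test can be checked directly from the expected frequencies that `chi2_contingency` returns. The sketch below is a minimal illustration: the helper name `expected_counts_ok` is hypothetical, and the threshold of 5 expected observations per cell is a widely used rule of thumb assumed here, not a requirement stated in the lesson.

```python
import numpy as np
from scipy.stats import chi2_contingency


def expected_counts_ok(contingency_table, min_expected=5):
    """Hypothetical helper: return whether every expected cell count
    reaches min_expected, along with the expected-count table."""
    table = np.asarray(contingency_table)
    _, _, _, expected = chi2_contingency(table)
    return bool((expected >= min_expected).all()), expected


# Counts from the example above: every category is well populated.
ok, expected = expected_counts_ok([[30, 50, 20], [25, 55, 20]])
print("Expected counts:\n", expected)
print("All expected counts >= 5:", ok)

# Assumed sparse variant: the third category is almost empty,
# so its expected counts fall below the threshold.
sparse_ok, sparse_expected = expected_counts_ok([[30, 50, 2], [25, 55, 1]])
print("Sparse table passes the check:", sparse_ok)
```

When the check fails, one common option is to merge rare categories into an "Other" bucket before running the test.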

Section 2. Chapter 3
