Population Stability Index (PSI)
The Population Stability Index (PSI) is a widely used metric for quantifying how much a variable's distribution has shifted between two samples—typically a baseline ("expected") and a new ("actual") population. PSI is commonly used in credit scoring and model monitoring to detect feature drift that could impact model performance.
How PSI Works
- Binning:
  - Divide the feature values in both samples into discrete intervals (bins).
  - Bins can be defined by quantiles, equal widths, or domain knowledge.
  - Always use the same bin edges for both samples to ensure a fair comparison.
- Comparison:
  - For each bin, calculate the proportion of observations in the expected and actual samples.
  - For each bin, compute: (expected proportion - actual proportion) * ln(expected proportion / actual proportion)
- Summation:
  - Sum the result across all bins to get the total PSI for that feature.
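To make the three steps concrete, here is a small worked sketch. The three bin proportions are made-up values chosen for illustration (they are not from any real dataset), and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical bin proportions for three bins; each array sums to 1.
expected_props = np.array([0.2, 0.5, 0.3])
actual_props = np.array([0.1, 0.5, 0.4])

# Per-bin contribution: (expected - actual) * ln(expected / actual)
per_bin = (expected_props - actual_props) * np.log(expected_props / actual_props)

# Total PSI is the sum of the per-bin contributions.
total_psi = per_bin.sum()

print("Per-bin contributions:", per_bin)
print("Total PSI:", total_psi)  # roughly 0.098
```

Each contribution is non-negative, because the difference and the logarithm always share the same sign. The middle bin contributes nothing since its proportion did not change.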
How to Interpret PSI Values
- PSI < 0.1: no significant change.
- PSI 0.1–0.25: moderate drift.
- PSI > 0.25: significant drift, potentially requiring model review or retraining.
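These thresholds can be encoded in a small helper. The sketch below is only one possible way to do it: the function name `interpret_psi` and the label strings are hypothetical, and the boundary values 0.1 and 0.25 are assigned to the "moderate" band as an assumption.

```python
def interpret_psi(psi):
    """Map a PSI value to the rule-of-thumb bands listed above."""
    if psi < 0.1:
        return "no significant change"
    if psi <= 0.25:
        return "moderate drift"
    return "significant drift"


# Example values chosen for illustration only.
for value in (0.03, 0.18, 0.40):
    print(value, "->", interpret_psi(value))
```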
Use PSI to monitor your features and quickly identify when a variable's distribution changes enough to affect your model's reliability.
```python
import numpy as np


def calculate_psi(expected, actual, bins=10):
    # Bin both distributions using the same bin edges
    bin_edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    expected_counts, _ = np.histogram(expected, bins=bin_edges)
    actual_counts, _ = np.histogram(actual, bins=bin_edges)

    # Convert counts to proportions, add small value to avoid division by zero
    expected_percents = expected_counts / len(expected) + 1e-8
    actual_percents = actual_counts / len(actual) + 1e-8

    # Calculate PSI for each bin
    psi_values = (expected_percents - actual_percents) * np.log(expected_percents / actual_percents)
    psi = np.sum(psi_values)
    return psi


# Create synthetic data: expected (baseline) and actual (drifted)
np.random.seed(0)
expected_data = np.random.normal(loc=0, scale=1, size=1000)
actual_data = np.random.normal(loc=0.5, scale=1.2, size=1000)

psi_value = calculate_psi(expected_data, actual_data, bins=10)
print("PSI for feature:", psi_value)
```
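The example above uses equal-width bins. Since bins can also be defined by quantiles, here is a hedged sketch of a quantile-based variant. The function name `calculate_psi_quantile` and the `eps` parameter are hypothetical, and taking the cut points from the baseline sample only is an assumption rather than something the lesson prescribes.

```python
import numpy as np


def calculate_psi_quantile(expected, actual, bins=10, eps=1e-8):
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Interior cut points (bins - 1 of them) taken from the baseline only.
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])

    # searchsorted maps each value to a bin index in 0..bins-1; values outside
    # the baseline range land in the first or last bin instead of being dropped.
    expected_counts = np.bincount(np.searchsorted(cuts, expected, side="right"), minlength=bins)
    actual_counts = np.bincount(np.searchsorted(cuts, actual, side="right"), minlength=bins)

    # Convert counts to proportions; eps guards against empty bins.
    expected_props = expected_counts / expected.size + eps
    actual_props = actual_counts / actual.size + eps

    return float(np.sum((expected_props - actual_props) * np.log(expected_props / actual_props)))


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 1000)
drifted = rng.normal(0.5, 1.2, 1000)

print("Same sample:", calculate_psi_quantile(baseline, baseline))
print("Drifted sample:", calculate_psi_quantile(baseline, drifted))
```

With quantile bins, each baseline bin holds roughly the same share of observations, which avoids nearly empty bins in the tails of the baseline.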
You can extend PSI calculation to multiple features by computing the PSI score for each feature individually, comparing their distributions between the baseline and new datasets. This allows you to monitor drift across your entire feature set. Once you have the PSI scores for all features, you can interpret them using the established thresholds: features with PSI above 0.25 may have changed enough to impact your model, while those between 0.1 and 0.25 should be watched for moderate drift. Features below 0.1 are considered stable. This approach helps you quickly identify which parts of your data pipeline may require further investigation or model updates.
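A per-feature report along these lines might look like the following sketch. It assumes both DataFrames share the same numeric columns and contain no missing values. The names `psi`, `psi_report`, and the column names `age` and `income` are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def psi(expected, actual, bins=10, eps=1e-8):
    # Shared equal-width bin edges spanning both samples.
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    e = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((e - a) * np.log(e / a)))


def psi_report(baseline_df, new_df, bins=10):
    rows = []
    for col in baseline_df.columns:
        value = psi(baseline_df[col].to_numpy(), new_df[col].to_numpy(), bins=bins)
        # Apply the thresholds described in the text.
        if value < 0.1:
            status = "stable"
        elif value <= 0.25:
            status = "moderate drift"
        else:
            status = "significant drift"
        rows.append({"feature": col, "psi": value, "status": status})
    # Most drifted features first.
    return pd.DataFrame(rows).sort_values("psi", ascending=False, ignore_index=True)


# Synthetic example: "income" is shifted, "age" is not.
rng = np.random.default_rng(0)
baseline_df = pd.DataFrame({"age": rng.normal(40, 10, 2000), "income": rng.normal(50, 15, 2000)})
new_df = pd.DataFrame({"age": rng.normal(40, 10, 2000), "income": rng.normal(65, 15, 2000)})

report = psi_report(baseline_df, new_df)
print(report)
```

Sorting by PSI puts the features that most need attention at the top of the report.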
Task
You are given two numerical feature samples representing a reference and a new dataset. Your task is to compute the Population Stability Index (PSI) to measure how much the distribution of the new data has shifted from the reference.
Steps:
- Create equally spaced bins (10 by default).
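- Compute the bin proportions for each dataset, for example by dividing the counts from np.histogram(...) by the sample size. Note that density=True returns a probability density rather than proportions, so the two agree only when every bin has width 1.
- For each bin, calculate: $\mathrm{PSI}_i = (p_i - q_i) \times \ln\left(\frac{p_i}{q_i}\right)$, where $p_i$ and $q_i$ are the proportions of the two samples in bin $i$. The expression is symmetric in $p_i$ and $q_i$, so which sample is called $p$ does not change the result.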
- The final PSI is the sum of all per-bin values.
- Handle edge cases where probabilities are zero (use a small epsilon = 1e-6).
- Print both the bin-wise PSI values and the total PSI score.
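One possible way to approach these steps is sketched below. It is not the course's reference solution. The function name `psi_per_bin`, the synthetic samples, and the choice to replace only zero proportions with epsilon (rather than adding epsilon to every bin) are assumptions.

```python
import numpy as np


def psi_per_bin(reference, new, bins=10, eps=1e-6):
    reference = np.asarray(reference, dtype=float)
    new = np.asarray(new, dtype=float)

    # Equally spaced bin edges spanning both samples.
    low = min(reference.min(), new.min())
    high = max(reference.max(), new.max())
    edges = np.linspace(low, high, bins + 1)

    # Proportions are counts divided by the sample size.
    p = np.histogram(reference, bins=edges)[0] / reference.size
    q = np.histogram(new, bins=edges)[0] / new.size

    # Replace zero proportions with eps so the logarithm stays finite.
    p = np.where(p == 0, eps, p)
    q = np.where(q == 0, eps, q)

    per_bin = (p - q) * np.log(p / q)
    return per_bin, per_bin.sum()


# Synthetic samples for illustration only.
rng = np.random.default_rng(42)
reference = rng.normal(0, 1, 1000)
new = rng.normal(0.3, 1, 1000)

per_bin, total = psi_per_bin(reference, new)
print("Bin-wise PSI:", np.round(per_bin, 4))
print("Total PSI:", round(float(total), 4))
```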