Learn Population Stability Index (PSI) | Statistical Drift Detection

The Population Stability Index (PSI) is a widely used metric for quantifying how much a variable's distribution has shifted between two samples—typically a baseline ("expected") and a new ("actual") population. PSI is commonly used in credit scoring and model monitoring to detect feature drift that could impact model performance.

How PSI Works

Binning:
- Divide the feature values in both samples into discrete intervals (bins).
- Bins can be defined by quantiles, equal widths, or domain knowledge.
- Always use the same bin edges for both samples to ensure a fair comparison.
Comparison:
- For each bin, calculate the proportion of observations in the expected and actual samples.
- For each bin, compute:
  (expected proportion - actual proportion) * ln(expected proportion / actual proportion)
Summation:
- Sum the result across all bins to get the total PSI for that feature.
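
To make the comparison and summation steps concrete, here is a minimal sketch that applies the per-bin formula to a three-bin feature. The proportions are hypothetical values chosen only for illustration; they do not come from any real dataset.

```python
import numpy as np

# Hypothetical bin proportions for a three-bin feature (illustrative values only)
expected_props = np.array([0.2, 0.5, 0.3])
actual_props = np.array([0.3, 0.4, 0.3])

# Per-bin contribution: (expected - actual) * ln(expected / actual)
per_bin = (expected_props - actual_props) * np.log(expected_props / actual_props)

# Total PSI is the sum of the per-bin contributions
total_psi = per_bin.sum()

print("Per-bin contributions:", per_bin)
print("Total PSI:", total_psi)
```

Each contribution is non-negative because the difference and the log ratio always share the same sign, so a bin whose proportion is unchanged (the third bin here) adds nothing to the total.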

How to Interpret PSI Values

PSI < 0.1: No significant change;
PSI 0.1–0.25: Moderate drift;
PSI > 0.25: Significant drift—potentially requiring model review or retraining.

Use PSI to monitor your features and quickly identify when a variable's distribution changes enough to affect your model's reliability.
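
The thresholds above can be wrapped in a small helper for use in monitoring reports. This is only a sketch: the function name `interpret_psi` and the label strings are assumptions made for illustration, and the cut-offs are the rule-of-thumb values listed above rather than statistical guarantees.

```python
def interpret_psi(psi):
    """Map a PSI value to the rule-of-thumb categories listed above."""
    if psi < 0.1:
        return "no significant change"
    elif psi <= 0.25:
        return "moderate drift"
    else:
        return "significant drift"

# Example values (hypothetical)
for value in (0.03, 0.18, 0.40):
    print(value, "->", interpret_psi(value))
```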


import numpy as np

def calculate_psi(expected, actual, bins=10):
    # Bin both distributions using the same bin edges
    bin_edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    expected_counts, _ = np.histogram(expected, bins=bin_edges)
    actual_counts, _ = np.histogram(actual, bins=bin_edges)
    # Convert counts to proportions, add small value to avoid division by zero
    expected_percents = expected_counts / len(expected) + 1e-8
    actual_percents = actual_counts / len(actual) + 1e-8
    # Calculate PSI for each bin
    psi_values = (expected_percents - actual_percents) * np.log(expected_percents / actual_percents)
    psi = np.sum(psi_values)
    return psi

# Create synthetic data: expected (baseline) and actual (drifted)
np.random.seed(0)
expected_data = np.random.normal(loc=0, scale=1, size=1000)
actual_data = np.random.normal(loc=0.5, scale=1.2, size=1000)

psi_value = calculate_psi(expected_data, actual_data, bins=10)
print("PSI for feature:", psi_value)
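
The function above uses equal-width bins. The quantile option mentioned in the binning step can be sketched as follows. The name `calculate_psi_quantile` and the `eps` parameter are assumptions for illustration. The cut points are taken from the baseline sample only, and the outermost bins are left open-ended so that new values outside the baseline range are still counted.

```python
import numpy as np

def calculate_psi_quantile(expected, actual, bins=10, eps=1e-8):
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cut points taken from the baseline quantiles
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])
    # Drop duplicate cut points, which can occur with heavily tied data
    cuts = np.unique(cuts)
    n_bins = len(cuts) + 1
    # Assign each value to a bin index in 0..len(cuts); outer bins are open-ended
    expected_idx = np.searchsorted(cuts, expected, side="right")
    actual_idx = np.searchsorted(cuts, actual, side="right")
    # Convert counts to proportions, adding eps to avoid log(0) and division by zero
    expected_props = np.bincount(expected_idx, minlength=n_bins) / len(expected) + eps
    actual_props = np.bincount(actual_idx, minlength=n_bins) / len(actual) + eps
    return np.sum((expected_props - actual_props) * np.log(expected_props / actual_props))

# Synthetic check: identical samples versus a shifted sample
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)
shifted = rng.normal(loc=1.0, scale=1.0, size=1000)

psi_same = calculate_psi_quantile(baseline, baseline)
psi_shifted = calculate_psi_quantile(baseline, shifted)
print("PSI (same sample):", psi_same)
print("PSI (shifted sample):", psi_shifted)
```

Quantile bins give each baseline bin roughly the same share of observations, which avoids the nearly empty tail bins that equal-width binning can produce on skewed features.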

You can extend PSI calculation to multiple features by computing the PSI score for each feature individually, comparing their distributions between the baseline and new datasets. This allows you to monitor drift across your entire feature set. Once you have the PSI scores for all features, you can interpret them using the established thresholds: features with PSI above 0.25 may have changed enough to impact your model, while those between 0.1 and 0.25 should be watched for moderate drift. Features below 0.1 are considered stable. This approach helps you quickly identify which parts of your data pipeline may require further investigation or model updates.
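
A minimal sketch of this per-feature loop is shown below, assuming both datasets are pandas DataFrames that share numeric column names. The helper `psi_by_feature` and the column names `stable` and `drifted` are hypothetical. `calculate_psi` is repeated from the example above so the block runs on its own.

```python
import numpy as np
import pandas as pd

def calculate_psi(expected, actual, bins=10):
    # Same equal-width PSI calculation as in the example above
    bin_edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    expected_counts, _ = np.histogram(expected, bins=bin_edges)
    actual_counts, _ = np.histogram(actual, bins=bin_edges)
    expected_percents = expected_counts / len(expected) + 1e-8
    actual_percents = actual_counts / len(actual) + 1e-8
    return np.sum((expected_percents - actual_percents)
                  * np.log(expected_percents / actual_percents))

def psi_by_feature(baseline_df, new_df, bins=10):
    # Compute PSI for every column present in both DataFrames
    shared = [col for col in baseline_df.columns if col in new_df.columns]
    scores = {
        col: calculate_psi(baseline_df[col].to_numpy(), new_df[col].to_numpy(), bins=bins)
        for col in shared
    }
    # Largest PSI first, so the most drifted features appear at the top
    return pd.Series(scores).sort_values(ascending=False)

# Synthetic data: one stable feature and one drifted feature
rng = np.random.default_rng(0)
baseline_df = pd.DataFrame({
    "stable": rng.normal(0, 1, 1000),
    "drifted": rng.normal(0, 1, 1000),
})
new_df = pd.DataFrame({
    "stable": rng.normal(0, 1, 1000),
    "drifted": rng.normal(1.0, 1, 1000),
})

report = psi_by_feature(baseline_df, new_df)
print(report)
```

Sorting the result puts the features most in need of review at the top of the report.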

Task

You are given two numerical feature samples representing a reference and a new dataset. Your task is to compute the Population Stability Index (PSI) to measure how much the distribution of the new data has shifted from the reference.

Steps:

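- Create equally spaced bins (10 by default), using the same bin edges for both datasets.
- Compute the per-bin proportions for each dataset: count the observations in each bin with np.histogram and divide by the sample size. Do not rely on density=True, which returns probability densities that integrate to 1 rather than proportions that sum to 1.
- For each bin, calculate: $\text{PSI}_i = (p_i - q_i) \times \ln\left(\frac{p_i}{q_i}\right)$, where $p_i$ and $q_i$ are the proportions of the two datasets in bin $i$.
- The final PSI is the sum of all per-bin values.
- Handle edge cases where a proportion is zero by adding a small epsilon (1e-6).
- Print both the bin-wise PSI values and the total PSI score.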
