Classical Statistical Approaches
Classical statistical approaches provide foundational techniques for detecting outliers by leveraging the mathematical properties of data distributions. You will explore three essential methods: Z-score, interquartile range (IQR), and Mahalanobis distance. Each method offers a different perspective, suited to various data types and structures.
The Z-score measures how many standard deviations a data point is from the mean. For a data point x, the Z-score is calculated as:
Z = (x − μ) / σ
where μ is the sample mean and σ is the sample standard deviation. A large absolute Z-score indicates that a point is far from the mean, making it a potential outlier.
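As a quick sanity check of the formula, the manual computation can be compared against a library implementation. This is a minimal sketch that assumes SciPy is available; the sample values are illustrative. Note that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, matching `np.std`.

```python
import numpy as np
from scipy import stats

# Illustrative sample with one suspiciously large value
data = np.array([10.0, 12.0, 11.0, 13.0, 50.0, 12.0])

# Manual Z-score, exactly as in the formula above
manual = (data - data.mean()) / data.std()

# scipy's zscore defaults to ddof=0, so the two computations agree
library = stats.zscore(data)
print(np.allclose(manual, library))  # True
```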
The IQR is based on percentiles and is robust to non-normal distributions. It is defined as the range between the 75th percentile (Q3) and the 25th percentile (Q1):
IQR = Q3 − Q1
Points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers.
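To make the rule concrete, here is a minimal sketch on a tiny made-up sample; the quartiles come from NumPy's default linear interpolation, so they may differ slightly from hand-computed textbook quartiles.

```python
import numpy as np

# Illustrative sample with one clearly extreme value
data = np.array([7, 9, 10, 10, 11, 12, 30])

# Quartiles and the 1.5 * IQR fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [30]
```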
Mahalanobis distance extends the concept of distance to multivariate data, accounting for the covariance between features. For a vector x and data with mean μ and covariance matrix S, the Mahalanobis distance is:
D_M(x) = √((x − μ)ᵀ S⁻¹ (x − μ))
A larger Mahalanobis distance indicates that a point is far from the mean in the context of the data's spread and correlation structure.
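The quadratic form above can be verified against SciPy's built-in implementation, which takes the inverse covariance matrix directly. A minimal sketch, assuming SciPy is available and using randomly generated data for illustration:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # 200 samples, 3 features

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

x = X[0]
# Manual quadratic form, matching the formula above
d_manual = np.sqrt((x - mean) @ inv_cov @ (x - mean))
# scipy's implementation takes the inverse covariance as its third argument
d_scipy = distance.mahalanobis(x, mean, inv_cov)
print(np.isclose(d_manual, d_scipy))  # True
```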
Assumptions and Limitations:
- Z-score assumes the data is normally distributed; it can be misleading for skewed or heavy-tailed distributions;
- IQR is robust to outliers and non-normality but may miss extreme values in small samples or multi-modal data;
- Mahalanobis distance assumes multivariate normality and requires reliable estimation of the covariance matrix; strongly correlated (near-collinear) features make the covariance matrix ill-conditioned, and the estimate can be unstable when the number of features approaches the number of samples.
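The multivariate-normality assumption also suggests a principled cutoff: under it, squared Mahalanobis distances approximately follow a chi-square distribution with p degrees of freedom (p = number of features). A minimal sketch, where the data are synthetic and the 97.5% quantile is an illustrative choice of threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))          # roughly multivariate-normal data
X = np.vstack([X, [[8.0, 8.0]]])       # inject one clear outlier at index 500

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.sum(diff @ inv_cov * diff, axis=1)  # squared Mahalanobis distances

# Under multivariate normality, d2 ~ chi-square with p = 2 degrees of freedom
cutoff = stats.chi2.ppf(0.975, df=X.shape[1])
outlier_idx = np.where(d2 > cutoff)[0]
print(outlier_idx)
```

By construction, roughly 2.5% of genuinely normal points will also exceed this cutoff, so the threshold trades off false positives against sensitivity.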
```python
import numpy as np

# 1D data array
data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])

# Z-score calculation
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std

# Flag outliers with Z-score > 3 or < -3
z_outliers = np.where(np.abs(z_scores) > 3)[0]

# IQR calculation
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = np.where((data < lower_bound) | (data > upper_bound))[0]

print("Z-score outlier indices:", z_outliers)
print("IQR outlier indices:", iqr_outliers)
```
```python
import numpy as np

# 2D data array (each row is a sample)
X = np.array([
    [2, 3],
    [3, 5],
    [4, 4],
    [5, 7],
    [100, 200]  # outlier
])

# Calculate mean and covariance
mean = np.mean(X, axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)

# Mahalanobis distance calculation for each point
diff = X - mean
mdist = np.sqrt(np.sum(diff @ inv_cov * diff, axis=1))

print("Mahalanobis distances:", mdist)
print("Outlier likely at index with highest distance:", np.argmax(mdist))
```