Classical Statistical Approaches | Statistical and Distance-Based Methods
Outlier and Novelty Detection in Python


Classical statistical approaches provide foundational techniques for detecting outliers by leveraging the mathematical properties of data distributions. You will explore three essential methods: Z-score, interquartile range (IQR), and Mahalanobis distance. Each method offers a different perspective, suited to various data types and structures.

The Z-score measures how many standard deviations a data point is from the mean. For a data point $x$, the Z-score is calculated as:

$$Z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the sample mean and $\sigma$ is the standard deviation. A large absolute Z-score indicates that a point lies far from the mean and is a potential outlier.

The IQR is based on percentiles and is robust to non-normal distributions. It is defined as the range between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$):

$$\text{IQR} = Q_3 - Q_1$$

Points that fall below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ are flagged as outliers.

Mahalanobis distance extends the concept of distance to multivariate data, accounting for the covariance between features. For a vector $x$ and data with mean $\mu$ and covariance matrix $S$, the Mahalanobis distance is:

$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$

A larger Mahalanobis distance indicates that a point is far from the mean in the context of the data's spread and correlation structure.

Note

Assumptions and Limitations:

  • Z-score assumes the data is normally distributed; it can be misleading for skewed or heavy-tailed distributions;
  • IQR is robust to outliers and non-normality but may miss extreme values in small samples or multi-modal data;
  • Mahalanobis distance assumes multivariate normality and requires a reliable estimate of the covariance matrix; that estimate becomes unstable (nearly singular) when features are highly correlated or when the number of features approaches the number of samples.
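When the Z-score's normality assumption is doubtful, a common robust variant (not covered above, sketched here for illustration) is the modified Z-score, which replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The 0.6745 scale factor and the 3.5 cutoff are the conventional choices for this statistic, not values from this chapter:

```python
import numpy as np

# Same 1D data as the Z-score example below
data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])

median = np.median(data)
mad = np.median(np.abs(data - median))  # median absolute deviation

# 0.6745 rescales the MAD so it is comparable to a standard deviation
# under normality; 3.5 is the conventional cutoff for this statistic
modified_z = 0.6745 * (data - median) / mad
outliers = np.where(np.abs(modified_z) > 3.5)[0]

print("Modified Z-score outlier indices:", outliers)
```

Because the median and MAD are barely affected by a single extreme value, the point at index 7 is flagged cleanly even though it inflates the ordinary mean and standard deviation.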
import numpy as np

# 1D data array
data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])

# Z-score calculation
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std

# Flag outliers with Z-score > 3 or < -3
z_outliers = np.where(np.abs(z_scores) > 3)[0]

# IQR calculation
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = np.where((data < lower_bound) | (data > upper_bound))[0]

print("Z-score outlier indices:", z_outliers)
print("IQR outlier indices:", iqr_outliers)
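As a side note, if SciPy is installed, the manual Z-score arithmetic can be delegated to `scipy.stats.zscore`, which computes the same $(x - \mu)/\sigma$ transformation in one call. This alternative is a sketch, not part of the course example:

```python
import numpy as np
from scipy import stats

data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])

# stats.zscore uses the population standard deviation (ddof=0) by
# default, matching the manual (data - mean) / std computation
z_scores = stats.zscore(data)

z_outliers = np.where(np.abs(z_scores) > 3)[0]
print("Z-score outlier indices:", z_outliers)
```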
import numpy as np

# 2D data array (each row is a sample)
X = np.array([
    [2, 3],
    [3, 5],
    [4, 4],
    [5, 7],
    [100, 200]  # outlier
])

# Calculate mean and covariance
mean = np.mean(X, axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)

# Mahalanobis distance calculation for each point
diff = X - mean
mdist = np.sqrt(np.sum(diff @ inv_cov * diff, axis=1))

print("Mahalanobis distances:", mdist)
print("Outlier likely at index with highest distance:", np.argmax(mdist))
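One detail the example leaves implicit is how to turn Mahalanobis distances into a yes/no decision. Under multivariate normality, squared Mahalanobis distances follow a chi-squared distribution with degrees of freedom equal to the number of features, so a quantile of that distribution gives a principled cutoff. The sketch below assumes SciPy is available and uses larger synthetic correlated data, since the 5-point array above is too small for this test to be meaningful:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# 200 correlated bivariate normal samples plus one planted outlier
# that sits far off the correlation axis
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov_true, size=200)
X = np.vstack([X, [4.0, -4.0]])

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
sq_mdist = np.sum(diff @ inv_cov * diff, axis=1)  # squared distances

# The 97.5% chi-squared quantile with df = number of features is a
# principled threshold (about 2.5% false positives expected)
threshold = chi2.ppf(0.975, df=X.shape[1])
outliers = np.where(sq_mdist > threshold)[0]

print("Chi-squared threshold:", threshold)
print("Flagged indices:", outliers)
```

The planted point at index 200 stands out because it violates the positive correlation between the two features, even though each of its coordinates alone is only moderately extreme.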

Which method is most appropriate for detecting outliers in a dataset with two highly correlated features and why?



Section 2. Chapter 1
