Classical Statistical Approaches | Statistical and Distance-Based Methods
Outlier and Novelty Detection in Practice

Classical Statistical Approaches

Classical statistical approaches provide foundational techniques for detecting outliers by leveraging the mathematical properties of data distributions. You will explore three essential methods: Z-score, interquartile range (IQR), and Mahalanobis distance. Each method offers a different perspective, suited to various data types and structures.

The Z-score measures how many standard deviations a data point is from the mean. For a data point $x$, the Z-score is calculated as:

$$Z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the sample mean and $\sigma$ is the standard deviation. A large absolute Z-score indicates that a point is far from the mean and is potentially an outlier.
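As a quick sketch (the data values here are invented for illustration), the rule can be applied directly with NumPy. Note that `np.std` defaults to the population formula (`ddof=0`), so pass `ddof=1` if you want the sample standard deviation:

```python
import numpy as np

# Illustrative toy sample with one obvious outlier (60.0)
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 60.0])

mu = data.mean()
sigma = data.std(ddof=1)  # sample standard deviation

z = (data - mu) / sigma
flagged = np.where(np.abs(z) > 2)[0]
print("flagged indices:", flagged)
```

In small samples a single extreme value inflates $\sigma$ itself, so the common cutoff of 3 can fail to flag it (here $z \approx 2.26$ for the value 60). This masking effect is one reason robust rules such as the IQR fence are often preferred for small datasets.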

The IQR is based on percentiles and is robust to non-normal distributions. It is defined as the range between the 75th percentile ($Q_3$) and the 25th percentile ($Q_1$):

$$\text{IQR} = Q_3 - Q_1$$

Points that fall below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ are flagged as outliers.
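A minimal sketch of the fence computation (the data values are invented for illustration):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 60])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1              # 12.5 - 11.0 = 1.5
lower = q1 - 1.5 * iqr     # 8.75
upper = q3 + 1.5 * iqr     # 14.75
outliers = data[(data < lower) | (data > upper)]
print("outliers:", outliers)
```

Because $Q_1$ and $Q_3$ ignore the tails, the fences would barely move even if 60 were 600; this insensitivity to extreme values is exactly what makes the IQR rule robust.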

Mahalanobis distance extends the concept of distance to multivariate data, accounting for the covariance between features. For a vector $x$ and data with mean $\mu$ and covariance matrix $S$, the Mahalanobis distance is:

$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$

A larger Mahalanobis distance indicates that a point is far from the mean in the context of the data's spread and correlation structure.

Note

Assumptions and Limitations:

  • Z-score assumes the data is normally distributed; it can be misleading for skewed or heavy-tailed distributions;
  • IQR is robust to outliers and non-normality but may miss extreme values in small samples or multi-modal data;
  • Mahalanobis distance assumes multivariate normality and requires a reliable estimate of the covariance matrix; highly correlated (near-collinear) features make that matrix ill-conditioned and its inverse numerically unstable, and the estimate also degrades when the number of features approaches the number of samples.
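To illustrate the last point: when two features are nearly collinear, the sample covariance matrix is close to singular and `np.linalg.inv` becomes numerically fragile. One common workaround (a sketch, not the only option) is to add a small ridge term to the diagonal before inverting; `np.linalg.pinv` is another alternative:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
# Second feature is the first plus tiny noise: almost perfectly correlated
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=200)])

cov = np.cov(X, rowvar=False)
print("condition number:", np.linalg.cond(cov))  # huge: inversion is unstable

eps = 1e-6  # small ridge term; the value is a tuning choice
cov_reg = cov + eps * np.eye(cov.shape[0])
inv_cov = np.linalg.inv(cov_reg)  # well-conditioned inverse
```

The ridge term slightly biases the distances but keeps the inversion stable; in practice, shrinkage estimators of the covariance matrix serve the same purpose more systematically.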
```python
import numpy as np

# 1D data array
data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])

# Z-score calculation
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std

# Flag outliers with Z-score > 3 or < -3
z_outliers = np.where(np.abs(z_scores) > 3)[0]

# IQR calculation
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = np.where((data < lower_bound) | (data > upper_bound))[0]

print("Z-score outlier indices:", z_outliers)
print("IQR outlier indices:", iqr_outliers)
```
```python
import numpy as np

# 2D data array (each row is a sample)
X = np.array([
    [2, 3],
    [3, 5],
    [4, 4],
    [5, 7],
    [100, 200]  # outlier
])

# Calculate mean and covariance
mean = np.mean(X, axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)

# Mahalanobis distance calculation for each point
diff = X - mean
mdist = np.sqrt(np.sum(diff @ inv_cov * diff, axis=1))

print("Mahalanobis distances:", mdist)
print("Outlier likely at index with highest distance:", np.argmax(mdist))
```
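Under the multivariate-normality assumption, squared Mahalanobis distances approximately follow a chi-square distribution with $d$ degrees of freedom ($d$ = number of features), which gives a principled cutoff instead of simply picking the largest distance. For $d = 2$ the chi-square quantile has the closed form $-2\ln(1 - p)$; for general $d$, `scipy.stats.chi2.ppf` can be used. A sketch on synthetic data (the seed and the planted outlier are invented for illustration):

```python
import math
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(size=(50, 2)),   # 50 inliers
               [[10.0, 10.0]]])            # one planted outlier

mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)
diff = X - mean
d2 = np.sum(diff @ inv_cov * diff, axis=1)  # squared distances

# 97.5% chi-square quantile with 2 degrees of freedom
cutoff = -2 * math.log(1 - 0.975)           # about 7.38
outliers = np.where(d2 > cutoff)[0]
print("flagged indices:", outliers)
```

One caveat: because the outlier also inflates the mean and covariance estimates, the squared distance of any point is bounded by roughly $(n-1)^2/n$; with very few samples (as in the 5-point example above, where the bound is 3.2), even an extreme point cannot exceed the chi-square cutoff, so this threshold is most useful at reasonable sample sizes.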

Which method is most appropriate for detecting outliers in a dataset with two highly correlated features and why?



Section 2. Chapter 1

