Classical Statistical Approaches
Classical statistical approaches provide foundational techniques for detecting outliers by leveraging the mathematical properties of data distributions. You will explore three essential methods: Z-score, interquartile range (IQR), and Mahalanobis distance. Each method offers a different perspective, suited to various data types and structures.
The Z-score measures how many standard deviations a data point is from the mean. For a data point x, the Z-score is calculated as:
Z = (x − μ) / σ
where μ is the sample mean and σ is the standard deviation. A large absolute Z-score indicates that a point is far from the mean, potentially an outlier.
The IQR is based on percentiles and is robust to non-normal distributions. It is defined as the range between the 75th percentile (Q3) and the 25th percentile (Q1):
IQR = Q3 − Q1
Points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers.
Mahalanobis distance extends the concept of distance to multivariate data, accounting for the covariance between features. For a vector x and data with mean μ and covariance matrix S, the Mahalanobis distance is:
D_M(x) = √((x − μ)ᵀ S⁻¹ (x − μ))
A larger Mahalanobis distance indicates that a point is far from the mean in the context of the data's spread and correlation structure.
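To turn Mahalanobis distances into a concrete decision rule, note that under multivariate normality the squared distance approximately follows a chi-squared distribution with degrees of freedom equal to the number of features, so a quantile of that distribution gives a principled cutoff. The sketch below assumes SciPy is available; the 0.975 quantile and the synthetic data are illustrative choices, not from this lesson:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # inliers from a standard normal
X = np.vstack([X, [10.0, 10.0]])       # one clear outlier at index 100

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.sum(diff @ inv_cov * diff, axis=1)   # squared Mahalanobis distances

# Under multivariate normality, d2 ~ chi-squared with df = number of features,
# so its 0.975 quantile serves as an outlier cutoff
cutoff = chi2.ppf(0.975, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
print("Chi-squared cutoff:", cutoff)
print("Flagged indices:", outliers)
```

Comparing squared distances against a chi-squared quantile is more defensible than simply taking the largest distance, since it adapts the threshold to the number of features.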
Assumptions and Limitations:
- Z-score assumes the data is normally distributed; it can be misleading for skewed or heavy-tailed distributions;
- IQR is robust to outliers and non-normality but may miss extreme values in small samples or multi-modal data;
- Mahalanobis distance assumes multivariate normality and requires a reliable estimate of the covariance matrix; that estimate can become near-singular, and the distances unstable, when features are highly correlated or the number of features approaches the number of samples.
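One common way to address the Z-score's sensitivity to non-normal data is the modified Z-score, a robust variant (not covered above) that replaces the mean and standard deviation with the median and the median absolute deviation (MAD). The 0.6745 scaling constant and the 3.5 threshold are the conventional rule-of-thumb values:

```python
import numpy as np

def modified_z_scores(data):
    """Robust Z-scores based on the median and the median absolute deviation."""
    data = np.asarray(data, dtype=float)
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    # 0.6745 scales the MAD to match the standard deviation for normal data
    return 0.6745 * (data - median) / mad

data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])
scores = modified_z_scores(data)

# A common rule of thumb flags |modified Z| > 3.5
outliers = np.where(np.abs(scores) > 3.5)[0]
print("Modified Z-score outlier indices:", outliers)  # index 7 is the value 100
```

Because the median and MAD are barely affected by the extreme value 100, the modified Z-score isolates it cleanly even in this small, skewed sample.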
```python
import numpy as np

# 1D data array
data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])

# Z-score calculation
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std

# Flag outliers with Z-score > 3 or < -3
z_outliers = np.where(np.abs(z_scores) > 3)[0]

# IQR calculation
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = np.where((data < lower_bound) | (data > upper_bound))[0]

print("Z-score outlier indices:", z_outliers)
print("IQR outlier indices:", iqr_outliers)
```
```python
import numpy as np

# 2D data array (each row is a sample)
X = np.array([
    [2, 3],
    [3, 5],
    [4, 4],
    [5, 7],
    [100, 200]  # outlier
])

# Calculate mean and covariance
mean = np.mean(X, axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)

# Mahalanobis distance calculation for each point
diff = X - mean
mdist = np.sqrt(np.sum(diff @ inv_cov * diff, axis=1))

print("Mahalanobis distances:", mdist)
print("Outlier likely at index with highest distance:", np.argmax(mdist))
```