Classical Statistical Approaches
Classical statistical approaches provide foundational techniques for detecting outliers by leveraging the mathematical properties of data distributions. You will explore three essential methods: Z-score, interquartile range (IQR), and Mahalanobis distance. Each method offers a different perspective, suited to various data types and structures.
The Z-score measures how many standard deviations a data point is from the mean. For a data point x, the Z-score is calculated as:
Z = (x − μ) / σ
where μ is the sample mean and σ is the sample standard deviation. A large absolute Z-score indicates that a point is far from the mean, making it a potential outlier.
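As a quick sanity check of the formula, the manual computation can be compared against a library implementation. This is a minimal sketch that assumes SciPy is available; the sample values are illustrative. Note that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, matching `np.std`.

```python
import numpy as np
from scipy import stats

# Illustrative sample with one suspiciously large value
data = np.array([10.0, 12.0, 11.0, 13.0, 50.0, 12.0])

# Manual Z-score, exactly as in the formula above
manual = (data - data.mean()) / data.std()

# scipy's zscore defaults to ddof=0, so the two computations agree
library = stats.zscore(data)
print(np.allclose(manual, library))  # True
```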
The IQR is based on percentiles and is robust to non-normal distributions. It is defined as the range between the 75th percentile (Q3) and the 25th percentile (Q1):
IQR = Q3 − Q1
Points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers.
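To make the rule concrete, here is a minimal sketch on a tiny made-up sample; the quartiles come from NumPy's default linear interpolation, so they may differ slightly from hand-computed textbook quartiles.

```python
import numpy as np

# Illustrative sample with one clearly extreme value
data = np.array([7, 9, 10, 10, 11, 12, 30])

# Quartiles and the 1.5 * IQR fences
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [30]
```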
Mahalanobis distance extends the concept of distance to multivariate data, accounting for the covariance between features. For a vector x and data with mean μ and covariance matrix S, the Mahalanobis distance is:
D_M(x) = √((x − μ)ᵀ S⁻¹ (x − μ))
A larger Mahalanobis distance indicates that a point is far from the mean in the context of the data's spread and correlation structure.
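The quadratic form above can be verified against SciPy's built-in implementation, which takes the inverse covariance matrix directly. A minimal sketch, assuming SciPy is available and using randomly generated data for illustration:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # 200 samples, 3 features

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

x = X[0]
# Manual quadratic form, matching the formula above
d_manual = np.sqrt((x - mean) @ inv_cov @ (x - mean))
# scipy's implementation takes the inverse covariance as its third argument
d_scipy = distance.mahalanobis(x, mean, inv_cov)
print(np.isclose(d_manual, d_scipy))  # True
```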
Assumptions and Limitations:
- Z-score assumes the data is normally distributed; it can be misleading for skewed or heavy-tailed distributions;
- IQR is robust to outliers and non-normality but may miss extreme values in small samples or multi-modal data;
- Mahalanobis distance assumes multivariate normality and requires reliable estimation of the covariance matrix; strongly correlated (near-collinear) features make the covariance matrix ill-conditioned, and the estimate can be unstable when the number of features approaches the number of samples.
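The multivariate-normality assumption also suggests a principled cutoff: under it, squared Mahalanobis distances approximately follow a chi-square distribution with p degrees of freedom (p = number of features). A minimal sketch, where the data are synthetic and the 97.5% quantile is an illustrative choice of threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))          # roughly multivariate-normal data
X = np.vstack([X, [[8.0, 8.0]]])       # inject one clear outlier at index 500

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.sum(diff @ inv_cov * diff, axis=1)  # squared Mahalanobis distances

# Under multivariate normality, d2 ~ chi-square with p = 2 degrees of freedom
cutoff = stats.chi2.ppf(0.975, df=X.shape[1])
outlier_idx = np.where(d2 > cutoff)[0]
print(outlier_idx)
```

By construction, roughly 2.5% of genuinely normal points will also exceed this cutoff, so the threshold trades off false positives against sensitivity.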
```python
import numpy as np

# 1D data array
data = np.array([10, 12, 12, 13, 12, 14, 13, 100, 12, 11, 13, 10, 12])

# Z-score calculation
mean = np.mean(data)
std = np.std(data)
z_scores = (data - mean) / std

# Flag outliers with Z-score > 3 or < -3
z_outliers = np.where(np.abs(z_scores) > 3)[0]

# IQR calculation
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = np.where((data < lower_bound) | (data > upper_bound))[0]

print("Z-score outlier indices:", z_outliers)
print("IQR outlier indices:", iqr_outliers)
```
```python
import numpy as np

# 2D data array (each row is a sample)
X = np.array([
    [2, 3],
    [3, 5],
    [4, 4],
    [5, 7],
    [100, 200]  # outlier
])

# Calculate mean and covariance
mean = np.mean(X, axis=0)
cov = np.cov(X, rowvar=False)
inv_cov = np.linalg.inv(cov)

# Mahalanobis distance calculation for each point
diff = X - mean
mdist = np.sqrt(np.sum(diff @ inv_cov * diff, axis=1))

print("Mahalanobis distances:", mdist)
print("Outlier likely at index with highest distance:", np.argmax(mdist))
```