Intuition for Covariance-Based Detection
Understanding how covariance matrices shape the detection of outliers is crucial for interpreting many statistical anomaly detection methods. In two-dimensional data, the covariance matrix not only determines the spread of the data but also the orientation of the regions considered "normal." You can think of the covariance matrix as defining an ellipse around the mean of your data: the size and tilt of this ellipse reflect both the variances of each feature and how those features move together. When the covariance between two features is high, the ellipse stretches diagonally, showing that changes in one feature are associated with changes in the other. If the covariance is zero, the ellipse aligns with the axes, and each feature varies independently. Outliers are then identified as points that fall far outside this ellipse, indicating they do not follow the same pattern as most of the data.
```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cov_ellipse(cov, mean, ax, n_std=2.0, **kwargs):
    from matplotlib.patches import Ellipse
    import matplotlib.transforms as transforms

    # Pearson correlation determines the shape of the unit ellipse.
    pearson = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    ell_radius_x = np.sqrt(1 + pearson)
    ell_radius_y = np.sqrt(1 - pearson)
    ellipse = Ellipse((0, 0),
                      width=ell_radius_x * 2,
                      height=ell_radius_y * 2,
                      facecolor='none', **kwargs)

    # Scale by the per-axis standard deviations, then move to the mean.
    scale_x = np.sqrt(cov[0, 0]) * n_std
    scale_y = np.sqrt(cov[1, 1]) * n_std
    transf = (transforms.Affine2D()
              .rotate_deg(45 if pearson != 0 else 0)
              .scale(scale_x, scale_y)
              .translate(mean[0], mean[1]))
    ellipse.set_transform(transf + ax.transData)
    return ax.add_patch(ellipse)

np.random.seed(0)
mean = [0, 0]
covariances = [
    np.array([[3, 0], [0, 1]]),       # Axis-aligned, more spread in x
    np.array([[1, 0.8], [0.8, 1]]),   # Tilted, strong positive correlation
    np.array([[1, -0.8], [-0.8, 1]])  # Tilted, strong negative correlation
]

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
titles = ["Axis-aligned", "Positive correlation", "Negative correlation"]

for ax, cov, title in zip(axs, covariances, titles):
    data = np.random.multivariate_normal(mean, cov, 500)
    ax.scatter(data[:, 0], data[:, 1], alpha=0.3)
    plot_cov_ellipse(cov, mean, ax, n_std=2, edgecolor='red')
    ax.set_title(title)
    ax.set_xlim(-6, 6)
    ax.set_ylim(-6, 6)
    ax.set_aspect('equal')

plt.tight_layout()
plt.show()
```
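The tilt of these ellipses can also be made precise: the ellipse axes point along the eigenvectors of the covariance matrix, with axis lengths proportional to the square roots of the eigenvalues. A short sketch (not part of the lesson's plotting code) checks this for the positive-correlation matrix used above:

```python
import numpy as np

# The ellipse axes are the eigenvectors of the covariance matrix,
# and the axis lengths scale with the square roots of the eigenvalues.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# For this matrix the eigenvalues are 1 - 0.8 = 0.2 and 1 + 0.8 = 1.8,
# and the major axis lies along the diagonal (1, 1)/sqrt(2), i.e. tilted
# 45 degrees, matching the "positive correlation" panel.
major_axis = eigvecs[:, -1]
angle = np.degrees(np.arctan2(major_axis[1], major_axis[0]))
```

This is why a larger off-diagonal entry produces a more elongated, more steeply tilted ellipse: it widens the gap between the two eigenvalues.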
Outlier detection based on covariance measures how far a data point lies from the center while accounting for the direction-dependent spread encoded in the covariance matrix; this covariance-aware distance is known as the Mahalanobis distance. Points outside the ellipse are flagged as outliers because they are farther from the mean than expected, given the variance and correlation structure of the data. The more elongated the ellipse in a given direction, the more variation the model "expects" along that direction, so a point far out along the long axis is less likely to be flagged than an equally distant point along the short axis.
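To make this concrete, the distance implied by the ellipse picture is the squared Mahalanobis distance, d²(x) = (x − μ)ᵀ Σ⁻¹ (x − μ). A minimal NumPy sketch (my addition, not part of the lesson's code) of the thresholding step might look like:

```python
import numpy as np

# Sketch: flag outliers by squared Mahalanobis distance, the quantity
# the covariance ellipse encodes.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
data = rng.multivariate_normal([0, 0], true_cov, size=500)

# Estimate the center and covariance from the sample, then invert.
center = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

# Squared Mahalanobis distance of every point from the center:
# d2[i] = (x_i - mu)^T  Sigma^{-1}  (x_i - mu)
diff = data - center
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Points outside the 2-standard-deviation ellipse satisfy d2 > 2**2.
outliers = d2 > 2.0 ** 2
```

Note that a point two units from the mean along the short axis of the ellipse gets a much larger `d2` than one two units out along the long axis, which is exactly the direction-dependence described above.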