Intuition for Covariance-Based Detection
Statistical and Distance-Based Methods | Outlier and Novelty Detection in Practice

Understanding how covariance matrices shape the detection of outliers is crucial for interpreting many statistical anomaly detection methods. In two-dimensional data, the covariance matrix not only determines the spread of the data but also the orientation of the regions considered "normal." You can think of the covariance matrix as defining an ellipse around the mean of your data: the size and tilt of this ellipse reflect both the variances of each feature and how those features move together. When the covariance between two features is high, the ellipse stretches diagonally, showing that changes in one feature are associated with changes in the other. If the covariance is zero, the ellipse aligns with the axes, and each feature varies independently. Outliers are then identified as points that fall far outside this ellipse, indicating they do not follow the same pattern as most of the data.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import matplotlib.transforms as transforms


def plot_cov_ellipse(cov, mean, ax, n_std=2.0, **kwargs):
    # Radii of the unit ellipse follow from the Pearson correlation
    pearson = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    ell_radius_x = np.sqrt(1 + pearson)
    ell_radius_y = np.sqrt(1 - pearson)
    ellipse = Ellipse((0, 0), width=ell_radius_x * 2, height=ell_radius_y * 2,
                      facecolor='none', **kwargs)
    # Scale by the standard deviations and shift to the data mean
    scale_x = np.sqrt(cov[0, 0]) * n_std
    scale_y = np.sqrt(cov[1, 1]) * n_std
    transf = (transforms.Affine2D()
              .rotate_deg(45 if pearson != 0 else 0)
              .scale(scale_x, scale_y)
              .translate(mean[0], mean[1]))
    ellipse.set_transform(transf + ax.transData)
    return ax.add_patch(ellipse)


np.random.seed(0)
mean = [0, 0]
covariances = [
    np.array([[3, 0], [0, 1]]),       # Axis-aligned, more spread in x
    np.array([[1, 0.8], [0.8, 1]]),   # Tilted, strong positive correlation
    np.array([[1, -0.8], [-0.8, 1]])  # Tilted, strong negative correlation
]

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
titles = ["Axis-aligned", "Positive correlation", "Negative correlation"]

for ax, cov, title in zip(axs, covariances, titles):
    data = np.random.multivariate_normal(mean, cov, 500)
    ax.scatter(data[:, 0], data[:, 1], alpha=0.3)
    plot_cov_ellipse(cov, mean, ax, n_std=2, edgecolor='red')
    ax.set_title(title)
    ax.set_xlim(-6, 6)
    ax.set_ylim(-6, 6)
    ax.set_aspect('equal')

plt.tight_layout()
plt.show()
```

Outlier detection based on covariance measures how far a data point lies from the center, accounting for the direction and spread encoded in the covariance matrix. Points outside the ellipse are flagged as outliers because they are farther from the mean than expected, given the variance and correlation structure of the data. The more elongated or tilted the ellipse, the more the method "expects" data to vary along that axis, so a point far out along the ellipse's long axis is less likely to be flagged than a point the same Euclidean distance away along the short axis.
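The distance described above is the Mahalanobis distance. As a rough sketch (not part of the original lesson), the following computes it with NumPy and flags points beyond a threshold; the variable names, the injected test point, and the cutoff of 3 are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])  # strong positive correlation
data = rng.multivariate_normal(np.zeros(2), cov, 500)

# Add one point that is near the mean in Euclidean terms
# but violates the correlation pattern (x up, y down)
data = np.vstack([data, [2.0, -2.0]])

center = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = data - center

# Mahalanobis distance: sqrt((x - mu)^T Sigma^{-1} (x - mu))
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

threshold = 3.0  # rule-of-thumb cutoff, chosen here for illustration
outliers = np.where(d > threshold)[0]
```

The appended point `[2, -2]` sits well inside the axis-aligned range of the data, yet its Mahalanobis distance is large because it runs against the positive correlation, which is exactly the behavior the ellipse picture predicts.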
