Robust Covariance and Gaussian Assumption
Robust covariance estimation is a foundational approach in outlier detection, particularly when you assume that your data roughly follows a multivariate Gaussian distribution. The central idea is to estimate the mean and covariance of your data in a way that is not unduly influenced by outliers. If you use the classic covariance calculation, even a few extreme points can distort the result, making it unreliable for identifying anomalies. The Elliptic Envelope algorithm addresses this by fitting an ellipse (in higher dimensions, an ellipsoid) to the central mass of the data, using robust statistics that reduce the impact of outliers. This fitted ellipse represents the region where most "normal" data points are expected to fall, based on the estimated mean and covariance. Points lying far outside this envelope are flagged as outliers, as they are unlikely under the assumed Gaussian model.
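The contrast between the classic and the robust estimate can be seen directly by fitting both to the same contaminated sample. The snippet below is a minimal sketch (the data, cluster parameters, and outlier values are illustrative assumptions, not from the lesson) comparing scikit-learn's EmpiricalCovariance with the Minimum Covariance Determinant (MCD) estimator that underlies the Elliptic Envelope:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

# Illustrative data: Gaussian inliers plus a few extreme points
rng = np.random.RandomState(0)
inliers = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=200)
outliers = rng.uniform(low=8, high=10, size=(10, 2))
X = np.vstack([inliers, outliers])

# Classic covariance is pulled toward the outliers...
emp = EmpiricalCovariance().fit(X)
# ...while the robust MCD estimator largely ignores them
mcd = MinCovDet(random_state=0).fit(X)

print("Empirical covariance:\n", emp.covariance_)
print("Robust (MCD) covariance:\n", mcd.covariance_)
```

The robust estimate stays close to the true covariance of the inliers, while the empirical one is visibly inflated by the ten extreme points, which is exactly why the Elliptic Envelope builds on the robust version.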
The Gaussian assumption is reasonable when your data is roughly symmetric, unimodal, and does not have heavy tails or strong skewness. Many natural and measurement processes produce data that is approximately Gaussian, especially after proper preprocessing. However, real-world data often deviates from the Gaussian ideal, either due to underlying structure or the presence of outliers. Robust covariance estimators, like those used in the Elliptic Envelope, are more resilient to these outliers, but their effectiveness still depends on the core data being roughly Gaussian. If the true distribution is far from Gaussian, the method may misclassify normal points as outliers or miss actual anomalies.
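Under the Gaussian assumption, "far outside the envelope" can be made precise: the squared Mahalanobis distance of a point to the robust center approximately follows a chi-square distribution with as many degrees of freedom as there are features. A short sketch (the 97.5% cutoff and the sample data are assumptions for illustration):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

# Illustrative data: a Gaussian cloud plus two planted outliers
rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=300)
X = np.vstack([X, [[6, 6], [-7, 5]]])

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)  # squared Mahalanobis distances to the robust center

# For 2D Gaussian data, d2 is roughly chi-square with df=2
cutoff = chi2.ppf(0.975, df=2)
outlier_mask = d2 > cutoff
print("Flagged", outlier_mask.sum(), "points beyond the 97.5% contour")
```

If the core data really is close to Gaussian, about 2.5% of inliers will land beyond this cutoff by chance; if the distribution is skewed or heavy-tailed, many more will, which is the misclassification risk the paragraph above describes.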
import numpy as np
import matplotlib.pyplot as plt
from sklearn.covariance import EllipticEnvelope

# Generate synthetic 2D data
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X = np.r_[X + 2, X - 2]  # Two clusters

# Add some outliers
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X_full = np.vstack([X, X_outliers])

# Fit the Elliptic Envelope
envelope = EllipticEnvelope(contamination=0.1)
envelope.fit(X_full)
y_pred = envelope.predict(X_full)

# Plot the data and decision boundary
plt.figure(figsize=(8, 6))
plt.scatter(X_full[y_pred == 1, 0], X_full[y_pred == 1, 1], color="blue", label="Inliers")
plt.scatter(X_full[y_pred == -1, 0], X_full[y_pred == -1, 1], color="red", label="Outliers")

# Plot the decision ellipse
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
Z = envelope.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="black")
plt.title("Elliptic Envelope: Robust Covariance Outlier Detection")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()