Isolation Forest: Tree-Based Anomaly Detection
Isolation Forest is a powerful tree-based method for anomaly detection that relies on the principle of isolating data points through random partitioning.
The core intuition is straightforward: anomalies are data points that are few and different, making them easier to separate from the rest of the data.
Instead of modeling the distribution of normal data, Isolation Forest constructs an ensemble of random trees. Each tree recursively splits the data by randomly selecting a feature, then randomly choosing a split value between the minimum and maximum values of that feature. This process continues until each data point is isolated in its own partition.
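The splitting procedure for a single tree can be sketched as a short recursive function. This is an illustrative simplification, not a library API: the function name `isolation_path_length` and the `max_depth` cap are choices made here for clarity.

```python
import numpy as np

def isolation_path_length(X, point, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` inside sample X."""
    # Stop once the point is alone in its partition (or the depth cap is hit)
    if len(X) <= 1 or depth >= max_depth:
        return depth
    # Randomly pick a feature, then a split value within that feature's range
    feat = rng.integers(X.shape[1])
    lo, hi = X[:, feat].min(), X[:, feat].max()
    if lo == hi:  # all remaining points are identical on this feature
        return depth
    split = rng.uniform(lo, hi)
    # Descend into the side of the split that contains the query point
    if point[feat] < split:
        subset = X[X[:, feat] < split]
    else:
        subset = X[X[:, feat] >= split]
    return isolation_path_length(subset, point, rng, depth + 1, max_depth)

# Averaging over many random trees: an obvious outlier isolates much faster
data = np.vstack([np.random.default_rng(0).normal(size=(200, 2)), [[8.0, 8.0]]])
outlier, inlier = data[-1], data[0]
out_depths = [isolation_path_length(data, outlier, np.random.default_rng(s)) for s in range(100)]
in_depths = [isolation_path_length(data, inlier, np.random.default_rng(s)) for s in range(100)]
print(np.mean(out_depths), np.mean(in_depths))
```

Running this, the average depth for the outlier comes out well below that of a point inside the cluster, which is exactly the signal Isolation Forest turns into a score.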
Through this random partitioning, data points that are anomalies tend to be isolated much sooner than normal points, meaning they require fewer splits to be separated from the rest. This is because anomalies are more likely to have attribute values that are very different from those of the majority of the data. In contrast, normal points are typically located in dense regions and require more splits to be isolated.
In Isolation Forest, the path length (the number of splits required to isolate a point) serves as the basis for the anomaly score. A shorter average path length across the random trees indicates a higher likelihood of being an anomaly, while a longer path length suggests the point resembles the rest of the data. The anomaly score therefore reflects how quickly a point is separated from the rest.
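In practice you rarely build these trees by hand: scikit-learn provides the full ensemble as `sklearn.ensemble.IsolationForest`. A minimal sketch, assuming scikit-learn is installed (the dataset here mirrors the one used in the plotting example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same setup as before: a normal cluster plus two clear outliers
rng = np.random.default_rng(42)
X_normal = rng.normal(size=(100, 2))
X_outlier = np.array([[6.0, 6.0], [-6.0, -6.0]])
X = np.vstack([X_normal, X_outlier])

# Fit an ensemble of 100 random isolation trees
forest = IsolationForest(n_estimators=100, random_state=42).fit(X)

# score_samples returns higher values for normal points, lower for anomalies
scores = forest.score_samples(X)
print("mean normal score:", scores[:100].mean())
print("outlier scores:   ", scores[100:])

# predict labels inliers as 1 and outliers as -1
labels = forest.predict(X)
```

The two planted outliers receive markedly lower `score_samples` values than the cluster points, and `predict` flags them as `-1`.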
import numpy as np
import matplotlib.pyplot as plt

# Generate a simple 2D dataset: a normal cluster plus two clear outliers
np.random.seed(42)
X_normal = np.random.randn(50, 2)
X_outlier = np.array([[6, 6], [-6, -6]])
X = np.vstack([X_normal, X_outlier])

# Recursively partition the data with random axis-aligned splits and plot them
def plot_partition(ax, X, depth=0, max_depth=3, bounds=None):
    if depth == max_depth or len(X) <= 1:
        return
    if bounds is None:
        x_min, y_min = X.min(axis=0)
        x_max, y_max = X.max(axis=0)
        bounds = [x_min, x_max, y_min, y_max]
    # Randomly choose a split feature and a split value within its range
    feat = np.random.choice([0, 1])
    split = np.random.uniform(X[:, feat].min(), X[:, feat].max())
    if feat == 0:
        # Vertical split on the x-axis
        ax.plot([split, split], [bounds[2], bounds[3]], 'r--', alpha=0.6)
        left = X[X[:, 0] < split]
        right = X[X[:, 0] >= split]
        plot_partition(ax, left, depth + 1, max_depth, [bounds[0], split, bounds[2], bounds[3]])
        plot_partition(ax, right, depth + 1, max_depth, [split, bounds[1], bounds[2], bounds[3]])
    else:
        # Horizontal split on the y-axis
        ax.plot([bounds[0], bounds[1]], [split, split], 'b--', alpha=0.6)
        below = X[X[:, 1] < split]
        above = X[X[:, 1] >= split]
        plot_partition(ax, below, depth + 1, max_depth, [bounds[0], bounds[1], bounds[2], split])
        plot_partition(ax, above, depth + 1, max_depth, [bounds[0], bounds[1], split, bounds[3]])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_normal[:, 0], X_normal[:, 1], label="Normal", c="tab:blue")
ax.scatter(X_outlier[:, 0], X_outlier[:, 1], label="Outlier", c="tab:red")
plot_partition(ax, X, max_depth=3)
ax.legend()
ax.set_title("Isolation Forest: Random Partitioning in 2D")
plt.show()