Isolation Forest: Tree-Based Anomaly Detection
Isolation Forest is a powerful tree-based method for anomaly detection that relies on the principle of isolating data points through random partitioning.
The core intuition is straightforward: anomalies are data points that are few and different, making them easier to separate from the rest of the data.
Instead of modeling the distribution of normal data, Isolation Forest constructs an ensemble of random trees. Each tree recursively splits the data by randomly selecting a feature, then randomly choosing a split value between the minimum and maximum values of that feature. This process continues until each data point is isolated in its own partition.
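The splitting procedure for a single tree can be sketched as a short recursive function. This is an illustrative simplification, not a library API: the function name `isolation_path_length` and the `max_depth` cap are choices made here for clarity.

```python
import numpy as np

def isolation_path_length(X, point, rng, depth=0, max_depth=10):
    """Count the random splits needed to isolate `point` inside sample X."""
    # Stop once the point is alone in its partition (or the depth cap is hit)
    if len(X) <= 1 or depth >= max_depth:
        return depth
    # Randomly pick a feature, then a split value within that feature's range
    feat = rng.integers(X.shape[1])
    lo, hi = X[:, feat].min(), X[:, feat].max()
    if lo == hi:  # all remaining points are identical on this feature
        return depth
    split = rng.uniform(lo, hi)
    # Descend into the side of the split that contains the query point
    if point[feat] < split:
        subset = X[X[:, feat] < split]
    else:
        subset = X[X[:, feat] >= split]
    return isolation_path_length(subset, point, rng, depth + 1, max_depth)

# Averaging over many random trees: an obvious outlier isolates much faster
data = np.vstack([np.random.default_rng(0).normal(size=(200, 2)), [[8.0, 8.0]]])
outlier, inlier = data[-1], data[0]
out_depths = [isolation_path_length(data, outlier, np.random.default_rng(s)) for s in range(100)]
in_depths = [isolation_path_length(data, inlier, np.random.default_rng(s)) for s in range(100)]
print(np.mean(out_depths), np.mean(in_depths))
```

Running this, the average depth for the outlier comes out well below that of a point inside the cluster, which is exactly the signal Isolation Forest turns into a score.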
Through this random partitioning, data points that are anomalies tend to be isolated much sooner than normal points, meaning they require fewer splits to be separated from the rest. This is because anomalies are more likely to have attribute values that are very different from those of the majority of the data. In contrast, normal points are typically located in dense regions and require more splits to be isolated.
In Isolation Forest, the path length (the number of splits required to isolate a point) serves as the basis for the anomaly score. A shorter average path length across the random trees indicates a higher likelihood of being an anomaly, while a longer path length suggests the point resembles the rest of the data. The anomaly score therefore reflects how quickly a point is separated from the rest.
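In practice you rarely build these trees by hand: scikit-learn provides the full ensemble as `sklearn.ensemble.IsolationForest`. A minimal sketch, assuming scikit-learn is installed (the dataset here mirrors the one used in the plotting example):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same setup as before: a normal cluster plus two clear outliers
rng = np.random.default_rng(42)
X_normal = rng.normal(size=(100, 2))
X_outlier = np.array([[6.0, 6.0], [-6.0, -6.0]])
X = np.vstack([X_normal, X_outlier])

# Fit an ensemble of 100 random isolation trees
forest = IsolationForest(n_estimators=100, random_state=42).fit(X)

# score_samples returns higher values for normal points, lower for anomalies
scores = forest.score_samples(X)
print("mean normal score:", scores[:100].mean())
print("outlier scores:   ", scores[100:])

# predict labels inliers as 1 and outliers as -1
labels = forest.predict(X)
```

The two planted outliers receive markedly lower `score_samples` values than the cluster points, and `predict` flags them as `-1`.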
import numpy as np
import matplotlib.pyplot as plt

# Generate a simple 2D dataset: a normal cluster plus two clear outliers
np.random.seed(42)
X_normal = np.random.randn(50, 2)
X_outlier = np.array([[6, 6], [-6, -6]])
X = np.vstack([X_normal, X_outlier])

# Recursively partition the data with random axis-aligned splits and plot them
def plot_partition(ax, X, depth=0, max_depth=3, bounds=None):
    if depth == max_depth or len(X) <= 1:
        return
    if bounds is None:
        x_min, y_min = X.min(axis=0)
        x_max, y_max = X.max(axis=0)
        bounds = [x_min, x_max, y_min, y_max]
    # Randomly choose a split feature and a split value within its range
    feat = np.random.choice([0, 1])
    split = np.random.uniform(X[:, feat].min(), X[:, feat].max())
    if feat == 0:
        # Vertical split on the x-axis
        ax.plot([split, split], [bounds[2], bounds[3]], 'r--', alpha=0.6)
        left = X[X[:, 0] < split]
        right = X[X[:, 0] >= split]
        plot_partition(ax, left, depth + 1, max_depth, [bounds[0], split, bounds[2], bounds[3]])
        plot_partition(ax, right, depth + 1, max_depth, [split, bounds[1], bounds[2], bounds[3]])
    else:
        # Horizontal split on the y-axis
        ax.plot([bounds[0], bounds[1]], [split, split], 'b--', alpha=0.6)
        below = X[X[:, 1] < split]
        above = X[X[:, 1] >= split]
        plot_partition(ax, below, depth + 1, max_depth, [bounds[0], bounds[1], bounds[2], split])
        plot_partition(ax, above, depth + 1, max_depth, [bounds[0], bounds[1], split, bounds[3]])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_normal[:, 0], X_normal[:, 1], label="Normal", c="tab:blue")
ax.scatter(X_outlier[:, 0], X_outlier[:, 1], label="Outlier", c="tab:red")
plot_partition(ax, X, max_depth=3)
ax.legend()
ax.set_title("Isolation Forest: Random Partitioning in 2D")
plt.show()