Hyperparameters and Anomaly Scores
Isolation Forest is a powerful algorithm for detecting outliers by isolating observations in the feature space.
To use it effectively, you need to understand its main hyperparameters:
- contamination
- n_estimators
- max_samples
The contamination parameter specifies the expected proportion of outliers in your dataset. By setting contamination, you tell Isolation Forest how many data points you expect to be anomalies, and this value directly determines the threshold used to classify points as outliers. For example, with contamination=0.1 the algorithm flags roughly the 10% most anomalous points.
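A minimal sketch of this behavior, assuming scikit-learn and purely synthetic data: with contamination=0.1, the fraction of training points predicted as outliers comes out close to 10%.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # synthetic 2-D data, no labels needed

# contamination=0.1 asks the model to flag roughly 10% of points
clf = IsolationForest(contamination=0.1, random_state=0).fit(X)
pred = clf.predict(X)  # +1 = inlier, -1 = outlier

print((pred == -1).mean())  # close to 0.10
```

The fraction is approximate rather than exact because the cutoff is a quantile of the training scores, which can land between tied or closely spaced values.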
The n_estimators parameter controls the number of trees in the Isolation Forest. More trees usually make the anomaly scores more stable and accurate, but also require more computation. Typical values range from 100 to 300.
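One way to see the stability effect, sketched here with scikit-learn on synthetic data: score the same points with a small forest and a larger one, and compare. The two sets of scores should be highly correlated, with the larger forest giving the smoother estimate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(300, 2)  # synthetic data

# Score the same points with forests of different sizes
scores = {}
for n in (50, 200):
    clf = IsolationForest(n_estimators=n, random_state=0).fit(X)
    scores[n] = clf.score_samples(X)

# The two rankings largely agree; more trees mainly reduce score noise
print(np.corrcoef(scores[50], scores[200])[0, 1])
```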
The max_samples parameter determines how many samples are used to build each tree. If set to "auto", it uses the minimum of 256 or the number of samples in your dataset. You can also specify an integer or a float (as a fraction of the dataset). Lower values make trees more random and diverse, while higher values may improve detection in larger datasets.
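The three accepted forms of max_samples can be checked directly, assuming scikit-learn: after fitting, the estimator exposes the resolved per-tree sample count as the max_samples_ attribute.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)  # 1000 synthetic samples

# "auto" -> min(256, n_samples); an int is used as-is; a float is a fraction
resolved = {}
for ms in ("auto", 256, 0.25):
    clf = IsolationForest(max_samples=ms, random_state=0).fit(X)
    resolved[ms] = clf.max_samples_  # actual samples drawn per tree

print(resolved)  # {'auto': 256, 256: 256, 0.25: 250}
```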
The Isolation Forest assigns an anomaly score to each data point. In the original algorithm, higher scores indicate likelier outliers; be aware that scikit-learn's score_samples and decision_function invert this convention, so there lower (more negative) values mark anomalies. To flag outliers, you set a threshold on the anomaly score, usually derived from the contamination parameter: if you expect 5% of your data to be outliers, set contamination=0.05 and let the algorithm use this to define the cutoff. Always review the flagged points and adjust the threshold if you notice too many or too few anomalies.
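The cutoff can also be set by hand instead of through contamination. A hedged sketch with scikit-learn: take score_samples (where lower means more anomalous) and flag everything below the 5% quantile, which mirrors contamination=0.05.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a tight cluster plus a few scattered points
rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, (10, 2))]

clf = IsolationForest(random_state=0).fit(X)
scores = clf.score_samples(X)  # in scikit-learn, LOWER = more anomalous

# Manual cutoff equivalent to contamination=0.05: flag the lowest 5% of scores
threshold = np.quantile(scores, 0.05)
outliers = scores <= threshold
print(outliers.sum())  # roughly 5% of the 210 points
```

Because the threshold is just a quantile of the scores, you can re-examine the flagged points and move it up or down without refitting the model.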
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Create a synthetic dataset
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(200, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X, X_outliers]

# Try two different contamination and n_estimators settings
settings = [
    {"contamination": 0.05, "n_estimators": 50, "label": "5% contamination, 50 trees"},
    {"contamination": 0.15, "n_estimators": 200, "label": "15% contamination, 200 trees"}
]

plt.figure(figsize=(10, 5))
for i, params in enumerate(settings, 1):
    clf = IsolationForest(
        contamination=params["contamination"],
        n_estimators=params["n_estimators"],
        random_state=42
    )
    clf.fit(X)
    y_pred = clf.predict(X)

    plt.subplot(1, 2, i)
    plt.title(params["label"])
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap="coolwarm", edgecolor="k", s=40)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")

plt.tight_layout()
plt.show()
```