Hyperparameters and Anomaly Scores
Isolation Forest is a powerful algorithm for detecting outliers by isolating observations in the feature space.
To use it effectively, you need to understand its main hyperparameters:
- contamination
- n_estimators
- max_samples
The contamination parameter specifies the expected proportion of outliers in your dataset. By setting contamination, you tell Isolation Forest how many data points you expect to be anomalies, and this value directly determines the threshold used to classify points as outliers. For example, with contamination=0.1 the algorithm flags roughly the 10% most anomalous points.
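A minimal sketch of this behavior, assuming scikit-learn and purely synthetic data: with contamination=0.1, the fraction of training points predicted as outliers comes out close to 10%.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # synthetic 2-D data, no labels needed

# contamination=0.1 asks the model to flag roughly 10% of points
clf = IsolationForest(contamination=0.1, random_state=0).fit(X)
pred = clf.predict(X)  # +1 = inlier, -1 = outlier

print((pred == -1).mean())  # close to 0.10
```

The fraction is approximate rather than exact because the cutoff is a quantile of the training scores, which can land between tied or closely spaced values.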
The n_estimators parameter controls the number of trees in the Isolation Forest. More trees usually make the anomaly scores more stable and accurate, but also require more computation. Typical values range from 100 to 300.
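One way to see the stability effect, sketched here with scikit-learn on synthetic data: score the same points with a small forest and a larger one, and compare. The two sets of scores should be highly correlated, with the larger forest giving the smoother estimate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(300, 2)  # synthetic data

# Score the same points with forests of different sizes
scores = {}
for n in (50, 200):
    clf = IsolationForest(n_estimators=n, random_state=0).fit(X)
    scores[n] = clf.score_samples(X)

# The two rankings largely agree; more trees mainly reduce score noise
print(np.corrcoef(scores[50], scores[200])[0, 1])
```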
The max_samples parameter determines how many samples are used to build each tree. If set to "auto", it uses the minimum of 256 or the number of samples in your dataset. You can also specify an integer or a float (as a fraction of the dataset). Lower values make trees more random and diverse, while higher values may improve detection in larger datasets.
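The three accepted forms of max_samples can be checked directly, assuming scikit-learn: after fitting, the estimator exposes the resolved per-tree sample count as the max_samples_ attribute.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)  # 1000 synthetic samples

# "auto" -> min(256, n_samples); an int is used as-is; a float is a fraction
resolved = {}
for ms in ("auto", 256, 0.25):
    clf = IsolationForest(max_samples=ms, random_state=0).fit(X)
    resolved[ms] = clf.max_samples_  # actual samples drawn per tree

print(resolved)  # {'auto': 256, 256: 256, 0.25: 250}
```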
The Isolation Forest assigns an anomaly score to each data point. In the original algorithm, higher scores indicate likelier outliers; be aware that scikit-learn's score_samples and decision_function invert this convention, so there lower (more negative) values mark anomalies. To flag outliers, you set a threshold on the anomaly score, usually derived from the contamination parameter: if you expect 5% of your data to be outliers, set contamination=0.05 and let the algorithm use this to define the cutoff. Always review the flagged points and adjust the threshold if you notice too many or too few anomalies.
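The cutoff can also be set by hand instead of through contamination. A hedged sketch with scikit-learn: take score_samples (where lower means more anomalous) and flag everything below the 5% quantile, which mirrors contamination=0.05.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a tight cluster plus a few scattered points
rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, (10, 2))]

clf = IsolationForest(random_state=0).fit(X)
scores = clf.score_samples(X)  # in scikit-learn, LOWER = more anomalous

# Manual cutoff equivalent to contamination=0.05: flag the lowest 5% of scores
threshold = np.quantile(scores, 0.05)
outliers = scores <= threshold
print(outliers.sum())  # roughly 5% of the 210 points
```

Because the threshold is just a quantile of the scores, you can re-examine the flagged points and move it up or down without refitting the model.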
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Create a synthetic dataset
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(200, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X, X_outliers]

# Try two different contamination and n_estimators settings
settings = [
    {"contamination": 0.05, "n_estimators": 50, "label": "5% contamination, 50 trees"},
    {"contamination": 0.15, "n_estimators": 200, "label": "15% contamination, 200 trees"}
]

plt.figure(figsize=(10, 5))
for i, params in enumerate(settings, 1):
    clf = IsolationForest(
        contamination=params["contamination"],
        n_estimators=params["n_estimators"],
        random_state=42
    )
    clf.fit(X)
    y_pred = clf.predict(X)

    plt.subplot(1, 2, i)
    plt.title(params["label"])
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap="coolwarm", edgecolor="k", s=40)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")

plt.tight_layout()
plt.show()
```