Outlier and Novelty Detection in Practice

Hyperparameters and Anomaly Scores

Isolation Forest detects outliers by isolating observations with random splits of the feature space: points that can be separated from the rest in only a few splits are likely anomalies.

To use it effectively, you need to understand its main hyperparameters:

  • contamination;
  • n_estimators;
  • max_samples.

The contamination parameter specifies the proportion of outliers you expect in your dataset. It does not change how the trees are built or how points are scored; it only sets the score threshold used to classify points as outliers. For example, with contamination=0.1, roughly the 10% of training points with the most anomalous scores are flagged. In scikit-learn, contamination must be a float in the range (0, 0.5] or "auto" (the default), which uses a fixed threshold taken from the original Isolation Forest paper instead of a percentage.
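As a quick check of the paragraph above, the following sketch fits scikit-learn's IsolationForest with a few contamination values and reports the share of training points that predict labels as outliers. The synthetic dataset and the variable name flagged_fraction are illustrative assumptions, not part of the lesson.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative synthetic data: a tight cluster plus 20 scattered points
rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, size=(20, 2))]

flagged_fraction = {}
for contamination in (0.05, 0.10, 0.20):
    clf = IsolationForest(contamination=contamination, random_state=0).fit(X)
    labels = clf.predict(X)  # -1 = outlier, 1 = inlier
    flagged_fraction[contamination] = float(np.mean(labels == -1))
    print(
        f"contamination={contamination:.2f} -> "
        f"flagged {flagged_fraction[contamination]:.1%} of training points"
    )
```

The flagged share should track the contamination value closely on the training data, because the threshold is chosen as a percentile of the training scores. On new data the share can differ.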

The n_estimators parameter controls the number of trees in the Isolation Forest. More trees usually make the anomaly scores more stable, because each score is an average over all trees, but they also require more computation and memory. The scikit-learn default is 100, and typical values range from 100 to 300.
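One way to see the stability effect is to refit the model with different random seeds and measure how much each point's score moves. The sketch below does this for a small and a large forest; the helper score_spread, the seed count, and the dataset are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative synthetic data: a tight cluster plus 20 scattered points
rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, size=(20, 2))]


def score_spread(n_estimators, seeds=range(8)):
    """Mean per-point standard deviation of the scores across random seeds."""
    scores = np.array([
        IsolationForest(n_estimators=n_estimators, random_state=seed)
        .fit(X)
        .score_samples(X)
        for seed in seeds
    ])
    return float(scores.std(axis=0).mean())


spread_small = score_spread(10)
spread_large = score_spread(200)
print(f"10 trees:  mean score spread across seeds = {spread_small:.4f}")
print(f"200 trees: mean score spread across seeds = {spread_large:.4f}")
```

A smaller spread means the ranking of points depends less on the random seed, which is the practical benefit of adding trees.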

The max_samples parameter determines how many samples are drawn to build each tree. If set to "auto", each tree uses min(256, n_samples) samples. You can also pass an integer (an absolute count) or a float (a fraction of the dataset). Lower values make trees shallower, more random, and more diverse, while higher values may improve detection in larger datasets.
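The three ways of setting max_samples can be compared by reading the fitted estimator's max_samples_ attribute, which holds the subsample size actually used. This is a minimal sketch; the helper effective_subsample and the dataset sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_large = rng.randn(1000, 2)  # more than 256 rows
X_small = rng.randn(100, 2)   # fewer than 256 rows


def effective_subsample(X, max_samples):
    """Fit a small forest and return the per-tree subsample size it used."""
    clf = IsolationForest(
        max_samples=max_samples, n_estimators=10, random_state=0
    ).fit(X)
    return clf.max_samples_


auto_large = effective_subsample(X_large, "auto")  # min(256, 1000)
auto_small = effective_subsample(X_small, "auto")  # min(256, 100)
as_int = effective_subsample(X_large, 64)          # absolute count
as_float = effective_subsample(X_large, 0.5)       # fraction of 1000 rows
print(auto_large, auto_small, as_int, as_float)
```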

Note

Isolation Forest assigns an anomaly score to each data point. The sign convention depends on the implementation: in the original paper, scores close to 1 indicate anomalies, whereas scikit-learn's score_samples and decision_function return the opposite, so lower values indicate more abnormal points, and predict returns -1 for outliers and 1 for inliers. To flag outliers, a threshold is applied to the score, and it is usually determined by the contamination parameter. If you expect 5% of your data to be outliers, set contamination=0.05 and let the algorithm use this to define the cutoff. Always review the flagged points and adjust the threshold if you notice too many or too few anomalies.
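The sketch below shows how scores, threshold, and labels relate in scikit-learn specifically, where lower scores mean more abnormal points. It also shows one possible way to apply a custom cutoff without refitting. The dataset, the 10% cutoff, and the names custom_cutoff and custom_labels are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative synthetic data: a tight cluster plus 20 scattered points
rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, size=(20, 2))]

clf = IsolationForest(contamination=0.05, random_state=0).fit(X)

raw = clf.score_samples(X)          # lower = more abnormal
shifted = clf.decision_function(X)  # raw - clf.offset_; negative = outlier
labels = clf.predict(X)             # -1 where shifted < 0, otherwise 1

# A point far from the cluster should score lower than one at its centre
far_score, centre_score = clf.score_samples(np.array([[4.0, 4.0], [0.0, 0.0]]))
print(f"far point: {far_score:.3f}, centre point: {centre_score:.3f}")

# Custom cutoff without refitting: flag the lowest 10% of raw scores
custom_cutoff = np.quantile(raw, 0.10)
custom_labels = np.where(raw < custom_cutoff, -1, 1)
print(f"contamination=0.05 flags {(labels == -1).sum()} points")
print(f"custom 10% cutoff flags {(custom_labels == -1).sum()} points")
```

Thresholding score_samples directly keeps the fitted trees unchanged, so it is a cheap way to try several cutoffs while reviewing the flagged points.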

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Create a synthetic dataset
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(200, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X, X_outliers]

# Try two different contamination and n_estimators settings
settings = [
    {"contamination": 0.05, "n_estimators": 50, "label": "5% contamination, 50 trees"},
    {"contamination": 0.15, "n_estimators": 200, "label": "15% contamination, 200 trees"}
]

plt.figure(figsize=(10, 5))
for i, params in enumerate(settings, 1):
    clf = IsolationForest(
        contamination=params["contamination"],
        n_estimators=params["n_estimators"],
        random_state=42
    )
    clf.fit(X)
    y_pred = clf.predict(X)
    plt.subplot(1, 2, i)
    plt.title(params["label"])
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap="coolwarm", edgecolor="k", s=40)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
plt.tight_layout()
plt.show()
```

Which statement best describes how the contamination parameter affects Isolation Forest?

Section 3. Chapter 2
