Learn Hyperparameters and Anomaly Scores | Isolation-Based Methods
Outlier and Novelty Detection in Practice

Hyperparameters and Anomaly Scores

Isolation Forest detects outliers by isolating observations with random splits of the feature space. Points that can be separated from the rest in only a few splits are more likely to be anomalies.

To use it effectively, you need to understand its main hyperparameters:

  • contamination;
  • n_estimators;
  • max_samples.

The contamination parameter is your estimate of the proportion of outliers in the dataset. It does not change how the trees are built or how points are scored; it only sets the score threshold used to classify points as outliers. For example, if you set contamination=0.1, roughly the 10% of training points with the most anomalous scores are flagged.
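As a rough check of this behavior, the sketch below fits scikit-learn's IsolationForest with several contamination values and measures the share of training points labeled -1. The synthetic data and the names X_demo and flagged_fraction are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): a tight cluster plus scattered points
rng = np.random.RandomState(0)
X_demo = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, size=(20, 2))]

flagged_fraction = {}
for contamination in (0.05, 0.10, 0.20):
    model = IsolationForest(contamination=contamination, random_state=0).fit(X_demo)
    labels = model.predict(X_demo)  # -1 = outlier, 1 = inlier
    flagged_fraction[contamination] = float(np.mean(labels == -1))
    print(f"contamination={contamination:.2f} -> "
          f"flagged {flagged_fraction[contamination]:.1%} of training points")
```

The flagged share should track the contamination value closely on the training data; on new data it can differ, because the threshold is fixed at fit time.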

The n_estimators parameter controls the number of trees in the Isolation Forest. More trees usually make the anomaly scores more stable from run to run, but also require more computation. The scikit-learn default is 100, and typical values range from 100 to 300.
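One way to see the stability effect is to refit the model with several random seeds and measure how much each point's score varies. The sketch below does this for a small and a large forest; the helper score_spread, the seed count, and the synthetic data are hypothetical choices, not part of the lesson.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X_demo = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, size=(20, 2))]


def score_spread(n_estimators, seeds=range(5)):
    """Mean per-point standard deviation of the scores across random seeds."""
    scores = np.array([
        IsolationForest(n_estimators=n_estimators, random_state=seed)
        .fit(X_demo)
        .score_samples(X_demo)
        for seed in seeds
    ])
    return float(scores.std(axis=0).mean())


spread = {n: score_spread(n) for n in (10, 200)}
for n, s in spread.items():
    print(f"n_estimators={n:3d} -> mean score std across seeds: {s:.4f}")
```

A smaller spread means the scores depend less on the random seed, which is the practical benefit of adding trees.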

The max_samples parameter determines how many samples are used to build each tree. If set to "auto", it uses the minimum of 256 or the number of samples in your dataset. You can also specify an integer or a float (as a fraction of the dataset). Lower values make trees more random and diverse, while higher values may improve detection in larger datasets.
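To see how these options resolve in practice, the sketch below fits three models and reads the fitted max_samples_ attribute, which scikit-learn sets to the actual number of samples drawn per tree. The 1000-point dataset and the name resolved are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 1000 synthetic points (an assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.randn(1000, 2)

resolved = {}
for max_samples in ("auto", 64, 0.5):
    model = IsolationForest(max_samples=max_samples, random_state=0).fit(X_demo)
    resolved[max_samples] = model.max_samples_  # samples actually drawn per tree
    print(f"max_samples={max_samples!r} -> {model.max_samples_} samples per tree")
```

With 1000 rows, "auto" resolves to 256, the integer is used as given, and the float 0.5 resolves to half of the dataset.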

Note

Isolation Forest assigns an anomaly score to each data point. Be careful with the sign convention: in the original formulation of the algorithm, higher scores suggest possible outliers, but scikit-learn's score_samples and decision_function return the opposite, so lower values mean more abnormal and points with a negative decision_function value are labeled as outliers. To flag outliers, you set a threshold on the score, usually through the contamination parameter. If you expect 5% of your data to be outliers, set contamination=0.05 and let the algorithm use this to define the cutoff. Always review the flagged points and adjust the threshold if you notice too many or too few anomalies.
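The relationship between scores, threshold, and labels can be sketched as follows. The snippet assumes scikit-learn's IsolationForest and synthetic data; the names X_demo, raw_scores, and custom_labels, and the choice of flagging exactly 10 points, are hypothetical and only illustrate how a manual cutoff could replace the contamination-based one.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X_demo = np.r_[0.3 * rng.randn(200, 2), rng.uniform(-4, 4, size=(20, 2))]

model = IsolationForest(contamination=0.05, random_state=0).fit(X_demo)

raw_scores = model.score_samples(X_demo)    # lower = more abnormal
decision = model.decision_function(X_demo)  # raw_scores shifted by model.offset_
labels = model.predict(X_demo)              # -1 where decision < 0, else 1

print("threshold on raw scores (offset_):", model.offset_)
print("points flagged by predict:", int(np.sum(labels == -1)))

# Manual alternative: flag the 10 lowest-scoring points regardless of contamination
lowest = np.argsort(raw_scores)[:10]
custom_labels = np.ones(len(X_demo), dtype=int)
custom_labels[lowest] = -1
print("points flagged by custom cutoff:", int(np.sum(custom_labels == -1)))
```

Working from score_samples directly keeps the fitted model unchanged while you experiment with different cutoffs.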

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Create a synthetic dataset
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(200, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X, X_outliers]

# Try two different contamination and n_estimators settings
settings = [
    {"contamination": 0.05, "n_estimators": 50, "label": "5% contamination, 50 trees"},
    {"contamination": 0.15, "n_estimators": 200, "label": "15% contamination, 200 trees"}
]

plt.figure(figsize=(10, 5))
for i, params in enumerate(settings, 1):
    clf = IsolationForest(
        contamination=params["contamination"],
        n_estimators=params["n_estimators"],
        random_state=42
    )
    clf.fit(X)
    y_pred = clf.predict(X)
    plt.subplot(1, 2, i)
    plt.title(params["label"])
    plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap="coolwarm", edgecolor="k", s=40)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
plt.tight_layout()
plt.show()
```

Which statement best describes how the contamination parameter affects Isolation Forest?


Section 3. Chapter 2
