Outlier and Novelty Detection in Practice

Metrics for Outlier Detection

To evaluate outlier detection models, focus on precision, recall, F1-score, and contamination rate:

  • Precision: proportion of detected outliers that are truly outliers;
  • Recall: proportion of true outliers your model finds;
  • F1-score: harmonic mean of precision and recall, balancing their trade-off;
  • Contamination rate: fraction of points labeled as outliers, often set in advance due to the rarity of true outliers.
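
To make these definitions concrete, here is a minimal sketch that computes all four metrics from raw counts; the counts themselves (12 true positives, 3 false positives, 5 false negatives out of 200 points) are invented purely for illustration:

# Hypothetical results of an outlier detector run on 200 points
tp = 12  # flagged points that really are outliers
fp = 3   # flagged points that are actually inliers
fn = 5   # true outliers the detector missed
n_points = 200

precision = tp / (tp + fp)                    # 12 / 15 = 0.80
recall = tp / (tp + fn)                       # 12 / 17 ≈ 0.71
f1 = 2 * precision * recall / (precision + recall)
contamination_rate = (tp + fp) / n_points     # fraction of points flagged as outliers

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(f"Contamination rate: {contamination_rate:.2f}")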

If you label many points as outliers, recall increases but precision drops due to more false positives. Labeling fewer points increases precision but lowers recall. The F1-score shows the balance between these extremes. Contamination rate lets you control how many points your model flags, which is crucial when true outliers are rare.
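
One way to see this trade-off directly is to run the same detector with different contamination settings and compare the resulting precision and recall. The sketch below uses the same kind of synthetic data as the full example later in this chapter (one Gaussian blob plus 20 scattered outliers); the specific contamination values are chosen only for illustration, and exact scores will vary:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# One dense cluster plus 20 scattered points treated as true outliers
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
rng = np.random.RandomState(42)
X_full = np.vstack([X, rng.uniform(low=-6, high=6, size=(20, 2))])
y_true = np.zeros(len(X_full), dtype=int)
y_true[-20:] = 1

# Flag more or fewer points by varying the contamination parameter
for contamination in [0.02, 0.09, 0.25]:
    model = IsolationForest(contamination=contamination, random_state=42)
    y_pred = (model.fit_predict(X_full) == -1).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"contamination={contamination:.2f} -> precision={p:.2f}, recall={r:.2f}")

Typically, the smallest setting flags only the most extreme points (high precision, low recall), while larger settings flag more points, recovering more true outliers at the cost of more false positives.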

Note

There is always a trade-off between precision and recall, especially in rare event detection like outlier analysis. Increasing recall often decreases precision, and vice versa. Choosing the right balance depends on your application: in fraud detection, missing a true fraud (low recall) may be worse than investigating a few false alarms (lower precision), while in medical diagnostics, too many false positives can overwhelm resources.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

# Create synthetic data: one dense cluster plus some uniform outliers
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X_full = np.vstack([X, outliers])
y_true = np.zeros(X_full.shape[0], dtype=int)
y_true[-20:] = 1  # Mark last 20 points as true outliers

# Fit Isolation Forest with the expected contamination rate (20 outliers out of 220 points)
model = IsolationForest(contamination=20 / 220, random_state=42)
y_pred = model.fit_predict(X_full)  # IsolationForest returns -1 for outliers, 1 for inliers
y_pred = (y_pred == -1).astype(int)

# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
contamination_rate = y_pred.sum() / len(y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(f"Contamination rate: {contamination_rate:.2f}")

Which metric would you adjust if you want to reduce the number of false positives in your outlier detection model?

Select the correct answer

