Metrics for Outlier Detection
To evaluate outlier detection models, focus on precision, recall, F1-score, and contamination rate:
- Precision: proportion of detected outliers that are truly outliers;
- Recall: proportion of true outliers your model finds;
- F1-score: harmonic mean of precision and recall, balancing their trade-off;
- Contamination rate: fraction of points labeled as outliers, often set in advance due to the rarity of true outliers.
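All four metrics can be computed with `sklearn.metrics`. A minimal sketch on a hypothetical hand-made label vector (the labels below are invented for illustration, not from any model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for 10 points: 1 = outlier, 0 = inlier
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # 3 true outliers
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0])  # model flags 3 points

precision = precision_score(y_true, y_pred)  # 2 of the 3 flagged are true outliers
recall = recall_score(y_true, y_pred)        # 2 of the 3 true outliers were found
f1 = f1_score(y_true, y_pred)
contamination_rate = y_pred.mean()           # fraction of points flagged: 3/10

print(precision, recall, f1, contamination_rate)
```

Here precision, recall, and F1 all come out to 2/3, and the contamination rate is 0.30.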
If you label many points as outliers, recall increases but precision drops due to more false positives. Labeling fewer points increases precision but lowers recall. The F1-score shows the balance between these extremes. Contamination rate lets you control how many points your model flags, which is crucial when true outliers are rare.
There is always a trade-off between precision and recall, especially in rare event detection like outlier analysis. Increasing recall often decreases precision, and vice versa. Choosing the right balance depends on your application: in fraud detection, missing a true fraud (low recall) may be worse than investigating a few false alarms (lower precision), while in medical diagnostics, too many false positives can overwhelm resources.
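One way to see this trade-off directly is to sweep the `contamination` parameter of `IsolationForest`: a higher value flags more points, which tends to raise recall at the cost of precision. A sketch using the same synthetic setup as the full example below (the specific contamination values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# One dense cluster plus 20 uniform outliers, seeded for reproducibility
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
outliers = np.random.RandomState(42).uniform(low=-6, high=6, size=(20, 2))
X_full = np.vstack([X, outliers])
y_true = np.zeros(len(X_full), dtype=int)
y_true[-20:] = 1  # last 20 points are the true outliers

results = []
for contamination in [0.05, 0.09, 0.20]:
    model = IsolationForest(contamination=contamination, random_state=42)
    y_pred = (model.fit_predict(X_full) == -1).astype(int)
    results.append((contamination,
                    precision_score(y_true, y_pred),
                    recall_score(y_true, y_pred)))

for c, p, r in results:
    print(f"contamination={c:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Since the anomaly scores are fixed for a given fitted forest and `contamination` only moves the decision threshold, the flagged sets nest as contamination grows, so recall can only stay the same or increase while precision typically drops.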
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

# Create synthetic data: one dense cluster plus 20 uniform outliers
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
np.random.seed(42)  # seed so the outlier sample is reproducible
outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X_full = np.vstack([X, outliers])
y_true = np.zeros(X_full.shape[0], dtype=int)
y_true[-20:] = 1  # mark the last 20 points as true outliers

# Fit Isolation Forest with the expected outlier fraction (20 of 220 points)
model = IsolationForest(contamination=20 / 220, random_state=42)
y_pred = model.fit_predict(X_full)  # IsolationForest returns -1 for outlier, 1 for inlier
y_pred = (y_pred == -1).astype(int)

# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
contamination_rate = y_pred.sum() / len(y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(f"Contamination rate: {contamination_rate:.2f}")
```