Metrics for Outlier Detection
To evaluate outlier detection models, focus on precision, recall, F1-score, and contamination rate (each is made concrete in the sketch after this list):
- Precision: proportion of detected outliers that are truly outliers;
- Recall: proportion of true outliers your model finds;
- F1-score: harmonic mean of precision and recall, balancing their trade-off;
- Contamination rate: fraction of points labeled as outliers, often set in advance due to the rarity of true outliers.
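In formulas, the first three reduce to counts of true positives (TP), false positives (FP), and false negatives (FN). A minimal sketch with made-up counts (the TP/FP/FN values below are purely illustrative):

```python
# Hypothetical confusion-matrix counts, purely for illustration
tp, fp, fn = 15, 5, 5

precision = tp / (tp + fp)   # detected outliers that are truly outliers
recall = tp / (tp + fn)      # true outliers that were detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)  # 0.75 0.75 0.75
```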
If you label many points as outliers, recall increases but precision drops due to more false positives. Labeling fewer points increases precision but lowers recall. The F1-score shows the balance between these extremes. Contamination rate lets you control how many points your model flags, which is crucial when true outliers are rare.
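You can see this trade-off directly by sweeping the contamination parameter of an Isolation Forest on labeled toy data. The sketch below assumes the same kind of synthetic setup as the full example further down; the specific contamination values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# One dense cluster plus 20 uniformly scattered outliers (illustrative setup)
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
rng = np.random.default_rng(42)
X_full = np.vstack([X, rng.uniform(-6, 6, size=(20, 2))])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(20, dtype=int)]

# Higher contamination flags more points: recall tends up, precision tends down
for c in [0.02, 0.09, 0.25]:
    pred = IsolationForest(contamination=c, random_state=42).fit_predict(X_full)
    pred = (pred == -1).astype(int)  # map -1 (outlier) -> 1, 1 (inlier) -> 0
    print(f"contamination={c:.2f}  "
          f"precision={precision_score(y_true, pred):.2f}  "
          f"recall={recall_score(y_true, pred):.2f}")
```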
This trade-off is especially acute in rare-event detection such as outlier analysis: increasing recall often decreases precision, and vice versa. The right balance depends on your application. In fraud detection, missing a true fraud (low recall) may be worse than investigating a few false alarms (lower precision), while in medical diagnostics, too many false positives can overwhelm resources.
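When one error type matters more than the other, a weighted F-score makes that preference explicit. A minimal sketch using sklearn's `fbeta_score`, where beta > 1 favors recall and beta < 1 favors precision (the labels below are made up for illustration):

```python
from sklearn.metrics import fbeta_score

# Hypothetical labels: 1 = outlier/fraud, 0 = normal (made up for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # misses two outliers, one false alarm

# beta=2 weights recall more heavily (fraud-style setting);
# beta=0.5 weights precision more (resource-constrained screening)
print(fbeta_score(y_true, y_pred, beta=2))
print(fbeta_score(y_true, y_pred, beta=0.5))
```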
The full example below ties everything together: generate labeled synthetic data, fit an Isolation Forest, and report all four metrics.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

# Create synthetic data: one dense cluster plus 20 uniformly scattered outliers
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X_full = np.vstack([X, outliers])
y_true = np.zeros(X_full.shape[0], dtype=int)
y_true[-20:] = 1  # Mark last 20 points as true outliers

# Fit Isolation Forest; contamination=20/220 matches the true outlier fraction
model = IsolationForest(contamination=20 / 220, random_state=42)
y_pred = model.fit_predict(X_full)

# IsolationForest returns -1 for outliers and 1 for inliers; convert to 0/1
y_pred = (y_pred == -1).astype(int)

# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
contamination_rate = y_pred.sum() / len(y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(f"Contamination rate: {contamination_rate:.2f}")
```