Metrics for Outlier Detection
To evaluate outlier detection models, focus on precision, recall, F1-score, and contamination rate (each is made concrete in the sketch after this list):
- Precision: proportion of detected outliers that are truly outliers;
- Recall: proportion of true outliers your model finds;
- F1-score: harmonic mean of precision and recall, balancing their trade-off;
- Contamination rate: fraction of points labeled as outliers, often set in advance due to the rarity of true outliers.
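In formulas, the first three reduce to counts of true positives (TP), false positives (FP), and false negatives (FN). A minimal sketch with made-up counts (the TP/FP/FN values below are purely illustrative):

```python
# Hypothetical confusion-matrix counts, purely for illustration
tp, fp, fn = 15, 5, 5

precision = tp / (tp + fp)   # detected outliers that are truly outliers
recall = tp / (tp + fn)      # true outliers that were detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)  # 0.75 0.75 0.75
```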
If you label many points as outliers, recall increases but precision drops due to more false positives. Labeling fewer points increases precision but lowers recall. The F1-score shows the balance between these extremes. Contamination rate lets you control how many points your model flags, which is crucial when true outliers are rare.
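You can see this trade-off directly by sweeping the contamination parameter of an Isolation Forest on labeled toy data. The sketch below assumes the same kind of synthetic setup as the full example further down; the specific contamination values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# One dense cluster plus 20 uniformly scattered outliers (illustrative setup)
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
rng = np.random.default_rng(42)
X_full = np.vstack([X, rng.uniform(-6, 6, size=(20, 2))])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(20, dtype=int)]

# Higher contamination flags more points: recall tends up, precision tends down
for c in [0.02, 0.09, 0.25]:
    pred = IsolationForest(contamination=c, random_state=42).fit_predict(X_full)
    pred = (pred == -1).astype(int)  # map -1 (outlier) -> 1, 1 (inlier) -> 0
    print(f"contamination={c:.2f}  "
          f"precision={precision_score(y_true, pred):.2f}  "
          f"recall={recall_score(y_true, pred):.2f}")
```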
This trade-off is especially acute in rare-event detection such as outlier analysis: increasing recall often decreases precision, and vice versa. The right balance depends on your application. In fraud detection, missing a true fraud (low recall) may be worse than investigating a few false alarms (lower precision), while in medical diagnostics, too many false positives can overwhelm resources.
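When one error type matters more than the other, a weighted F-score makes that preference explicit. A minimal sketch using sklearn's `fbeta_score`, where beta > 1 favors recall and beta < 1 favors precision (the labels below are made up for illustration):

```python
from sklearn.metrics import fbeta_score

# Hypothetical labels: 1 = outlier/fraud, 0 = normal (made up for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # misses two outliers, one false alarm

# beta=2 weights recall more heavily (fraud-style setting);
# beta=0.5 weights precision more (resource-constrained screening)
print(fbeta_score(y_true, y_pred, beta=2))
print(fbeta_score(y_true, y_pred, beta=0.5))
```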
The full example below ties everything together: generate labeled synthetic data, fit an Isolation Forest, and report all four metrics.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

# Create synthetic data: one dense cluster plus 20 uniformly scattered outliers
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X_full = np.vstack([X, outliers])
y_true = np.zeros(X_full.shape[0], dtype=int)
y_true[-20:] = 1  # Mark last 20 points as true outliers

# Fit Isolation Forest; contamination=20/220 matches the true outlier fraction
model = IsolationForest(contamination=20 / 220, random_state=42)
y_pred = model.fit_predict(X_full)

# IsolationForest returns -1 for outliers and 1 for inliers; convert to 0/1
y_pred = (y_pred == -1).astype(int)

# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
contamination_rate = y_pred.sum() / len(y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(f"Contamination rate: {contamination_rate:.2f}")
```