Metrics for Outlier Detection
To evaluate outlier detection models, focus on precision, recall, F1-score, and contamination rate:
- Precision: proportion of detected outliers that are truly outliers;
- Recall: proportion of true outliers your model finds;
- F1-score: harmonic mean of precision and recall, balancing their trade-off;
- Contamination rate: fraction of points labeled as outliers, often set in advance due to the rarity of true outliers.
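All four metrics can be computed with `sklearn.metrics`. A minimal sketch on a hypothetical hand-made label vector (the labels below are invented for illustration, not from any model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for 10 points: 1 = outlier, 0 = inlier
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # 3 true outliers
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0])  # model flags 3 points

precision = precision_score(y_true, y_pred)  # 2 of the 3 flagged are true outliers
recall = recall_score(y_true, y_pred)        # 2 of the 3 true outliers were found
f1 = f1_score(y_true, y_pred)
contamination_rate = y_pred.mean()           # fraction of points flagged: 3/10

print(precision, recall, f1, contamination_rate)
```

Here precision, recall, and F1 all come out to 2/3, and the contamination rate is 0.30.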
If you label many points as outliers, recall increases but precision drops due to more false positives. Labeling fewer points increases precision but lowers recall. The F1-score shows the balance between these extremes. Contamination rate lets you control how many points your model flags, which is crucial when true outliers are rare.
There is always a trade-off between precision and recall, especially in rare event detection like outlier analysis. Increasing recall often decreases precision, and vice versa. Choosing the right balance depends on your application: in fraud detection, missing a true fraud (low recall) may be worse than investigating a few false alarms (lower precision), while in medical diagnostics, too many false positives can overwhelm resources.
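One way to see this trade-off directly is to sweep the `contamination` parameter of `IsolationForest`: a higher value flags more points, which tends to raise recall at the cost of precision. A sketch using the same synthetic setup as the full example below (the specific contamination values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# One dense cluster plus 20 uniform outliers, seeded for reproducibility
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
outliers = np.random.RandomState(42).uniform(low=-6, high=6, size=(20, 2))
X_full = np.vstack([X, outliers])
y_true = np.zeros(len(X_full), dtype=int)
y_true[-20:] = 1  # last 20 points are the true outliers

results = []
for contamination in [0.05, 0.09, 0.20]:
    model = IsolationForest(contamination=contamination, random_state=42)
    y_pred = (model.fit_predict(X_full) == -1).astype(int)
    results.append((contamination,
                    precision_score(y_true, y_pred),
                    recall_score(y_true, y_pred)))

for c, p, r in results:
    print(f"contamination={c:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Since the anomaly scores are fixed for a given fitted forest and `contamination` only moves the decision threshold, the flagged sets nest as contamination grows, so recall can only stay the same or increase while precision typically drops.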
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

# Create synthetic data: one dense cluster plus 20 uniform outliers
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.60, random_state=42)
np.random.seed(42)  # seed so the outlier sample is reproducible
outliers = np.random.uniform(low=-6, high=6, size=(20, 2))
X_full = np.vstack([X, outliers])
y_true = np.zeros(X_full.shape[0], dtype=int)
y_true[-20:] = 1  # mark the last 20 points as true outliers

# Fit Isolation Forest with the expected outlier fraction (20 of 220 points)
model = IsolationForest(contamination=20 / 220, random_state=42)
y_pred = model.fit_predict(X_full)  # IsolationForest returns -1 for outlier, 1 for inlier
y_pred = (y_pred == -1).astype(int)

# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
contamination_rate = y_pred.sum() / len(y_pred)

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print(f"Contamination rate: {contamination_rate:.2f}")
```