Evaluation Metrics in Machine Learning

Clustering Metrics

Evaluating clustering algorithms presents unique challenges because, unlike supervised learning, there are no ground truth labels to compare predictions against. This makes it essential to use specialized clustering evaluation metrics that assess how well the algorithm has grouped the data. These metrics help you understand whether clusters are compact, well-separated, and meaningful for your application, even in the absence of labeled data.

Several key metrics are commonly used to evaluate clustering quality. Each provides a different perspective on the clustering structure:

Inertia (Within-Cluster Sum of Squares):

\text{Inertia} = \sum_{i=1}^{n} \min_{\mu_j \in C} \| x_i - \mu_j \|^2

where x_i is a data point, μ_j is the centroid of cluster j, and C is the set of all centroids.
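To make the formula concrete, inertia can be recomputed by hand and checked against scikit-learn's `kmeans.inertia_` attribute. This is a minimal sketch; the two-blob dataset is an illustrative assumption, not data from the lesson:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative synthetic dataset: two well-separated blobs
X, _ = make_blobs(n_samples=100, centers=2, random_state=0)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_

# Apply the formula directly: for each point, take the squared
# distance to its NEAREST centroid, then sum over all points
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # shape (n, k)
manual_inertia = (dists.min(axis=1) ** 2).sum()

print(manual_inertia)
print(kmeans.inertia_)  # agrees up to floating-point rounding
```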

Silhouette Score:

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

where a(i) is the mean distance between x_i and all other points in the same cluster, and b(i) is the lowest mean distance between x_i and all points in any other cluster.
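The definitions of a(i) and b(i) can be traced step by step on a tiny hand-made dataset and compared with scikit-learn's per-point `silhouette_samples`. The six points and their labels below are an illustrative assumption:

```python
import numpy as np
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import euclidean_distances

# Two obvious clusters of three points each (illustrative data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

D = euclidean_distances(X)
i = 0  # evaluate the silhouette of the first point

same = labels == labels[i]
# a(i): mean distance to the OTHER points in the same cluster
a = D[i, same & (np.arange(len(X)) != i)].mean()
# b(i): lowest mean distance to the points of any other cluster
b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])

s = (b - a) / max(a, b)
print(s)
print(silhouette_samples(X, labels)[i])  # sklearn gives the same value
```

Because the point sits far from the other cluster, s(i) comes out close to 1.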

Davies–Bouldin Index:

\text{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} \left( \frac{s_i + s_j}{d_{ij}} \right)

where s_i is the average distance of all points in cluster i to its centroid, and d_{ij} is the distance between centroids i and j.
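The index can likewise be rebuilt from the formula and checked against `davies_bouldin_score`. A sketch on assumed synthetic data with three blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Illustrative data: three well-separated blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

k = labels.max() + 1
centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
# s_i: average distance of cluster i's points to its centroid
s = np.array([np.linalg.norm(X[labels == i] - centroids[i], axis=1).mean()
              for i in range(k)])

# For each cluster, take the worst (largest) similarity ratio to any other cluster
db_terms = []
for i in range(k):
    ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
              for j in range(k) if j != i]
    db_terms.append(max(ratios))
manual_db = np.mean(db_terms)

print(manual_db)
print(davies_bouldin_score(X, labels))  # agrees
```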

Calinski–Harabasz Index:

\text{CH} = \frac{\text{Tr}(B_k)}{\text{Tr}(W_k)} \cdot \frac{n - k}{k - 1}

where Tr(B_k) is the between-cluster dispersion, Tr(W_k) is the within-cluster dispersion, n is the number of samples, and k is the number of clusters.
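Here Tr(W_k) is the sum of squared distances of points to their own cluster centroid, and Tr(B_k) weights each centroid's squared distance from the overall mean by the cluster size. A sketch on assumed synthetic data, checked against `calinski_harabasz_score`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Illustrative data: three blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

n, k = len(X), labels.max() + 1
overall_mean = X.mean(axis=0)

tr_B = tr_W = 0.0
for i in range(k):
    cluster = X[labels == i]
    centroid = cluster.mean(axis=0)
    # Between-cluster dispersion: size-weighted squared distance to the overall mean
    tr_B += len(cluster) * np.sum((centroid - overall_mean) ** 2)
    # Within-cluster dispersion: squared distances to the cluster's own centroid
    tr_W += np.sum((cluster - centroid) ** 2)

manual_ch = (tr_B / tr_W) * (n - k) / (k - 1)
print(manual_ch)
print(calinski_harabasz_score(X, labels))  # agrees
```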

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit_predict(X)

# Compute Inertia
inertia = kmeans.inertia_

# Compute Silhouette Score
silhouette = silhouette_score(X, labels)

# Compute Davies–Bouldin Index
db_index = davies_bouldin_score(X, labels)

# Compute Calinski–Harabasz Index
ch_index = calinski_harabasz_score(X, labels)

print(f"Inertia: {inertia:.2f}")
print(f"Silhouette Score: {silhouette:.2f}")
print(f"Davies–Bouldin Index: {db_index:.2f}")
print(f"Calinski–Harabasz Index: {ch_index:.2f}")

Interpreting clustering metrics helps you assess clustering quality and select the best solution for your data:

  • Inertia: lower values mean more compact clusters. Use inertia to compare models with the same number of clusters or to find the "elbow point" where adding more clusters gives little improvement; do not rely on it alone, as it always decreases with more clusters.

  • Silhouette Score: ranges from -1 to 1, with higher values indicating better-separated clusters. Use it to compare different clustering solutions or pick the optimal number of clusters, but note it may be biased if clusters differ greatly in size or density.

  • Davies–Bouldin Index: lower values reflect better clustering with well-separated, compact clusters. It provides a quick summary but can be sensitive to varying cluster shapes and densities.

  • Calinski–Harabasz Index: higher values show better separation between clusters. It works best when clusters are similar in size and shape, but may not perform well with irregular or overlapping clusters.

Use multiple metrics together and visualize your clusters to get a complete, reliable evaluation.
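Since no single metric is decisive, a common workflow is to sweep over candidate values of k and compare scores. A sketch using the silhouette score on the same kind of synthetic blob data as above (the blob parameters are illustrative); with well-separated blobs, the silhouette typically peaks at the true number of clusters:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: four well-separated blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Score each candidate k and keep the best
results = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={results[k]:.3f}")

best_k = max(results, key=results.get)
print(f"Best k by silhouette: {best_k}")
```

The same loop can report the Davies–Bouldin and Calinski–Harabasz scores alongside the silhouette, so disagreements between metrics become visible at a glance.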
