Clustering Metrics
Evaluating clustering algorithms presents unique challenges because, unlike supervised learning, there are no ground truth labels to compare predictions against. This makes it essential to use specialized clustering evaluation metrics that assess how well the algorithm has grouped the data. These metrics help you understand whether clusters are compact, well-separated, and meaningful for your application, even in the absence of labeled data.
Several key metrics are commonly used to evaluate clustering quality. Each provides a different perspective on the clustering structure:
Inertia (Within-Cluster Sum of Squares):
$$\text{Inertia} = \sum_{i=1}^{n} \min_{\mu_j \in C} \lVert x_i - \mu_j \rVert^2$$
where $x_i$ is a data point, $\mu_j$ is the centroid of cluster $j$, and $C$ is the set of all centroids.
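To connect the formula to scikit-learn's output, the sketch below recomputes inertia by hand from the fitted centroids: each point contributes its squared distance to its nearest centroid. The dataset size, number of clusters, and `n_init` value are arbitrary choices made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Small synthetic dataset (assumed setup for illustration)
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# Squared distance from every point to every centroid, then keep the nearest one
distances_sq = ((X[:, None, :] - kmeans.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = distances_sq.min(axis=1).sum()

print(manual_inertia)     # should match kmeans.inertia_ up to floating-point error
print(kmeans.inertia_)
```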
Silhouette Score:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
where $a(i)$ is the mean distance between $x_i$ and all other points in the same cluster, and $b(i)$ is the lowest mean distance between $x_i$ and all points in any other cluster.
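To make $a(i)$ and $b(i)$ concrete, this sketch recomputes the silhouette value of a single point directly from the definition and compares it with scikit-learn's `silhouette_samples`. The dataset and the choice of point index are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

i = 0                     # arbitrary point chosen for illustration
own = labels[i]
dists = np.linalg.norm(X - X[i], axis=1)

# a(i): mean distance to the other points in the same cluster
a_i = dists[(labels == own) & (np.arange(len(X)) != i)].mean()
# b(i): smallest mean distance to the points of any other cluster
b_i = min(dists[labels == c].mean() for c in set(labels) if c != own)

s_i = (b_i - a_i) / max(a_i, b_i)
print(s_i, silhouette_samples(X, labels)[i])   # the two values should agree closely
```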
Davies–Bouldin Index:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right)$$
where $s_i$ is the average distance of all points in cluster $i$ to its centroid, and $d_{ij}$ is the distance between centroids $i$ and $j$.
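As a sanity check on the definition, the sketch below recomputes the Davies–Bouldin index from the fitted KMeans centroids and compares it with `davies_bouldin_score`. The synthetic data is an arbitrary choice, and because the manual version reuses the KMeans centroids rather than recomputing cluster means, the two numbers should agree closely but may differ by a tiny amount.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
labels, centroids = kmeans.labels_, kmeans.cluster_centers_
k = len(centroids)

# s_i: average distance of the points in cluster i to its centroid
s = np.array([np.linalg.norm(X[labels == i] - centroids[i], axis=1).mean() for i in range(k)])
# d_ij: distance between centroids i and j
d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)

db = np.mean([max((s[i] + s[j]) / d[i, j] for j in range(k) if j != i) for i in range(k)])
print(db, davies_bouldin_score(X, labels))   # should agree closely
```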
Calinski–Harabasz Index:
$$CH = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \cdot \frac{n - k}{k - 1}$$
where $\mathrm{Tr}(B_k)$ is the between-cluster dispersion, $\mathrm{Tr}(W_k)$ is the within-cluster dispersion, $n$ is the number of samples, and $k$ is the number of clusters.
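The dispersion traces can be written out directly: $\mathrm{Tr}(B_k)$ weights each cluster's squared distance from the overall mean by its size, and $\mathrm{Tr}(W_k)$ sums squared distances of points to their own cluster mean. The sketch below is a minimal illustration on assumed synthetic data and should match `calinski_harabasz_score` up to floating-point error. The full example that follows then computes all four metrics with scikit-learn's built-in functions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)
n, k = len(X), len(set(labels))
overall_mean = X.mean(axis=0)

# Tr(B_k): between-cluster dispersion, weighted by cluster size
tr_B = sum((labels == c).sum() * np.linalg.norm(X[labels == c].mean(axis=0) - overall_mean) ** 2
           for c in range(k))
# Tr(W_k): within-cluster dispersion around each cluster's own mean
tr_W = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum() for c in range(k))

ch = (tr_B / tr_W) * ((n - k) / (k - 1))
print(ch, calinski_harabasz_score(X, labels))   # should match
```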
```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Fit KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit_predict(X)

# Compute Inertia
inertia = kmeans.inertia_

# Compute Silhouette Score
silhouette = silhouette_score(X, labels)

# Compute Davies–Bouldin Index
db_index = davies_bouldin_score(X, labels)

# Compute Calinski–Harabasz Index
ch_index = calinski_harabasz_score(X, labels)

print(f"Inertia: {inertia:.2f}")
print(f"Silhouette Score: {silhouette:.2f}")
print(f"Davies–Bouldin Index: {db_index:.2f}")
print(f"Calinski–Harabasz Index: {ch_index:.2f}")
```
Interpreting clustering metrics helps you assess clustering quality and select the best solution for your data:
- Inertia: lower values mean more compact clusters. Use inertia to compare models with the same number of clusters or to find the "elbow point" where adding more clusters gives little improvement; do not rely on it alone, as it always decreases as the number of clusters grows (see the sweep sketched after this list).
- Silhouette Score: ranges from -1 to 1, with higher values indicating better-separated clusters. Use it to compare different clustering solutions or to pick the optimal number of clusters, but note it may be biased if clusters differ greatly in size or density.
- Davies–Bouldin Index: lower values reflect better clustering with compact, well-separated clusters. It provides a quick summary but can be sensitive to varying cluster shapes and densities.
- Calinski–Harabasz Index: higher values indicate better separation between clusters. It works best when clusters are similar in size and shape, but may not perform well with irregular or overlapping clusters.
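As a practical illustration of the elbow and silhouette heuristics mentioned above, the sketch below sweeps the number of clusters for KMeans on the same synthetic data used earlier; the range of k values is an arbitrary choice. Inertia keeps decreasing as k grows, while the silhouette score typically peaks near the true number of clusters (here 4).

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Sweep candidate numbers of clusters and record both metrics
for k in range(2, 9):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
    labels = kmeans.fit_predict(X)
    print(f"k={k}: inertia={kmeans.inertia_:.1f}, silhouette={silhouette_score(X, labels):.3f}")
```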
Use multiple metrics together and visualize your clusters to get a complete, reliable evaluation.
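A simple scatter plot of the cluster assignments often reveals issues, such as overlapping or stretched clusters, that summary numbers hide. The sketch below assumes two-dimensional data and matplotlib, coloring each point by its KMeans label and marking the centroids.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)

# Color points by cluster label and mark the centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="centroids")
plt.legend()
plt.title("KMeans clusters on synthetic data")
plt.show()
```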