Cluster Analysis
Internal Evaluation
As in any machine learning task, we rely on specific metrics to evaluate the quality of clustering: one such class of metrics is internal metrics.
Internal metrics in clustering are used to evaluate the quality of clustering results based on the data without using any external information or labels. These metrics provide a quantitative measure of how well a clustering algorithm has grouped the data points into clusters based on the intrinsic characteristics of the data: especially intra- and inter-cluster distances.
Most commonly used internal metrics
Silhouette score measures how well a data point fits into its assigned cluster compared to other clusters.
The silhouette score is calculated as follows:
Step 1. For each data point, calculate two metrics:
- a: The average distance between the data point and all other points in the same cluster;
- b: The lowest average distance between the data point and all points in any single other cluster (i.e., the distance to the nearest neighboring cluster).
Step 2. Calculate the silhouette score for each data point using the following formula:
silhouette score = (b - a) / max(a, b)
Step 3. Calculate the overall silhouette score for the clustering by taking the average of all the scores for all points.
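The three steps above can be sketched directly with NumPy and checked against scikit-learn's silhouette_score. This is a minimal illustration; the blobs dataset and K-means settings are arbitrary choices for the demo:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

# Toy data with two clear clusters
X, _ = make_blobs(n_samples=100, centers=2, random_state=42)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

D = pairwise_distances(X)  # all pairwise Euclidean distances
scores = []
for i in range(len(X)):
    same = (labels == labels[i])
    same[i] = False                       # exclude the point itself
    a = D[i, same].mean()                 # mean distance within own cluster
    # mean distance to each other cluster; b is the smallest of these
    b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])
    scores.append((b - a) / max(a, b))

manual = np.mean(scores)
print(np.isclose(manual, silhouette_score(X, labels)))  # True
```

The manual average matches the library's result, which confirms that silhouette_score is just the mean of the per-point formula.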
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import warnings
warnings.filterwarnings('ignore')

# Create subplots for visualizations
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(15, 5)  # Adjusted figure size

# Load circles dataset
X_circles, y = make_circles(n_samples=500, factor=0.2)
# Provide K-means clustering for circles dataset
clustering_circles = KMeans(n_clusters=2).fit(X_circles)
# Provide visualization and show silhouette for circles dataset
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering_circles.labels_, cmap='tab20b')
axes[0].set_title('Silhouette is: ' + str(round(silhouette_score(X_circles, clustering_circles.labels_), 3)))

# Load blobs dataset
X_blobs, y = make_blobs(n_samples=500, centers=2)
# Provide K-means clustering for blobs dataset
clustering_blobs = KMeans(n_clusters=2).fit(X_blobs)
# Provide visualization and show silhouette for blobs dataset
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering_blobs.labels_, cmap='tab20b')
axes[1].set_title('Silhouette is: ' + str(round(silhouette_score(X_blobs, clustering_blobs.labels_), 3)))

# Load moons dataset
X_moons, y = make_moons(n_samples=500)
# Provide K-means clustering for moons dataset
clustering_moons = KMeans(n_clusters=2).fit(X_moons)
# Provide visualization and show silhouette for moons dataset
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering_moons.labels_, cmap='tab20b')
axes[2].set_title('Silhouette is: ' + str(round(silhouette_score(X_moons, clustering_moons.labels_), 3)))

# Display the plots
plt.show()
The silhouette coefficient ranges from -1 to 1:
- values near -1 indicate that the data point was likely assigned to the wrong cluster;
- values near 0 indicate that the clusters are overlapping;
- values near 1 indicate that the clusters are dense and well-separated (the desirable outcome).
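scikit-learn also exposes the per-point coefficients behind the averaged score via silhouette_samples. A small sketch (the dataset and K-means settings are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

per_point = silhouette_samples(X, labels)   # one coefficient per data point
print(per_point.min(), per_point.max())     # each value lies in [-1, 1]
# The overall score is just the mean of the per-point values
print(np.isclose(per_point.mean(), silhouette_score(X, labels)))  # True
```

Inspecting the per-point values is useful for spotting individual poorly assigned points that an averaged score would hide.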
The Davies-Bouldin Index (DBI) is an internal clustering evaluation metric that measures the quality of clustering by considering both the separation between clusters and the compactness of clusters.
The DBI is calculated as follows:
Step 1. For each cluster, calculate the average distance between its centroid and all the data points in that cluster. This measures the cluster's scatter (compactness);
Step 2. For each pair of clusters, calculate the distance between their centroids. This measures the separation between clusters;
Step 3. For each pair of clusters, divide the sum of their scatters by the distance between their centroids. This similarity ratio is high when two clusters are spread out and close together;
Step 4. For each cluster, take the highest similarity ratio it has with any other cluster, i.e., the ratio with its worst-separated neighbor;
Step 5. Finally, calculate the overall DBI by averaging these worst-case ratios over all the clusters.
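These steps can be written out by hand and checked against scikit-learn's davies_bouldin_score. A minimal sketch, assuming an arbitrary three-cluster blobs dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
k = 3

# Centroid of each cluster = mean of its points
centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
# s[i]: mean distance from points of cluster i to its centroid (scatter)
s = np.array([np.linalg.norm(X[labels == i] - centroids[i], axis=1).mean()
              for i in range(k)])

worst = []
for i in range(k):
    # similarity ratio of cluster i with every other cluster j
    ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
              for j in range(k) if j != i]
    worst.append(max(ratios))   # worst-separated neighbor

manual = np.mean(worst)
print(np.isclose(manual, davies_bouldin_score(X, labels)))  # True
```

The agreement confirms that the DBI is the average, over clusters, of each cluster's worst scatter-to-separation ratio.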
import matplotlib.pyplot as plt
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import warnings
warnings.filterwarnings('ignore')

fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

X_circles, y = make_circles(n_samples=500, factor=0.2)
clustering = KMeans(n_clusters=2).fit(X_circles)
# Provide visualization and show DBI for circles dataset
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('DBI is: ' + str(round(davies_bouldin_score(X_circles, clustering.labels_), 3)))

X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
# Provide visualization and show DBI for blobs dataset
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('DBI is: ' + str(round(davies_bouldin_score(X_blobs, clustering.labels_), 3)))

X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
# Provide visualization and show DBI for moons dataset
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('DBI is: ' + str(round(davies_bouldin_score(X_moons, clustering.labels_), 3)))

# Display the plots
plt.show()
A lower DBI value indicates better clustering performance: the clusters are compact and well-separated.
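Because these metrics need no labels, a common use is picking the number of clusters: fit the model for several candidate values of k and compare the scores. A sketch of this idea (the dataset and the candidate range are arbitrary choices for the demo):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

results = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    results[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))
    print(f'k={k}  silhouette={results[k][0]:.3f}  DBI={results[k][1]:.3f}')

best_sil = max(results, key=lambda k: results[k][0])   # higher is better
best_dbi = min(results, key=lambda k: results[k][1])   # lower is better
print(best_sil, best_dbi)  # on well-separated blobs both typically recover k=4
```

Note that the two metrics optimize in opposite directions: the silhouette score is maximized, while the DBI is minimized.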