Internal Evaluation

Cluster Analysis

As in any machine learning task, we rely on specific metrics to evaluate the quality of clustering. One class of such metrics is internal metrics.

Internal metrics evaluate the quality of clustering results using only the data itself, without any external information or ground-truth labels. They provide a quantitative measure of how well a clustering algorithm has grouped the data points based on the intrinsic characteristics of the data, most notably intra-cluster and inter-cluster distances.

Most commonly used internal metrics

The silhouette score measures how well a data point fits into its assigned cluster compared to other clusters.

The silhouette score is calculated as follows:

Step 1. For each data point, calculate two metrics:

  • a: The average distance between the data point and all other points in the same cluster.
  • b: The average distance between the data point and all points in the nearest other cluster (i.e., the cluster the point does not belong to whose points are closest to it on average).

Step 2. Calculate the silhouette score for each data point using the following formula:

silhouette score = (b - a) / max(a, b)

Step 3. Calculate the overall silhouette score for the clustering by averaging the scores of all data points.
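
The three steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster gets a score of 0. The function name `silhouette_scores` and the sample points are made up for this example.

```python
from math import dist
from statistics import mean


def silhouette_scores(points, labels):
    """Return one silhouette value per point (assumes Euclidean distance)."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-point clusters
            continue
        # a: mean distance to the OTHER points of the same cluster
        # (the point's distance to itself is 0, so it only affects the divisor).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            mean(dist(point, q) for q in other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical data: two tight pairs of points that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
per_point = silhouette_scores(points, labels)
print(mean(per_point))  # overall score, roughly 0.90 for this data
```

In practice a library routine is usually preferable; scikit-learn, for instance, exposes `sklearn.metrics.silhouette_score` for the overall value.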

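The silhouette coefficient ranges from -1 to 1:

  • a value near -1 indicates that the data point is probably assigned to the wrong cluster (it is closer, on average, to a neighboring cluster than to its own);
  • a value near 0 indicates that the data point lies on the boundary between two clusters, meaning the clusters overlap;
  • a value near 1 indicates that the data point is close to its own cluster and far from neighboring ones, so the cluster is dense and well-separated (the desirable value).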
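
To see how the formula produces these three cases, here is a small sketch that plugs a few hand-picked values of a and b into it. The helper name `silhouette_value` and the numbers are hypothetical, chosen only for illustration.

```python
def silhouette_value(a, b):
    """Silhouette value of one point from its a and b distances."""
    return (b - a) / max(a, b)


# (a, b) pairs: well placed, on the boundary, probably misassigned.
for a, b in [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]:
    print(f"a={a}, b={b} -> {silhouette_value(a, b)}")
# a=1.0, b=4.0 ->  0.75  (own cluster is much closer than the nearest other one)
# a=2.0, b=2.0 ->  0.0   (both clusters are equally close)
# a=4.0, b=1.0 -> -0.75  (another cluster is much closer than the point's own)
```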

The Davies-Bouldin Index (DBI) is an internal clustering evaluation metric that measures the quality of clustering by considering both the separation between clusters and the compactness of clusters.

The DBI is calculated as follows:

Step 1. For each cluster, calculate the average distance between its centroid and all the data points in that cluster. This measures the cluster's scatter: the smaller the value, the more compact the cluster.

Step 2. For each pair of clusters, calculate the distance between their centroids. This measures the separation between the two clusters.

Step 3. For each pair of clusters, calculate their similarity: the sum of the two clusters' scatter values from Step 1 divided by the distance between their centroids from Step 2. Two clusters are similar (poorly distinguished) when they are spread out and close together.

Step 4. For each cluster, find the greatest similarity between it and any other cluster. This gives you the DBI score for that cluster.

Step 5. Finally, calculate the overall DBI score by averaging the DBI scores of all the clusters.

A lower DBI value indicates better clustering: the clusters are compact and well-separated. The smallest possible value is 0.
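
As a rough sketch of the DBI computation, the following plain-Python function uses the standard definition of the index: the similarity of two clusters is the sum of their average point-to-centroid distances divided by the distance between their centroids, and each cluster contributes its largest similarity to any other cluster. It assumes Euclidean distance, at least two clusters, and distinct centroids. The names `centroid` and `davies_bouldin` and the sample data are made up for this example.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(points, labels):
    """Davies-Bouldin index (assumes Euclidean distance, distinct centroids)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    centroids = {k: centroid(v) for k, v in clusters.items()}
    # Scatter: average distance from each cluster's points to its centroid.
    scatter = {
        k: sum(dist(p, centroids[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }

    per_cluster = []
    for i in clusters:
        # Similarity to every other cluster; keep the largest (worst) one.
        worst = max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        per_cluster.append(worst)
    # Overall index: average of the per-cluster scores.
    return sum(per_cluster) / len(per_cluster)


# Hypothetical data: two tight pairs of points that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(davies_bouldin(points, [0, 0, 1, 1]))  # about 0.1: compact, well-separated
print(davies_bouldin(points, [0, 1, 0, 1]))  # about 10: clusters overlap heavily
```

For real workloads, scikit-learn offers `sklearn.metrics.davies_bouldin_score`, which is likely a better choice than a hand-written loop.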

For which of the metrics is a zero value a sign of good clustering quality?
