Cluster Analysis
How to choose the best model?

External Evaluation

External evaluation assesses the performance of a clustering algorithm by comparing its results to a known set of class labels, also called the ground truth. In other words, the clusters produced by the algorithm are compared to pre-existing labels created by experts or derived from domain knowledge.
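Note that the cluster labels produced by an algorithm are arbitrary identifiers, so comparing them to the true labels element by element can be misleading. Below is a minimal sketch with made-up labels illustrating why dedicated external metrics are used instead of plain accuracy:

import numpy as np

# Ground-truth classes and a clustering that recovered the groups perfectly,
# but happened to assign the opposite label names
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 0])

# Naive element-wise accuracy is 0.0, even though the grouping is identical;
# the external metrics below compare the grouping structure instead
print(np.mean(y_true == y_pred))  # 0.0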

Most commonly used external metrics

The Rand Index (RI) measures the similarity between two clusterings, or partitions, and is often used as an external evaluation metric in clustering. It is the fraction of pairs of data points that the two clusterings treat consistently, i.e. pairs assigned to the same cluster in both the predicted and true clusterings, or to different clusters in both, out of the total number of data point pairs.

The Rand Index is calculated as follows:

  • Let n be the total number of data points;
  • Let a be the number of pairs of data points assigned to the same cluster in both the predicted and true clusterings;
  • Let b be the number of pairs of data points assigned to different clusters in both the predicted and true clusterings.

The Rand Index is then given by RI = 2 * (a + b) / (n * (n - 1)), that is, (a + b) divided by the total number of data point pairs.
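As a quick illustrative check (the labelings below are made up), a and b can be counted directly over all pairs and compared against scikit-learn's rand_score:

from itertools import combinations
from sklearn.metrics import rand_score

# Toy labelings: 4 points, so n * (n - 1) / 2 = 6 pairs in total
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 1]

a = 0  # pairs in the same cluster in both labelings
b = 0  # pairs in different clusters in both labelings
for i, j in combinations(range(len(y_true)), 2):
    same_true = y_true[i] == y_true[j]
    same_pred = y_pred[i] == y_pred[j]
    if same_true and same_pred:
        a += 1
    elif not same_true and not same_pred:
        b += 1

n = len(y_true)
print(2 * (a + b) / (n * (n - 1)))  # 0.5, manual computation
print(rand_score(y_true, y_pred))   # 0.5, matches the formula above

The example below then applies K-means to three synthetic datasets (circles, blobs, and moons) and reports the Rand Index for each.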

from sklearn.metrics import rand_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Create subplots for visualization
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

# Create circles dataset
X_circles, y = make_circles(n_samples=500, factor=0.2)

# Apply K-means clustering
clustering = KMeans(n_clusters=2).fit(X_circles)
predicted_circles = clustering.predict(X_circles)

# Visualize and show RI for the circles dataset
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('RI is: ' + str(round(rand_score(y, predicted_circles), 3)))

# Create and cluster the blobs dataset
X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
predicted_blobs = clustering.predict(X_blobs)

# Visualize and show RI for the blobs dataset
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('RI is: ' + str(round(rand_score(y, predicted_blobs), 3)))

# Create and cluster the moons dataset
X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
predicted_moons = clustering.predict(X_moons)

# Visualize and show RI for the moons dataset
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('RI is: ' + str(round(rand_score(y, predicted_moons), 3)))

# Display the figure
plt.show()

The Rand Index can vary between 0 and 1, where 0 indicates that the two clusterings are completely different, and 1 indicates that the two clusterings are identical.
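Note that the comparison is between partitions rather than between the label values themselves: renaming the clusters does not change the Rand Index. A quick check with made-up labels:

from sklearn.metrics import rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_renamed = [2, 2, 0, 0, 1, 1]  # same grouping, different label names

print(rand_score(y_true, y_renamed))  # 1.0: the partitions are identical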


Mutual Information (MI) measures the amount of information shared by the predicted and true clusterings and is based on the concept of entropy. We will not cover how this metric is calculated, as that is outside the scope of this beginner-level course.

from sklearn.metrics import mutual_info_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Create subplots for visualization
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

# Create and cluster the circles dataset
X_circles, y = make_circles(n_samples=500, factor=0.2)
clustering = KMeans(n_clusters=2).fit(X_circles)
predicted_circles = clustering.predict(X_circles)

# Visualize and show MI for the circles dataset
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('MI is: ' + str(round(mutual_info_score(y, predicted_circles), 3)))

# Create and cluster the blobs dataset
X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
predicted_blobs = clustering.predict(X_blobs)

# Visualize and show MI for the blobs dataset
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('MI is: ' + str(round(mutual_info_score(y, predicted_blobs), 3)))

# Create and cluster the moons dataset
X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
predicted_moons = clustering.predict(X_moons)

# Visualize and show MI for the moons dataset
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('MI is: ' + str(round(mutual_info_score(y, predicted_moons), 3)))

# Display the figure
plt.show()

The Mutual Information score is non-negative: 0 indicates that the predicted clustering shares no information with the true clustering, while larger values indicate stronger agreement. Its normalized variant, normalized mutual information, is scaled to the range from 0 to 1, where 1 means the clusterings are identical. Furthermore, based on the examples above, this metric detects bad clustering considerably better than the Rand Index.
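As a small made-up illustration (assuming scikit-learn is available), the raw mutual information score can be contrasted with its normalized variant, normalized_mutual_info_score:

from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_good = [1, 1, 1, 1, 0, 0, 0, 0]  # perfect grouping with swapped label names
y_bad = [0, 0, 1, 1, 0, 0, 1, 1]   # grouping unrelated to the true classes

print(mutual_info_score(y_true, y_good))             # ~0.693 (ln 2), the maximum for this example
print(normalized_mutual_info_score(y_true, y_good))  # 1.0
print(normalized_mutual_info_score(y_true, y_bad))   # 0.0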


Homogeneity measures the degree to which each cluster contains only data points that belong to a single class or category; it is based on conditional entropy. As with mutual information, we will not cover how this metric is calculated.

from sklearn.metrics import homogeneity_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Create subplots for visualization
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

# Create and cluster the circles dataset
X_circles, y = make_circles(n_samples=500, factor=0.2)
clustering = KMeans(n_clusters=2).fit(X_circles)
predicted_circles = clustering.predict(X_circles)

# Visualize and show homogeneity for the circles dataset
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('Homogeneity is: ' + str(round(homogeneity_score(y, predicted_circles), 3)))

# Create and cluster the blobs dataset
X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
predicted_blobs = clustering.predict(X_blobs)

# Visualize and show homogeneity for the blobs dataset
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('Homogeneity is: ' + str(round(homogeneity_score(y, predicted_blobs), 3)))

# Create and cluster the moons dataset
X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
predicted_moons = clustering.predict(X_moons)

# Visualize and show homogeneity for the moons dataset
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('Homogeneity is: ' + str(round(homogeneity_score(y, predicted_moons), 3)))

# Display the figure
plt.show()

A clustering solution is considered highly homogeneous if each cluster contains only data points that belong to a single true class or category.
In other words, homogeneity measures the extent to which the clusters produced by an algorithm are "pure" with respect to the true classes. The homogeneity score ranges from 0 to 1, with 1 indicating perfect homogeneity.
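As a small made-up illustration, splitting a true class across several perfectly "pure" clusters still yields a homogeneity of 1, while merging different classes into one cluster drives it to 0:

from sklearn.metrics import homogeneity_score

y_true = [0, 0, 1, 1]

print(homogeneity_score(y_true, [0, 1, 2, 3]))  # 1.0: each cluster contains points of a single class
print(homogeneity_score(y_true, [0, 0, 0, 0]))  # 0.0: the single cluster mixes both classes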

Of the metrics considered here, homogeneity works best: as the example above shows, it identifies both good and bad clustering equally well.

Can we use external evaluation metrics if we have no information about the real partitioning of data into clusters?

Select the correct answer


Section 3. Chapter 2