Internal Evaluation

As in any machine learning task, we rely on specific metrics to evaluate the quality of clustering. One class of such metrics is internal metrics.

Internal metrics evaluate clustering results using only the data itself, without any external information or ground-truth labels. They provide a quantitative measure of how well a clustering algorithm has grouped the data points based on the intrinsic characteristics of the data, chiefly intra-cluster and inter-cluster distances.

Most commonly used internal metrics

The silhouette score measures how well each data point fits into its assigned cluster compared with the other clusters.

The silhouette score is calculated as follows:

Step 1. For each data point, calculate two metrics:

a: The average distance between the data point and all other points in the same cluster;
b: The average distance between the data point and all points in the nearest cluster (i.e., the other cluster for which this average distance is smallest).

Step 2. Calculate the silhouette score for each data point using the following formula:

silhouette score = (b - a) / max(a, b)

Step 3. Calculate the overall silhouette score for the clustering by taking the average of all the scores for all points.
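
The three steps above can be sketched directly with NumPy and checked against scikit-learn. This is a minimal illustration, not the library's implementation: the helper name manual_silhouette is hypothetical, Euclidean distance is assumed, and a point that is alone in its cluster is given a score of 0 (the convention scikit-learn follows).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score


def manual_silhouette(X, labels):
    """Average silhouette over all points (hypothetical helper, Euclidean distance)."""
    distances = pairwise_distances(X)  # full n x n distance matrix
    unique_labels = np.unique(labels)
    scores = np.zeros(len(X))

    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # Assumed convention: a point alone in its cluster scores 0
            continue
        # a: mean distance to the other points of the same cluster
        # (the distance of the point to itself is 0, so divide by n_same - 1)
        a = distances[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(distances[i, labels == other].mean()
                for other in unique_labels if other != labels[i])
        scores[i] = (b - a) / max(a, b)

    # Step 3: average the per-point scores
    return scores.mean()


X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print('manual :', round(manual_silhouette(X, labels), 6))
print('sklearn:', round(silhouette_score(X, labels), 6))
```

The two printed values should agree up to floating-point error. The explicit loop is kept for readability; for large datasets silhouette_score is the better choice.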


import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles 
import warnings

warnings.filterwarnings('ignore')

# Create subplots for visualizations
fig, axes = plt.subplots(1, 3)
fig.set_size_inches(15, 5)  # Adjusted figure size

# Generate the circles dataset (two concentric rings)
X_circles, y = make_circles(n_samples=500, factor=0.2)

# Fit K-means with two clusters on the circles dataset
clustering_circles = KMeans(n_clusters=2).fit(X_circles)

# Plot the cluster assignments and show the silhouette score in the title
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering_circles.labels_, cmap='tab20b')
axes[0].set_title('Silhouette is: ' + str(round(silhouette_score(X_circles, clustering_circles.labels_), 3)))

# Generate the blobs dataset (two Gaussian blobs)
X_blobs, y = make_blobs(n_samples=500, centers=2)

# Fit K-means with two clusters on the blobs dataset
clustering_blobs = KMeans(n_clusters=2).fit(X_blobs)

# Plot the cluster assignments and show the silhouette score in the title
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering_blobs.labels_, cmap='tab20b')
axes[1].set_title('Silhouette is: ' + str(round(silhouette_score(X_blobs, clustering_blobs.labels_), 3)))

# Generate the moons dataset (two interleaving half-circles)
X_moons, y = make_moons(n_samples=500)

# Fit K-means with two clusters on the moons dataset
clustering_moons = KMeans(n_clusters=2).fit(X_moons)

# Plot the cluster assignments and show the silhouette score in the title
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering_moons.labels_, cmap='tab20b')
axes[2].set_title('Silhouette is: ' + str(round(silhouette_score(X_moons, clustering_moons.labels_), 3)))

# Display the plots
plt.show()

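The silhouette coefficient ranges from -1 to 1:

A value near -1 indicates that the data point has probably been assigned to the wrong cluster;
A value near 0 indicates that the data point lies on the boundary between clusters, i.e., the clusters overlap;
A value near 1 indicates that the data point is close to its own cluster and far from the others, i.e., the clusters are dense and well separated (thus the desirable value).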
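
Because the overall score is an average, it can hide individual points with low or negative values. A small sketch of inspecting the per-point values with scikit-learn's silhouette_samples follows; the dataset, the noise level, and the random_state values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_samples, silhouette_score

# Assumed example data: two slightly noisy half-moons
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per data point, each in the range [-1, 1]
per_point = silhouette_samples(X, labels)

print('lowest per-point value :', round(float(per_point.min()), 3))
print('highest per-point value:', round(float(per_point.max()), 3))
print('share of points below 0:', round(float(np.mean(per_point < 0)), 3))

# The overall silhouette score is the mean of the per-point values
print('mean of per-point values:', round(float(per_point.mean()), 3))
print('silhouette_score        :', round(float(silhouette_score(X, labels)), 3))
```

Points with values close to 0 or below are the ones worth looking at first when a clustering seems questionable.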

The Davies-Bouldin Index (DBI) is an internal clustering evaluation metric that measures the quality of clustering by considering both the separation between clusters and the compactness of clusters.

The DBI is calculated as follows:

Step 1. For each cluster, calculate the average distance between its centroid and all the data points in that cluster. This measures the cluster's scatter, i.e., how compact it is;

Step 2. For each pair of clusters, calculate the distance between their centroids. This measures the separation between the clusters;

Step 3. For each pair of clusters, calculate their similarity as the sum of the two scatters divided by the distance between the two centroids:

similarity(i, j) = (scatter_i + scatter_j) / distance(centroid_i, centroid_j)

Step 4. For each cluster, take the greatest similarity between it and any other cluster (excluding itself). This worst-case value is the DBI score for that cluster;

Step 5. Finally, calculate the overall DBI score by taking the average of the DBI scores of all the clusters.
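
A minimal sketch of the Davies-Bouldin computation, assuming the standard definition that scikit-learn's davies_bouldin_score follows: each cluster has a scatter (the mean distance of its points to its centroid), each pair of clusters has a ratio (scatter_i + scatter_j) / centroid distance, each cluster keeps its largest ratio, and the index is the mean of those values. The helper name manual_dbi is hypothetical and Euclidean distance is assumed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score


def manual_dbi(X, labels):
    """Davies-Bouldin index (hypothetical helper, Euclidean distance)."""
    unique_labels = np.unique(labels)
    n_clusters = len(unique_labels)

    # Centroid of every cluster
    centroids = np.array([X[labels == k].mean(axis=0) for k in unique_labels])

    # Scatter: mean distance between each cluster's points and its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == k] - centroids[idx], axis=1).mean()
        for idx, k in enumerate(unique_labels)
    ])

    worst_ratios = np.zeros(n_clusters)
    for i in range(n_clusters):
        # Ratio of combined scatter to centroid distance for every other cluster
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(n_clusters) if j != i
        ]
        # Keep the worst case (largest ratio) for cluster i
        worst_ratios[i] = max(ratios)

    # Overall index: average of the per-cluster worst cases
    return worst_ratios.mean()


X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print('manual :', round(manual_dbi(X, labels), 6))
print('sklearn:', round(davies_bouldin_score(X, labels), 6))
```

The two printed values should agree up to floating-point error.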


import matplotlib.pyplot as plt
from sklearn.metrics import davies_bouldin_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons, make_blobs, make_circles 
import warnings

warnings.filterwarnings('ignore')

fig, axes = plt.subplots(1, 3)
fig.set_size_inches(10, 5)

# Generate the circles dataset and fit K-means with two clusters
X_circles, y = make_circles(n_samples=500, factor=0.2)
clustering = KMeans(n_clusters=2).fit(X_circles)
# Plot the cluster assignments and show the DBI in the title
axes[0].scatter(X_circles[:, 0], X_circles[:, 1], c=clustering.labels_, cmap='tab20b')
axes[0].set_title('DBI is: '+ str(round(davies_bouldin_score(X_circles, clustering.labels_), 3)))

# Generate the blobs dataset and fit K-means with two clusters
X_blobs, y = make_blobs(n_samples=500, centers=2)
clustering = KMeans(n_clusters=2).fit(X_blobs)
# Plot the cluster assignments and show the DBI in the title
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=clustering.labels_, cmap='tab20b')
axes[1].set_title('DBI is: '+ str(round(davies_bouldin_score(X_blobs, clustering.labels_), 3)))

# Generate the moons dataset and fit K-means with two clusters
X_moons, y = make_moons(n_samples=500)
clustering = KMeans(n_clusters=2).fit(X_moons)
# Plot the cluster assignments and show the DBI in the title
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=clustering.labels_, cmap='tab20b')
axes[2].set_title('DBI is: '+ str(round(davies_bouldin_score(X_moons, clustering.labels_), 3)))

# Display the plots
plt.show()

A lower DBI value indicates better clustering: the clusters are compact and well separated. The minimum possible value is 0.
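
Both metrics can be used to compare candidate models, for example different numbers of clusters. The sketch below is one possible way to do this, not a prescribed procedure: the blob centers, the range of k, and the random_state values are all assumptions made for illustration. On three well-separated blobs like these, both metrics would be expected to favor k = 3.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Assumed example data: three well-separated blobs at hand-picked centers
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Store (silhouette, DBI) for this number of clusters
    results[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))
    print('k =', k,
          '| silhouette:', round(results[k][0], 3),
          '| DBI:', round(results[k][1], 3))

# Higher silhouette is better; lower DBI is better
best_by_silhouette = max(results, key=lambda k: results[k][0])
best_by_dbi = min(results, key=lambda k: results[k][1])

print('best k by silhouette:', best_by_silhouette)
print('best k by DBI       :', best_by_dbi)
```

Both metrics are distance-based and tend to favor compact, convex clusters, so it helps to confirm the chosen model with a plot when the data allows it.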
