Cluster Analysis
Agglomerative Clustering
Agglomerative clustering is a hierarchical clustering algorithm used in machine learning and data mining: it groups similar data points into nested clusters based on their pairwise distances. The algorithm consists of three steps (sketched in code after the list):
Step 1. Make each data point a cluster;
Step 2. Take the two closest clusters and make them one cluster;
Step 3. Repeat step 2 until there is only one cluster;
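To make these steps concrete, here is a minimal sketch of the merging loop in pure NumPy. The toy points and the use of the single-linkage (shortest pairwise) distance are assumptions for the demonstration, not scikit-learn's actual implementation:

import numpy as np

# Toy dataset: five 2-D points (hypothetical example data)
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])

# Step 1: every point starts as its own cluster (stored as point indices)
clusters = [[i] for i in range(len(points))]

# Steps 2-3: repeatedly merge the two closest clusters until one remains
while len(clusters) > 1:
    best, best_dist = None, np.inf
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            # Single-linkage distance: shortest distance between any two points
            d = min(np.linalg.norm(points[i] - points[j])
                    for i in clusters[a] for j in clusters[b])
            if d < best_dist:
                best_dist, best = d, (a, b)
    a, b = best
    print(f'Merging {clusters[a]} and {clusters[b]} at distance {best_dist:.2f}')
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]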
But how can we obtain a useful partition if the output is always a single cluster? Unlike other clustering algorithms that require the number of clusters to be specified in advance, hierarchical clustering produces a hierarchical tree-like structure called a dendrogram, which allows the number of clusters to be determined after the clustering process.
A dendrogram is a diagram that shows the hierarchical relationship between objects. Using the dendrogram, we can track in what order the clusters are combined during the execution of the algorithm. It is also with the help of the dendrogram that the optimal number of clusters for our data is determined:
- Identify the longest vertical line: look for the longest vertical line in the dendrogram that is not crossed by any horizontal line. It represents the largest distance between merged clusters;
- Draw a horizontal line: draw a horizontal line through the vertical line identified in the previous step (like the red dashed cut line in the sketch after this list). It should sit at a height where the branches of the dendrogram are relatively long, indicating a natural separation between clusters;
- The number of clusters is determined by the number of vertical lines that the horizontal line intersects.
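A quick way to build and inspect a dendrogram in practice is SciPy's hierarchy utilities. The sketch below is a minimal illustration assuming hypothetical blob data and an assumed cut height of 5; in a real analysis the cut height is read off the plot itself:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Hypothetical example data: three well-separated blobs
X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Build the merge history (here with average linkage) and draw the dendrogram
Z = linkage(X, method='average')
dendrogram(Z)

# Assumed cut height for this toy data: the red dashed horizontal line
plt.axhline(y=5, color='red', linestyle='--')
plt.title('Dendrogram')
plt.show()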
Note
Even with a dendrogram, there may be several options for splitting the data into clusters: the most appropriate number of clusters can be chosen by combining the dendrogram with domain knowledge.
Note
In the agglomerative algorithm, we can also pre-set the number of clusters. In this case, the dendrogram is cut so that the number of resulting clusters matches the value given in advance.
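For instance, a minimal sketch (assuming hypothetical blob data) that asks for exactly three clusters up front:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical example data: three blobs
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Pre-set the number of clusters instead of cutting by distance
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)  # one label per data point, values 0, 1 or 2
print(labels[:10])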
A very important parameter of the algorithm is linkage: the method used to calculate the distance between two clusters. Several linkage methods can be used (the sketch after this list compares them), including:
- Single Linkage: This method calculates the distance between two clusters as the shortest distance between any two points in the two clusters;
- Complete Linkage: This method calculates the distance between two clusters as the longest distance between any two points in the two clusters;
- Average Linkage: This method calculates the distance between two clusters as the average distance between all pairs of points in the two clusters.
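Since the choice of linkage can noticeably change the resulting clusters, the short sketch below (again assuming hypothetical blob data) fits the same data with each of the three methods and compares the cluster sizes:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical example data: three blobs
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Fit the same data with each linkage method and compare cluster sizes
for method in ['single', 'complete', 'average']:
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X)
    print(method, np.bincount(labels))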
Finally, let's look at the Python implementation of agglomerative clustering:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Firstly, we will create our dataset
X, y = make_blobs(n_samples=500, cluster_std=1, centers=4, random_state=170)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.matmul(X, transformation)

# In this line we will specify parameters of our Agglomerative model
agglomerative = AgglomerativeClustering(linkage='single', distance_threshold=0.6, n_clusters=None)

# Training our model
agglomerative.fit(X_aniso)

# Providing visualization of results (we plot the transformed data the model was fit on)
fig, axes = plt.subplots(1, 2)
axes[0].scatter(X_aniso[:, 0], X_aniso[:, 1], c=agglomerative.labels_, s=50, cmap='tab20b')
axes[0].set_title('Clustered data')
axes[1].scatter(X_aniso[:, 0], X_aniso[:, 1], c=y, s=50, cmap='tab20b')
axes[1].set_title('Real data')
plt.show()
In the code above, we use the AgglomerativeClustering class to create a model, where:
- the linkage parameter determines the linkage type: 'complete', 'average' or 'single';
- distance_threshold is the linkage distance threshold at or above which clusters will not be merged.
Note
The AgglomerativeClustering class has no implementation of the .predict() method: we have to re-train the model every time we want to cluster new data.
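In practice this means re-fitting on the combined data. A minimal sketch of that pattern, assuming hypothetical old and new data:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical old and new data
X_old, _ = make_blobs(n_samples=100, centers=3, random_state=1)
X_new, _ = make_blobs(n_samples=20, centers=3, random_state=2)

model = AgglomerativeClustering(n_clusters=3)

# There is no .predict(): to label new points, re-fit on all data together
labels = model.fit_predict(np.vstack([X_old, X_new]))
new_labels = labels[len(X_old):]  # labels for the new points only
print(new_labels)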