Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
Agglomerative Clustering | Basic Clustering Algorithms
Cluster Analysis
course content

Conteúdo do Curso

Cluster Analysis

Cluster Analysis

1. What is Clustering?
2. Basic Clustering Algorithms
3. How to choose the best model?

bookAgglomerative Clustering

Agglomerative clustering is a hierarchical clustering algorithm used in machine learning and data mining: it groups similar data points into nested clusters based on their pairwise distances. The algorithm consists of three different steps:

Step 1. Make each data point a cluster;

Step 2. Take the two closest clusters and make them one cluster;

Step 3. Repeat step 2 until there is only one cluster;

But how to carry out clustering if the output is always one cluster? Unlike other clustering algorithms that require the number of clusters to be specified in advance, hierarchical clustering produces a hierarchical tree-like structure called a dendrogram that allows the number of clusters to be determined after the clustering process.

Dendrogram is a diagram that shows the hierarchical relationship between objects Thus, using the dendrogram, we can track with what sequence which clusters are combined during the execution of the algorithm. And it is with the help of the dendrogram that the optimal number of clusters into which our data can be divided is determined:

  1. Identify the longest vertical line: Look for the longest line that doesn't cross horizontal lines in the dendrogram. This represents the largest distance between any two clusters in the dataset;
  2. Draw a horizontal line: Draw a horizontal line through the longest vertical line identified in step 2(the red dotted line in the image below). This line is drawn at a height where the dendrogram branches are relatively long and clear, indicating a natural separation between clusters;
  3. The number of clusters will be determined by the number of times the horizontal line intersects with the dendrogram.

Note

Even with a dendrogram, there may be several options for splitting data into clusters: the most appropriate number of clusters can be defined by combing a dendrogram with knowledge from the domain area.

Note

In the agglomerative algorithm, we can also pre-set the number of clusters. In this case, the dendrogram will be divided into clusters in such a way that the number of clusters in the result is the same as the a priori given.

A very important parameter of the algorithm is linkage: the method used to calculate the distance between two clusters. There are several methods of linkage that can be used, including:

  1. Single Linkage: This method calculates the distance between two clusters as the shortest distance between any two points in the two clusters;
  2. Complete Linkage: This method calculates the distance between two clusters as the longest distance between any two points in the two clusters;
  3. Average Linkage: This method calculates the distance between two clusters as the average distance between all pairs of points in the two clusters.

Finally, let's look at the Python implementation of agglomerative clustering:

123456789101112131415161718192021
from sklearn.cluster import AgglomerativeClustering import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_blobs # Firstly, we will create our dataset X, y = make_blobs(n_samples=500, cluster_std=1, centers=4, random_state=170) transformation = [[0.6, -0.6], [-0.4, 0.8]] X_aniso = np.matmul(X, transformation) # In this line we will specify parameters of our Agglomerative model agglomerative = AgglomerativeClustering(linkage='single', distance_threshold=0.6, n_clusters=None) # Training our model agglomerative.fit(X_aniso) # Providing visualization of results fig, axes=plt.subplots(1,2) axes[0].scatter(X[:, 0], X[:, 1], c=agglomerative.labels_, s=50, cmap='tab20b') axes[0].set_title('Clustered data') axes[1].scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='tab20b') axes[1].set_title('Real data')
copy

In the code above, we use AgglomerativeClustering class to create a model, where:

  • linkage parameter determines the linkage type: 'complete', 'average', 'single';
  • distance_threshold is the linkage distance threshold at or above which clusters will not be merged.

Note

AgglomerativeClustering class has no implementation for .predict() method: we have to train the model every time we want to cluster new data.

How does the number of clusters to split the data is determined in agglomerative algorithm?

How does the number of clusters to split the data is determined in agglomerative algorithm?

Selecione a resposta correta

Tudo estava claro?

Como podemos melhorá-lo?

Obrigado pelo seu feedback!

Seção 2. Capítulo 3
We're sorry to hear that something went wrong. What happened?
some-alt