Notice: This page requires JavaScript to function properly.
Please enable JavaScript in your browser settings or update your browser.
DBSCAN Clustering | Basic Clustering Algorithms
Cluster Analysis
course content

Contenido del Curso

Cluster Analysis

Cluster Analysis

1. What is Clustering?
2. Basic Clustering Algorithms
3. How to choose the best model?

bookDBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based algorithm in machine learning and data mining. It is particularly useful for identifying clusters of arbitrary shape in data containing noise or outliers. The algorithm works by defining clusters as dense regions of points separated by less dense regions. The DBSCAN algorithm takes two main parameters: epsilon (eps) and min_samples. Epsilon defines the radius around each point, and min_samples defines the minimum number of points required to form a dense region.

The algorithm works with three different types of points:

  • core point has at least min_samples points in its eps radius;
  • border point has less than min_samples points in its eps radius, but at least one of these points inside the radius is the core point;
  • noise point is the same as the border point but has no core point in its eps radius.

By classifying points as core, border, or noise, the DBSCAN algorithm can identify and separate clusters from noise points.

To understand DBSCAN, we also need to mention two definitions:

Reachability in terms of density establishes a point to be reachable from another if it lies within a certain distance (eps) from it. But reachability was also used in Mean-shift clustering: in previous chapters, we verified that this algorithm couldn't work with arbitrary-shaped clusters.

To detect clusters with the arbitrary shape we need connectivity: it involves a transitivity-based approach to determine whether points are located in a particular cluster. For example, p and q points could be connected if q->p1->p2->p3->p, where x->y means y is reachable from x.

To implement DBSCAN, you have to perform the following steps:

Step 1. Classify the points as core, boundary, and noise;

Step 2. Discard noise(or assign noise into separate cluster);

Step 3. Assign cluster to a core point;

Step 4. Merge all the connected core points of a chosen in step 3 point with this point's cluster;

Step 5. Visit the next unclustered core point and repeat steps 3-4 till there are no unclustered core points in the dataset;

Step 6. Merge boundary points with the nearest core point cluster.

Let's look at how to implement DBSCAN in Python:

1234567891011121314
from sklearn.datasets import make_blobs import matplotlib.pyplot as plt from sklearn.cluster import DBSCAN # Create blobs dataset for clustering X, y = make_blobs(n_samples=500, centers=3) # Specify eps and min_samples paremeters of DBSCAN model and train model clustering = DBSCAN(eps=0.85, min_samples=20).fit(X) # Visualize results fig, axes = plt.subplots(1, 2) axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='tab20b') axes[0].set_title('Real clusters') axes[1].scatter(X[:, 0], X[:, 1], c=clustering.labels_, cmap='tab20b') axes[1].set_title('Clusters with DBSCAN') print('Clustering labels are: ', set(clustering.labels_))
copy

Note

DBSCAN algorithm allocates noise to a separate cluster with -1 label.

In the code above, we used DBSCAN class from sklearn.cluster module to implement DBSCAN algorithm. We also specified two parameters:

  • eps parameter determines the reachability radius;
  • min_samples parameter determines the minimum amount of samples in reachability radius to classify points as core, border, or noise;
  • we use .fit() method to provide clustering.

Note

DBSCAN class has no implementation for .predict() method, so we can not provide clustering for new points without retraining the model.

What makes the DBSCAN algorithm able to detect clusters of arbitrary shapes?

What makes the DBSCAN algorithm able to detect clusters of arbitrary shapes?

Selecciona la respuesta correcta

¿Todo estuvo claro?

¿Cómo podemos mejorarlo?

¡Gracias por tus comentarios!

Sección 2. Capítulo 6
some-alt