Course Content

Cluster Analysis

## Cluster Analysis

# DBSCAN Clustering

**DBSCAN (Density-Based Spatial Clustering of Applications with Noise)** is a popular density-based algorithm in machine learning and data mining. It is particularly useful for identifying clusters of arbitrary shape in data containing noise or outliers. The algorithm works by defining clusters as dense regions of points separated by less dense regions.
The DBSCAN algorithm takes two main parameters: **epsilon (eps)** and **min_samples**. Epsilon defines the radius around each point, and min_samples defines the minimum number of points required to form a dense region.

The algorithm works with three different types of points:

**core point**has at least min_samples points in its eps radius;**border point**has less than min_samples points in its eps radius, but at least one of these points inside the radius is the core point;**noise point**is the same as the border point but has no core point in its eps radius.

By classifying points as core, border, or noise, the DBSCAN algorithm can identify and separate clusters from noise points.

To understand DBSCAN, we also need to mention two definitions:

**Reachability** in terms of density establishes a point to be reachable from another if it lies within a certain distance (eps) from it. But reachability was also used in Mean-shift clustering: in previous chapters, we verified that this algorithm couldn't work with arbitrary-shaped clusters.

To detect clusters with the arbitrary shape we need **connectivity**: it involves a transitivity-based approach to determine whether points are located in a particular cluster. For example, p and q points could be connected if q->p1->p2->p3->p, where x->y means y is reachable from x.

To implement DBSCAN, you have to perform the following steps:

**Step 1**. Classify the points as core, boundary, and noise;

**Step 2**. Discard noise(or assign noise into separate cluster);

**Step 3**. Assign cluster to a core point;

**Step 4**. Merge all the connected core points of a chosen in step 3 point with this point's cluster;

**Step 5**. Visit the next unclustered core point and repeat steps 3-4 till there are no unclustered core points in the dataset;

**Step 6**. Merge boundary points with the nearest core point cluster.

Let's look at how to implement DBSCAN in Python:

`from sklearn.datasets import make_blobs import matplotlib.pyplot as plt from sklearn.cluster import DBSCAN # Create blobs dataset for clustering X, y = make_blobs(n_samples=500, centers=3) # Specify eps and min_samples paremeters of DBSCAN model and train model clustering = DBSCAN(eps=0.85, min_samples=20).fit(X) # Visualize results fig, axes = plt.subplots(1, 2) axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='tab20b') axes[0].set_title('Real clusters') axes[1].scatter(X[:, 0], X[:, 1], c=clustering.labels_, cmap='tab20b') axes[1].set_title('Clusters with DBSCAN') print('Clustering labels are: ', set(clustering.labels_))`

Note

DBSCAN algorithm allocates noise to a separate cluster with -1 label.

In the code above, we used `DBSCAN`

class from `sklearn.cluster`

module to implement DBSCAN algorithm. We also specified two parameters:

`eps`

parameter determines the reachability radius;`min_samples`

parameter determines the minimum amount of samples in reachability radius to classify points as core, border, or noise;- we use
`.fit()`

method to provide clustering.

Note

`DBSCAN`

class has no implementation for`.predict()`

method, so we can not provide clustering for new points without retraining the model.

Thanks for your feedback!