DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based algorithm in machine learning and data mining. It is particularly useful for identifying clusters of arbitrary shape in data containing noise or outliers. The algorithm works by defining clusters as dense regions of points separated by less dense regions. The DBSCAN algorithm takes two main parameters: epsilon (eps) and min_samples. Epsilon defines the radius around each point, and min_samples defines the minimum number of points required to form a dense region.
The algorithm works with three different types of points:
- core point has at least min_samples points in its eps radius.
- border point has less than min_samples points in its eps radius, but at least one of these points inside the radius is the core point.
- noise point is the same as the border point but has no core point in its eps radius.
By classifying points as core, border, or noise, the DBSCAN algorithm can identify and separate clusters from noise points.
To understand DBSCAN, we also need to mention two definitions:
Reachability in terms of density establishes a point to be reachable from another if it lies within a certain distance (eps) from it. But reachability was also used in Mean-shift clustering: in previous chapters, we verified that this algorithm couldn't work with arbitrary-shaped clusters.
To detect clusters with the arbitrary shape we need connectivity: it involves a transitivity-based approach to determine whether points are located in a particular cluster. For example, p and q points could be connected if q->p1->p2->p3->p, where x->y means y is reachable from x.
To implement DBSCAN, you have to perform the following steps:
Step 1. Classify the points as core, boundary, and noise.
Step 2. Discard noise(or assign noise into separate cluster).
Step 3. Assign cluster to a core point.
Step 4. Merge all the connected core points of a chosen in step 3 point with this point's cluster.
Step 5. Visit the next unclustered core point and repeat steps 3-4 till there are no unclustered core points in the dataset.
Step 6. Merge boundary points with the nearest core point cluster.
Let's look at how to implement DBSCAN in Python:
DBSCAN algorithm allocates noise to a separate cluster with -1 label.
In the code above, we used
DBSCAN class from
sklearn.cluster module to implement DBSCAN algorithm. We also specified two parameters:
epsparameter determines the reachability radius.
min_samplesparameter determines the minimum amount of samples in reachability radius to classify points as core, border, or noise.
- we use
.fit()method to provide clustering.
DBSCANclass has no implementation for
.predict()method, so we can not provide clustering for new points without retraining the model.
What makes the DBSCAN algorithm able to detect clusters of arbitrary shapes?
Select the correct answer
Everything was clear?