In real-life tasks with real data, it can be difficult to understand which algorithm to use and whether the results are good enough. To determine this, several techniques are used:
- Relative cluster validation, which evaluates the clustering structure by varying the parameter values of the same algorithm (e.g., the number of clusters k for K-means, the linkage criterion for agglomerative clustering, or eps and min_samples for DBSCAN).
- Internal and external cluster validation, which estimates clustering quality with internal metrics (computed from the data itself, e.g., the silhouette coefficient) and external metrics (computed against known ground-truth labels, e.g., the adjusted Rand index).
- Rule of thumb: a stable group should be preserved when the clustering method changes. For example, if the partitions obtained with the agglomerative method and with K-means coincide by more than 70%, the assumption of stability is accepted.
- Using resampling methods to evaluate the stability of the clustering split:
  - whether the split is stable across different subsamples of the original dataset;
  - whether the split is stable after some samples are deleted from the original dataset;
  - whether the split is stable after the order of the elements is changed.
- Trying to interpret the clustering results in terms of the domain area: can the clusters be explained, and is there any logic behind them?
In the context of data analysis, the domain area refers to the specific field or industry that the data belongs to or is being used for. Examples of domain areas include healthcare, finance, marketing, transportation, and many others.
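The relative validation idea above can be sketched in code. This is a minimal illustration, assuming scikit-learn and synthetic blob data (neither is prescribed by the lesson): the same algorithm (K-means) is run with several values of k, and an internal metric (the silhouette coefficient) is compared across runs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Relative validation: vary k for the same algorithm and compare
# an internal metric (silhouette) across the runs
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the k with the highest silhouette score
```

On data like this, the silhouette score peaks at the true number of blobs; on real data the peak is often less pronounced, which is exactly why several validation techniques are combined.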
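The rule of thumb about agreement between methods can also be checked numerically. A sketch, again assuming scikit-learn and synthetic data: partitions from K-means and agglomerative clustering are compared with the adjusted Rand index (ARI), an external metric that is 1 for identical partitions (up to label renaming) and near 0 for random ones.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# High agreement between two different algorithms supports
# the assumption that the clustering is stable
agreement = adjusted_rand_score(km_labels, agg_labels)
print(agreement)
```

Note that ARI compares partitions directly, so the "coincide by more than 70%" rule of thumb can be read as requiring an agreement score above roughly 0.7.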
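The resampling checks can be sketched the same way. Under the same assumptions (scikit-learn, synthetic data), the split on the full dataset is compared, via ARI, with splits obtained on random subsamples; an average ARI close to 1 suggests the split is stable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
base = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

rng = np.random.default_rng(1)
aris = []
for _ in range(5):
    # Draw an 80% subsample without replacement
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X[idx])
    # Compare the two partitions on the shared points only
    aris.append(adjusted_rand_score(base[idx], sub))

print(np.mean(aris))  # close to 1 for a stable split
```

The other two checks from the list (deleting samples, shuffling the order of elements) follow the same pattern: recluster the modified dataset and compare the resulting partition with the original one on the common points.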
Can we consider the results of clustering to be stable if different algorithms produce completely different clusters?