In real-life tasks with real data, it can be difficult to understand which algorithm to use and whether the results are good enough. To determine this, several techniques are used:
- Relative cluster validation, which evaluates the clustering structure by varying the parameter values of the same algorithm (e.g., the number of clusters k for K-means, the linkage criterion for agglomerative clustering, or eps and min_samples for DBSCAN).
- Internal and external cluster validation, which estimates clustering quality with internal metrics (computed from the data itself, e.g., the silhouette coefficient) and external metrics (computed against known ground-truth labels, e.g., the adjusted Rand index).
- Rule of thumb: a stable group should be preserved when the clustering method changes. For example, if the partitions obtained with the agglomerative method and with K-means coincide by more than 70%, the assumption of stability is accepted.
- Using resampling methods to evaluate the stability of the clustering split:
  - whether the split is stable across different subsamples of the original dataset;
  - whether the split is stable after some samples are deleted from the original dataset;
  - whether the split is stable after the order of the elements is changed.
- Trying to interpret the clustering results in terms of the domain area: can the clusters be explained, and is there any logic behind them?
In the context of data analysis, the domain area refers to the specific field or industry that the data belongs to or is being used for. Examples of domain areas include healthcare, finance, marketing, transportation, and many others.
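The relative validation idea above can be sketched in code. This is a minimal illustration, assuming scikit-learn and synthetic blob data (neither is prescribed by the lesson): the same algorithm (K-means) is run with several values of k, and an internal metric (the silhouette coefficient) is compared across runs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Relative validation: vary k for the same algorithm and compare
# an internal metric (silhouette) across the runs
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the k with the highest silhouette score
```

On data like this, the silhouette score peaks at the true number of blobs; on real data the peak is often less pronounced, which is exactly why several validation techniques are combined.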
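The rule of thumb about agreement between methods can also be checked numerically. A sketch, again assuming scikit-learn and synthetic data: partitions from K-means and agglomerative clustering are compared with the adjusted Rand index (ARI), an external metric that is 1 for identical partitions (up to label renaming) and near 0 for random ones.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# High agreement between two different algorithms supports
# the assumption that the clustering is stable
agreement = adjusted_rand_score(km_labels, agg_labels)
print(agreement)
```

Note that ARI compares partitions directly, so the "coincide by more than 70%" rule of thumb can be read as requiring an agreement score above roughly 0.7.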
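The resampling checks can be sketched the same way. Under the same assumptions (scikit-learn, synthetic data), the split on the full dataset is compared, via ARI, with splits obtained on random subsamples; an average ARI close to 1 suggests the split is stable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
base = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

rng = np.random.default_rng(1)
aris = []
for _ in range(5):
    # Draw an 80% subsample without replacement
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X[idx])
    # Compare the two partitions on the shared points only
    aris.append(adjusted_rand_score(base[idx], sub))

print(np.mean(aris))  # close to 1 for a stable split
```

The other two checks from the list (deleting samples, shuffling the order of elements) follow the same pattern: recluster the modified dataset and compare the resulting partition with the original one on the common points.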
Can we consider the results of clustering to be stable if different algorithms produce completely different clusters?