Cluster Analysis
What Is Unique About Clustering?
Clustering is a type of machine learning in which a model is trained on unlabeled data, with no predefined target variable or correct output; this setting is called unsupervised learning. The goal is to identify hidden patterns or structures in the data without prior knowledge of the output.
This changes the approach to learning. In supervised learning, we minimize the difference between the predicted value and the actual value (the label). In unsupervised learning, we must first decide which function to minimize for the specific problem: for example, cross-entropy when working with images, various mathematical norms when working with numerical data, or density-based criteria when using statistical methods.
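Put simply, we need to choose the criterion by which objects are considered close to each other for clustering. Most algorithms use the ordinary Euclidean distance, which for two points x = (x_1, ..., x_n) and y = (y_1, ..., y_n) is:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$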
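As a minimal sketch of this distance in Python (assuming NumPy is available; the function name `euclidean_distance` is illustrative and not part of the course material), the computation might look like this:

```python
import numpy as np

def euclidean_distance(x, y):
    """Return the straight-line (L2) distance between two points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Square the coordinate differences, sum them, take the square root.
    return float(np.sqrt(np.sum((x - y) ** 2)))

a = [1.0, 2.0]
b = [4.0, 6.0]
print(euclidean_distance(a, b))  # 5.0, since sqrt(3^2 + 4^2) = 5
```

The same function works for any number of features, as long as both points have the same length.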
Two related measures are also commonly used: the intra-cluster distance (the distance between a data item and the centroid of its own cluster) and the inter-cluster distance (the distance between data items in different clusters). The smaller the intra-cluster distance and the greater the inter-cluster distance, the better the algorithm has handled the clustering task.
Now, let's discuss the advantages and disadvantages of clustering.
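To make the two measures concrete, here is a rough sketch that averages them over a small toy dataset. It assumes NumPy, Euclidean distance, and cluster labels that are already known; the function names and the data are hypothetical, and averaging is only one of several reasonable ways to summarize these distances.

```python
import numpy as np

def mean_intra_cluster_distance(points, labels):
    """Average distance from each point to the centroid of its own cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centroid = members.mean(axis=0)
        total += np.linalg.norm(members - centroid, axis=1).sum()
    return float(total / len(points))

def mean_inter_cluster_distance(points, labels):
    """Average distance over all pairs of points in different clusters.

    Assumes there are at least two distinct clusters.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix via broadcasting, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Keep only pairs whose labels differ.
    different = labels[:, None] != labels[None, :]
    return float(dists[different].mean())

# Toy data: two tight groups that are far apart.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]

intra = mean_intra_cluster_distance(points, labels)
inter = mean_inter_cluster_distance(points, labels)
print(intra)  # 1.0
print(inter)  # about 10.1
```

Here the intra-cluster value is small and the inter-cluster value is large, which is the pattern expected from a good clustering.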
Pros:
- clustering helps to solve machine learning problems without requiring us to label data, which can be time-consuming;
- clustering algorithms can help us to enhance data quality by detecting outliers, reducing data dimensions, and engineering features;
- clustering can help us identify valuable patterns and insights in our data;
- clustering algorithms can work with data that doesn't follow a consistent pattern over time.
Cons:
- clustering can be expensive because it may require human experts to interpret the patterns and connect them to domain knowledge;
- there's no guarantee that clustering will provide useful results since we don't have labeled data to validate the outcomes;
- the quality of clustering results can vary depending on the method used.