Cluster Analysis
K-means Clustering
K-means clustering is one of the most popular clustering algorithms for grouping similar data points in a dataset. It starts from a value k, chosen in advance, which sets the number of clusters (groups) that we want to identify in the data.
Let's briefly go through the steps of the algorithm:
Step 1. The algorithm initializes k random points in the dataset, called centroids;
Step 2. Each data point is assigned to the nearest centroid according to a distance metric, such as Euclidean distance. This creates k clusters, each consisting of the data points that are closest to the same centroid;
Step 3. Each centroid is moved to the center (the mean) of the points assigned to its cluster;
Step 4. Steps 2 and 3 are repeated. The algorithm iteratively updates the centroids and reassigns data points until convergence, when the centroids no longer move.
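The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters `n_iter` and `seed`, and the choice to draw the initial centroids from the data points themselves are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not a library function)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
              [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]])
labels, centroids = kmeans_sketch(X, k=2)
print('Labels:', labels)
print('Centroids:', centroids)
```

Changing `seed` changes the starting centroids, so the cluster numbering (and sometimes the clusters themselves) may differ between runs.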
We can see that this algorithm is quite simple and intuitive, but it has some severe shortcomings:
- we need to choose the number of clusters manually;
- the result depends on the initial centroid positions;
- the algorithm is highly affected by outliers.
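Each of these shortcomings can be observed on a small example. The sketch below uses scikit-learn's `KMeans` on a toy dataset; the dataset, the range of k values, the seeds, and the outlier at (100, 100) are arbitrary choices made for illustration. Comparing the `inertia_` attribute (the sum of squared distances of samples to their closest center) across values of k is one common heuristic for choosing k, not the only one.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
              [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]], dtype=float)

# Shortcoming 1: k is chosen manually. One heuristic is to compare
# the inertia for several candidate values of k.
inertia_by_k = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia_by_k[k] = model.inertia_
    print(f'k={k}: inertia={model.inertia_:.1f}')

# Shortcoming 2: with a single random initialization, the outcome
# may differ from one seed to another.
seed_inertias = [
    KMeans(n_clusters=3, init='random', n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(5)
]
print('Inertia per seed:', np.round(seed_inertias, 1))

# Shortcoming 3: a single far-away point can take over a whole cluster.
X_outlier = np.vstack([X, [[100.0, 100.0]]])
centers_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_
centers_outlier = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_outlier).cluster_centers_
print('Centers without outlier:', centers_clean)
print('Centers with outlier:', centers_outlier)
```

Setting `n_init` above 1 reruns the algorithm from several initializations and keeps the run with the lowest inertia, which reduces (but does not remove) the dependence on the starting centroids.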
Let's look at a K-means implementation in Python using scikit-learn:
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

# Create toy dataset to show K-means clustering model
X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
              [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]])

# Fit K-means model for 2 clusters
kmeans = KMeans(n_clusters=2).fit(X)

# Print labels for train data
print('Train labels are: ', kmeans.labels_)

# Print coordinates of cluster centers
print('Cluster centers are: ', kmeans.cluster_centers_)

# Visualize the results of clustering
fig, axes = plt.subplots(1, 2)
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, cmap='tab20b')
axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=100)
axes[0].set_title('Train data points')

# Provide predictions for new data
predicted_labels = kmeans.predict([[10, 5], [4, 2], [3, 3], [6, 3]])
print('Predicted labels are: ', predicted_labels)

# Visualize prediction results
axes[1].scatter([10, 4, 3, 6], [5, 2, 3, 3], c=predicted_labels, s=50, cmap='tab20b')
axes[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=100)
axes[1].set_title('Test data points')
In the code above, we used the following:
- the KMeans class from sklearn.cluster;
- the n_clusters parameter, which sets the number of clusters to look for in the data;
- the .fit(X) method of the KMeans class, which fits the model: it determines the clusters and their centers from the data X;
- the .labels_ attribute of the KMeans class, which stores the cluster number of each training sample (cluster 0, cluster 1, cluster 2, ...);
- the .cluster_centers_ attribute of the KMeans class, which stores the coordinates of the cluster centers found by the algorithm;
- the .predict() method of the KMeans class, which predicts the labels of new points.
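To make the link between `.cluster_centers_` and `.predict()` concrete, the sketch below assigns new points to the nearest fitted center by hand and compares the result with `.predict()`. The `n_init` and `random_state` values and the variable names `new_points` and `manual_labels` are choices made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
              [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[10, 5], [4, 2], [3, 3], [6, 3]])

# Euclidean distance from every new point to every fitted cluster center
distances = np.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
# Index of the nearest center for each new point
manual_labels = distances.argmin(axis=1)

print('Nearest-center labels:', manual_labels)
print('predict() labels:     ', kmeans.predict(new_points))
```

For a fitted K-means model, predicting a label amounts to finding the closest cluster center, so both printed arrays should match.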