K-means Clustering

K-means clustering is one of the most popular clustering algorithms for grouping similar data points in a dataset. Before running it, we choose a value k, which is the number of clusters (groups) that we want to identify in the data.

Let's briefly describe the stages of this algorithm:

Step 1. The algorithm initializes k centroids, typically by picking k random points from the dataset;

Step 2. Each data point is assigned to the nearest centroid based on a distance metric, such as Euclidean distance. This creates k clusters, each consisting of the data points that are closest to the same centroid;

Step 3. Each centroid is moved to the center (the mean) of the points assigned to its cluster;

Step 4. Steps 2 and 3 are repeated until convergence, that is, until the centroids no longer move.
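
The four steps above can be sketched in plain NumPy. This is only an illustrative sketch, not the scikit-learn implementation: the function name kmeans_sketch, its parameters, and the small dataset are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain NumPy sketch of the four steps (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Small illustrative dataset (assumed for this sketch)
X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0]], dtype=float)
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which cluster gets the number 0 and which gets 1 depends on the random initialization, so only the grouping itself is meaningful, not the label values.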

We can see that this algorithm is quite simple and intuitive, but it has some severe shortcomings:

the number of clusters must be chosen manually;
the result depends on the initial centroid positions;
the algorithm is highly sensitive to outliers.
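
The sensitivity to outliers follows from Step 3: a centroid is a mean, and a mean is pulled strongly by extreme values. A minimal sketch of this effect, using made-up points chosen only for illustration:

```python
import numpy as np

# Four points forming a tight cluster (illustrative values)
cluster = np.array([[1, 1], [2, 1], [1, 2], [2, 2]], dtype=float)
# A single far-away point assigned to the same cluster
outlier = np.array([[50, 50]], dtype=float)

centroid_clean = cluster.mean(axis=0)
centroid_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)

print(centroid_clean)         # [1.5 1.5]
print(centroid_with_outlier)  # [11.2 11.2]
```

One outlier moves the centroid far away from the four points that actually form the group.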

Let's look at a K-means implementation in Python using scikit-learn:


from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

# Create toy dataset to show K-means clustering model
X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0], [6,1], [10,3], [3,7], [4,5], [12,7]])
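# Fit K-means model for 2 clusters
# (n_init restarts guard against a poor initialization; random_state makes the run reproducible)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)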
# Print labels for train data
print('Train labels are: ', kmeans.labels_)
# Print coordinates of cluster centers
print('Cluster centers are: ', kmeans.cluster_centers_)
# Visualize the results of clustering
fig, axes = plt.subplots(1, 2)
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, cmap='tab20b')
axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=100)
axes[0].set_title('Train data points')

# Provide predictions for new data
predicted_labels = kmeans.predict([[10, 5], [4, 2], [3, 3], [6, 3]])
print('Predicted labels are: ', predicted_labels)
# Visualize prediction results
axes[1].scatter([10, 4, 3, 6], [5, 2, 3, 3], c=predicted_labels, s=50, cmap='tab20b')
axes[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=100)
axes[1].set_title('Test data points')
# Display the figure (needed when running as a script)
plt.show()

In the code above, we used the following:

KMeans class from sklearn.cluster. The n_clusters parameter sets the number of clusters to find in the data;
.fit(X) method of the KMeans class fits the model: it determines the clusters and their centers from the data X;
.labels_ attribute of the KMeans class stores the cluster number assigned to each sample of the training data (0, 1, 2, ...);
.cluster_centers_ attribute of the KMeans class stores the coordinates of the cluster centers found by the algorithm;
.predict() method of the KMeans class assigns new points to the nearest fitted cluster center.
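
To make the link between .cluster_centers_ and .predict() concrete, the sketch below recomputes the nearest center for each new point by hand and compares it with the output of .predict(). It reuses the toy data from the code above; the variable names new_points, distances, and nearest are introduced only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same toy dataset and new points as in the example above
X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
              [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]])
new_points = np.array([[10, 5], [4, 2], [3, 3], [6, 3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Euclidean distance from every new point to every fitted cluster center
distances = np.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
# Index of the closest center for each new point
nearest = distances.argmin(axis=1)

print(nearest)
print(kmeans.predict(new_points))
```

Both printed arrays should match, since predicting a label amounts to repeating Step 2 with the fitted centroids.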
