K-means Clustering | Basic Clustering Algorithms

K-means Clustering

K-means is one of the most popular clustering algorithms for grouping similar data points in a dataset. Before running it, we choose a value k, which is the number of clusters (groups) that we want to identify in the data.

Let's briefly describe the stages of the algorithm:

Step 1. The algorithm picks k random points from the dataset as the initial cluster centers, called centroids;

Step 2. Each data point is assigned to its nearest centroid according to a distance metric, such as Euclidean distance. This creates k clusters, each consisting of the data points that are closest to the same centroid;

Step 3. Each centroid is moved to the center (the mean) of the points assigned to its cluster;

Step 4. Steps 2 and 3 are repeated. The algorithm iteratively reassigns data points and updates the centroids until convergence, that is, until the centroids no longer move.
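
The four steps can be sketched in plain NumPy. This is only an illustrative sketch, not the scikit-learn implementation: the function name kmeans_sketch, its parameters, and the choice to keep the old centroid when a cluster becomes empty are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-means loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
              [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]], dtype=float)
labels, centroids = kmeans_sketch(X, k=2)
print('Labels:', labels)
print('Centroids:', centroids)
```

Which cluster gets number 0 and which gets number 1 depends on the random initial centroids, so only the grouping itself is meaningful, not the label values.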

The algorithm is simple and intuitive, but it has some serious shortcomings:

  • we need to choose the number of clusters manually;
  • the result depends on the initial centroid positions;
  • the algorithm is highly affected by outliers, because each centroid is a mean of its points.
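
The dependence on initial centroids can be seen on a tiny example. The sketch below uses the KMeans class from scikit-learn (the same class as in the implementation example that follows) and passes hand-picked starting centroids through its init parameter with n_init=1, so each run uses exactly one initialization. The dataset and both starting positions are made up for illustration. The inertia_ attribute is the sum of squared distances from each point to its closest centroid, so lower is better.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points at the corners of a 4-by-1 rectangle (toy data for illustration)
X = np.array([[0, 0], [0, 1], [4, 0], [4, 1]], dtype=float)

# Two hand-picked sets of starting centroids (illustrative assumptions)
good_init = np.array([[0.0, 0.5], [4.0, 0.5]])  # one centroid near each short side
bad_init = np.array([[2.0, 0.0], [2.0, 1.0]])   # one centroid on each long side

# n_init=1 makes KMeans run once, starting from exactly the given centroids
good = KMeans(n_clusters=2, init=good_init, n_init=1).fit(X)
bad = KMeans(n_clusters=2, init=bad_init, n_init=1).fit(X)

print('Inertia with good start:', good.inertia_)  # about 1.0
print('Inertia with bad start: ', bad.inertia_)   # about 16.0
```

Both runs converge, but the second one gets stuck grouping the points into a top row and a bottom row, which is a much worse solution than the left and right pairs found by the first run.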

Let's look at a K-means implementation in Python:

    from sklearn.cluster import KMeans
    import numpy as np
    import matplotlib.pyplot as plt
    import warnings

    warnings.filterwarnings('ignore')

    # Create toy dataset to show K-means clustering model
    X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
                  [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]])

    # Fit K-means model for 2 clusters
    kmeans = KMeans(n_clusters=2).fit(X)

    # Print labels for train data
    print('Train labels are: ', kmeans.labels_)

    # Print coordinates of cluster centers
    print('Cluster centers are: ', kmeans.cluster_centers_)

    # Visualize the results of clustering
    fig, axes = plt.subplots(1, 2)
    axes[0].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, cmap='tab20b')
    axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=100)
    axes[0].set_title('Train data points')

    # Provide predictions for new data
    predicted_labels = kmeans.predict([[10, 5], [4, 2], [3, 3], [6, 3]])
    print('Predicted labels are: ', predicted_labels)

    # Visualize prediction results
    axes[1].scatter([10, 4, 3, 6], [5, 2, 3, 3], c=predicted_labels, s=50, cmap='tab20b')
    axes[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=100)
    axes[1].set_title('Test data points')

    # Display the figure when running as a script
    plt.show()

In the code above, we used the following:

  1. The KMeans class from sklearn.cluster. Its n_clusters parameter sets the number of clusters to find in the data;
  2. The .fit(X) method of the KMeans class fits the model: it determines the clusters and their centers from the data X;
  3. The .labels_ attribute of the KMeans class stores the cluster number of each training sample (0, 1, 2, ...);
  4. The .cluster_centers_ attribute of the KMeans class stores the coordinates of the cluster centers found by the algorithm;
  5. The .predict() method of the KMeans class predicts cluster labels for new points.
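
To make the link between .cluster_centers_ and .predict() concrete, the sketch below checks that a predicted label is simply the index of the nearest fitted center. The n_init and random_state arguments are real KMeans parameters that were not used in the code above; they are set here only as an assumption to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 3], [2, 1], [1, 5], [8, 4], [11, 3], [15, 0],
              [6, 1], [10, 3], [3, 7], [4, 5], [12, 7]])

# random_state fixes the initialization so repeated runs give the same labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

new_points = np.array([[10, 5], [4, 2], [3, 3], [6, 3]])

# Euclidean distance from every new point to every fitted cluster center
distances = np.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)

print('Nearest-center labels:', manual_labels)
print('predict() labels:     ', kmeans.predict(new_points))
```

The two printed arrays should match, since predicting with a fitted K-means model is the assignment step (Step 2) applied to new data with the centroids held fixed.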

Should we use K-means algorithm for clustering tasks if we can't manually determine the number of clusters into which our data should be divided?
