K-means Clustering

K-means clustering is one of the most popular clustering algorithms. It groups similar data points in a dataset together. Before running it, we choose a value k, which is the number of clusters (groups) we want to identify in the data.

Let's briefly go through the stages of the algorithm:

Step 1. The algorithm picks k random points from the dataset as the initial centroids.

Step 2. Each data point is assigned to its nearest centroid according to a distance metric, such as Euclidean distance. This creates k clusters, each consisting of the data points that are closest to the same centroid.

Step 3. Each centroid is moved to the center of its cluster, that is, to the mean of the points assigned to it.

Step 4. Steps 2 and 3 are repeated. The algorithm iteratively reassigns data points and updates the centroids until convergence, when the centroids no longer move.
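
The four steps above can be sketched in plain Python. This is only an illustrative sketch, not a production implementation: the function name `kmeans`, its parameters `max_iter` and `seed`, the toy `points` list, and the choice to leave an empty cluster's centroid where it was are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of equal-length tuples of floats."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: repeat until the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the first three points end up in one cluster and the last three in the other, and each centroid is the mean of its group.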

We can see that this algorithm is quite simple and intuitive, but it has some severe shortcomings:

  • We need to choose the number of clusters manually.
  • The result depends on the initial centroid positions.
  • The algorithm is highly sensitive to outliers.
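
The sensitivity to outliers follows from the fact that a centroid is the mean of its cluster. A small sketch with made-up one-dimensional values (the helper `centroid` and the numbers are assumptions for illustration) shows how a single extreme point drags the center away from the bulk of the data:

```python
def centroid(values):
    """Mean of 1-D values, i.e. the K-means center of a single cluster."""
    return sum(values) / len(values)


cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]  # one extreme point joins the cluster

clean_center = centroid(cluster)          # 2.0
shifted_center = centroid(with_outlier)   # 26.5
```

With the outlier included, the center no longer lies near any of the original three points.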

Let's look at a K-means implementation in Python:
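
A minimal sketch of such an implementation with scikit-learn is shown here. The synthetic dataset from `make_blobs`, the value `n_clusters=3`, the `n_init` and `random_state` settings, and the `new_points` values are illustrative assumptions rather than the course's original data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 two-dimensional points scattered around 3 centers (assumed).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# n_clusters sets k; n_init and random_state are fixed for reproducibility.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)

labels = kmeans.labels_            # cluster index of each training sample
centers = kmeans.cluster_centers_  # coordinates of the fitted cluster centers

# Assign new, unseen points to the nearest fitted cluster center.
new_points = [[0.0, 0.0], [5.0, 5.0]]
predicted = kmeans.predict(new_points)
```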

In the code above, we used the following:

  1. The KMeans class from sklearn.cluster. Its n_clusters parameter sets the number of clusters to find in the data.
  2. The .fit(X) method of the KMeans class fits the model: it determines the clusters and their centers from the data X.
  3. The .labels_ attribute of the KMeans class stores the cluster index of each training sample (0, 1, 2, ...).
  4. The .cluster_centers_ attribute of the KMeans class stores the coordinates of the cluster centers found by the algorithm.
  5. The .predict() method of the KMeans class predicts the cluster labels of new points.


Should we use the K-means algorithm for a clustering task if we can't manually determine the number of clusters into which our data should be divided?


Section 2. Chapter 1