Clustering Demystified
How many Clusters?
You may be wondering: how many clusters should we actually use? One common way to decide is the so-called "elbow method".
The elbow method is a heuristic for choosing the number of clusters in k-means clustering. It consists of plotting the within-cluster variation (measured here by the inertia) as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The "elbow" is the point where the curve bends: before it, each additional cluster reduces the inertia sharply; after it, the inertia keeps decreasing, but at a much slower rate. This point is taken as a reasonable number of clusters because adding more clusters beyond it does not significantly improve the fit. Note that the elbow is judged by eye and is not always clear-cut, so the result is a guideline rather than an exact answer.
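As a rough illustration of what "picking the elbow" means, the sketch below looks for the sharpest bend in a curve by taking the largest second difference. This is only one simple heuristic, not something the lesson prescribes, and the inertia values are made up for the example; the function name elbow_by_second_difference is hypothetical.

```python
# Hypothetical inertia values for k = 1..8 (made up for illustration only).
ks = list(range(1, 9))
inertias = [1000.0, 520.0, 210.0, 180.0, 160.0, 145.0, 135.0, 128.0]


def elbow_by_second_difference(ks, values):
    """Return the k at which the curve bends most sharply.

    The second difference values[i-1] - 2*values[i] + values[i+1] is large
    where a steep drop is followed by a shallow one.
    """
    best_k, best_bend = None, float("-inf")
    # The first and last points have no neighbor on one side, so skip them.
    for i in range(1, len(values) - 1):
        bend = values[i - 1] - 2 * values[i] + values[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


print(elbow_by_second_difference(ks, inertias))  # prints 3
```

A numeric rule like this can be misled by the scale of the first few drops, so it is best treated as a sanity check alongside the plot rather than a replacement for looking at it.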
Methods description
- range(start, end): This generates a sequence of numbers from start (inclusive) to end (exclusive), representing the range of possible cluster numbers to be tested;
- kmeans.inertia_: This attribute of the KMeans object retrieves the inertia value calculated for the current clustering configuration;
- cs: This is an empty list that will store the "inertia" values calculated for each number of clusters. Inertia represents the sum of squared distances of samples to their closest cluster center;
- plt.plot(): This function from the matplotlib library (matplotlib.pyplot) is used to create a line plot. It plots the number of clusters on the x-axis against the corresponding inertia values (cs) on the y-axis;
- plt.title(), plt.xlabel(), plt.ylabel(): These functions set the title, x-axis label, and y-axis label of the plot, respectively;
- plt.show(): This function displays the plot.
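The pieces described above could fit together roughly as follows. This is a minimal sketch, not the lesson's reference solution: the synthetic dataset from make_blobs, the n_init and random_state values, and the plot labels are all assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: synthetic 2-D data with 4 blobs stands in for the lesson's dataset.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

cs = []  # will hold the inertia for each number of clusters
for k in range(1, 11):  # 1 to 10 inclusive, since the end of range is exclusive
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    cs.append(kmeans.inertia_)  # sum of squared distances to closest center

# Number of clusters on the x-axis, inertia on the y-axis.
plt.plot(range(1, 11), cs)
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
```

With data like this, the curve would be expected to drop steeply up to the true number of blobs and flatten afterwards; on real data the bend is often less obvious.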
- Evaluate KMeans for each number of clusters from 1 to 10.
- Plot the graph.