k-means clustering algorithm
The k-means clustering algorithm is probably the most widely used clustering algorithm in data science. It is a partitional clustering algorithm: it divides the data space into k partitions, and the points in each partition form a cluster.
Input and output of the k-means clustering algorithm
The input of the k-means clustering algorithm is the data and an integer k representing the expected number of clusters.
The algorithm returns k groups of points from the data. That is, each data point is assigned a cluster ID out of k clusters.
The number of clusters, k, must be smaller than or equal to the number of points, n, in the data. It makes sense, right? We cannot ask for four clusters when we have only three points.
Why is the algorithm called k-means clustering?
The k-means clustering algorithm creates k representatives for the k clusters. To create a representative, the algorithm computes a point at the very middle of the cluster by taking the average of the points in the group. A synonym of the word “average” is “mean”. Since the algorithm deals with k clusters, it computes k averages, or means, of the k groups of points, hence the name k-means.
Each average or mean is also called a centroid.
The concept of centroids
Given a group of points, a centroid is the center of the points. A centroid is computed by taking the average of all the values in each dimension of the data points in the group.
As an example, assume that we have a one-dimensional dataset, in which we have a group of points containing values 5, 10, 6, 9, 10. The centroid of the group will be the average of the points, (5+10+6+9+10)/5=8.
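As a quick sanity check, the same average can be computed in plain Python:

points = [5, 10, 6, 9, 10]
centroid = sum(points) / len(points)  # average of the group
print(centroid)  # prints 8.0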
Now, assume that you have the following group of two-dimensional data points.
| Feature 1 | Feature 2 |
| --- | --- |
| 4 | 5 |
| 6 | 10 |
| 9 | 10 |
| 11 | 14 |
| 6 | 7 |
The centroid of this two-dimensional group of points is computed by taking the average of the values in each dimension, Feature 1 and Feature 2. The centroid is:
[(4+6+9+11+6)/5, (5+10+10+14+7)/5] = [7.2, 9.2]
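The same computation can be done in one line with NumPy, which averages each column (dimension) of the group. The array below holds the five points from the table above:

import numpy as np

points = np.array([[4, 5], [6, 10], [9, 10], [11, 14], [6, 7]])
centroid = points.mean(axis=0)  # average of each dimension
print(centroid)  # prints [7.2 9.2]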
Notice that the centroid of a group of one-dimensional points is also one-dimensional, the centroid of a group of two-dimensional points is two-dimensional, and so forth.
Exercise: What is the centroid of the following points?
| Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 |
| --- | --- | --- | --- | --- |
| 2 | 3 | 1 | 2 | 4 |
| 6 | 4 | 4 | 5 | 7 |
| 9 | 10 | 3 | 1 | 2 |
| 5 | 3 | 6 | 2 | 3 |
| 11 | 8 | 9 | 10 | 11 |
| 1 | 5 | 8 | 6 | 4 |
In the k-means clustering algorithm, the desired k clusters (or groups of points) are represented by k centroids. Ideally, in a good clustering result, every point of a cluster is closer to its assigned centroid than to any of the other k-1 centroids.
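As an illustration of this nearest-centroid rule, the following lines compute which of two made-up centroids is closest to a point:

import numpy as np

point = np.array([6, 7])
centroids = np.array([[7.2, 9.2], [15.0, 16.0]])  # two made-up centroids
distances = np.linalg.norm(centroids - point, axis=1)  # distance to each centroid
print(np.argmin(distances))  # prints 0: the first centroid is the nearest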
The objectives of clustering
There are two objectives in a clustering task.
- Small intracluster distance: The distance between two points in the same cluster is called the intracluster distance, and one clustering objective is to make it as small as possible. Intracluster distance reflects an important property of clustering: a cluster is a group of similar points, and similar points sit close to each other in the data space, resulting in a small intracluster distance.
- Large intercluster distance: Intercluster distance refers to the distance between two points from two different clusters, and it indicates how different the two clusters are. As much as we desire similar points in the same cluster, we want dissimilar points to be far apart. Therefore, a good clustering algorithm should keep a pair of points from two different clusters far away from each other. That is, the second clustering objective is to make the intercluster distance as large as possible. Both distances are illustrated in the short sketch below.
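As a small illustration of the two objectives, consider two made-up clusters; a pair of points within one cluster should be much closer than a pair spanning the two clusters:

import numpy as np

cluster_a = np.array([[5, 6], [6, 4], [7, 3]])        # one made-up cluster
cluster_b = np.array([[15, 16], [16, 16], [17, 18]])  # another made-up cluster

intra = np.linalg.norm(cluster_a[0] - cluster_a[1])  # within the same cluster
inter = np.linalg.norm(cluster_a[0] - cluster_b[0])  # across the two clusters
print(intra, inter)  # intra (about 2.24) is much smaller than inter (about 14.14)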
In summary, a clustering algorithm discovers clusters of points with small intracluster distance and large intercluster distance. Let us discuss the algorithm now.
k-means clustering algorithm
Given the dataset D and the number of clusters k, the k-means clustering algorithm returns the cluster assignments of all the data points in an array of length equal to the number of rows in D. The pseudocode of the k-means algorithm is as follows.
Input: data D, number of clusters k
Output: idx (an array of length equal to the number of rows in D;
        the ith cell of idx tells which cluster the ith data
        point belongs to)
Auxiliary: centroids: a 2D array with k rows (the number of columns
        in centroids is equal to the number of columns in D)
        idxPrev: a 1D array of the same length as idx
kMeans:
    idx = randomly assign clusters to each data point by
          assigning a random integer between 1 and k to
          each cell of idx
    idxPrev = idx
    idxPrev(1) = any number other than idx(1)
    // This is to make sure that idxPrev is
    // different than idx after initialization.
    Repeat until idx == idxPrev
        For each cluster j = 1 to k
            centroids(j) = average of all data points
                           marked as cluster j in idx
        idxPrev = idx
        For each data point i = 1 to n
            compute distances between the ith data point
            in D and all centroids
            idx(i) = index of the nearest centroid
                     to the ith data point
End of kMeans
The algorithm above initializes by assigning a random cluster to each data point. It then keeps improving the assignments based on the computed centroids and the distances between the centroids and the data points. Once an iteration produces no change in the cluster assignments, the algorithm returns to the caller with the clustering result in the variable idx.
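For readers who want to see the pseudocode in action, below is a minimal Python sketch of it using NumPy. The function name k_means is my own; the sketch numbers clusters 0 to k-1 (instead of 1 to k as in the pseudocode) and assumes no cluster becomes empty between iterations.

import numpy as np

def k_means(D, k, rng=None):
    # D: a 2D array with one data point per row; k: number of clusters
    rng = np.random.default_rng() if rng is None else rng
    n = D.shape[0]
    # Randomly assign a cluster (0 to k-1) to each data point
    idx = rng.integers(0, k, size=n)
    # Make idxPrev different from idx so the loop runs at least once
    idxPrev = idx.copy()
    idxPrev[0] = (idx[0] + 1) % k
    while not np.array_equal(idx, idxPrev):
        # Centroid of cluster j = average of the points marked j in idx
        centroids = np.array([D[idx == j].mean(axis=0) for j in range(k)])
        idxPrev = idx.copy()
        # Reassign each point to its nearest centroid
        for i in range(n):
            distances = np.linalg.norm(centroids - D[i], axis=1)
            idx[i] = np.argmin(distances)
    return idx

Calling k_means on the dataset of the next section with k=2 should recover the same two groups as the library version below, although the numeric cluster IDs may differ from run to run.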
Python code of k-means clustering
Luckily, we do not have to code the algorithm ourselves because Python already has a class for the k-means clustering algorithm. The library that provides this class is called scikit-learn, a rich machine learning library for Python.
Consider that we have the following dataset and we want to find two clusters of data points.
| Feature 1 | Feature 2 |
| --- | --- |
| 5 | 6 |
| 15 | 16 |
| 6 | 4 |
| 16 | 16 |
| 17 | 18 |
| 7 | 3 |
| 6.5 | 4.5 |
The Python code for this clustering is provided below. I saved the code in a file named mykmeans.py.
from sklearn.cluster import KMeans
import numpy as np
data = np.array(
    [[5, 6],
     [15, 16],
     [6, 4],
     [16, 16],
     [17, 18],
     [7, 3],
     [6.5, 4.5]]
)
result = KMeans(n_clusters=2).fit(data)
print("Cluster assignments: ")
print(result.labels_)
print("Cluster centroids: ")
print(result.cluster_centers_)
Clearly, I hardcoded the data. The class for k-means in scikit-learn is KMeans, as written in the code above. The fit function receives the data. A parameter, n_clusters, receives the expected number of clusters. In this case, n_clusters=2 indicates that the algorithm will return two clusters.
Here are the shell command and the output of the program.
$ python3 mykmeans.py
Cluster assignments:
[1 0 1 0 0 1 1]
Cluster centroids:
[[16. 16.66666667]
[ 6.125 4.375 ]]
$
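A quick caveat before interpreting the labels: KMeans initializes its centroids randomly, so the cluster IDs 0 and 1 are arbitrary and may be swapped in your own run. If you want reproducible output, KMeans accepts a random_state parameter; a small variation of the call above:

result = KMeans(n_clusters=2, random_state=0).fit(data)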
Notice that the cluster assignments are [1 0 1 0 0 1 1], indicating that:
- Row 1 of the data is in cluster 1.
- Row 2 of the data is in cluster 0.
- Row 3 of the data is in cluster 1.
- Row 4 of the data is in cluster 0.
- Row 5 of the data is in cluster 0.
- Row 6 of the data is in cluster 1.
- Row 7 of the data is in cluster 1.
Now, look closely at the dataset: Rows 1, 3, 6, and 7 have smaller values, and Rows 2, 4, and 5 have larger values. The k-means clustering algorithm has placed the larger values in cluster 0 and the smaller values in cluster 1.
The program also outputs the cluster centroids. Notice that the first row of the centroids has larger values because it is the average of the larger data values in cluster 0. The second row of the centroids represents cluster 1, where the data points have smaller values. As a result, the centroid in the second row is smaller than the one in the first row.
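Once fitted, the KMeans object can also assign brand-new points to the learned clusters through its predict method. The point below is made up for illustration; with the run shown above, it falls into the small-value cluster:

print(result.predict(np.array([[6, 5]])))  # prints [1] for the run shown above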