Bolin Wu

K-means Clustering Algorithm from Scratch

2021-01-23 · 10 min read
R unsupervised machine learning

The k-means algorithm is a well-known unsupervised machine learning algorithm. According to The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman, the biggest difference between supervised and unsupervised learning is that the data for supervised learning are labeled while the data for unsupervised learning are unlabeled. Statistically speaking, in unsupervised learning we would like to explore the properties of the joint density P(X), where X is a p-vector.
In general, unsupervised learning is a useful tool for exploring the structure of the data and for clustering.

In this blog, I will show the intuition, implementation and evaluation of the k-means algorithm.

Reference of this blog:

  • Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.

Prerequisites for reading this blog:

  • It is recommended to read the reference above if you have not encountered the k-means algorithm before.
  • Basic knowledge of R programming (function and loop).

The k-means Algorithm

Basically, k-means clustering starts with a guess for the K cluster centres and then loops over the following two steps until convergence (a quick reference call to the built-in implementation is shown after this list):

  1. Assign each data point to the closest cluster centre, where the distance used is the Euclidean distance.
  2. Update each cluster centre to the coordinate-wise average of all data points that are closest to it.
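
Before writing our own version, it is worth noting that base R already ships an implementation in stats::kmeans(). The short call below is only a reference point (Lloyd's algorithm corresponds to the two steps above) and is not used in the rest of the post, but it can serve as a sanity check for our implementation later on:

# base R reference implementation; Lloyd's algorithm corresponds to the
# two steps listed above
ref_fit = kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3,
                 algorithm = "Lloyd", iter.max = 100, nstart = 10)
ref_fit$tot.withinss # total within-cluster sum of squares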

Data

To facilitate replication, we will use the classic "iris" dataset. We will only use two of its variables, Petal.Length and Petal.Width, in order to have a two-dimensional visualization.
Load the data and visualize it as a scatterplot of the two variables:

library(ggplot2)
data("iris")
ggplot(data=iris,aes(x=Petal.Width,y=Petal.Length,color = Species)) +
    geom_point() +
    theme_minimal()


Fig.1 Data visualization

Implementation

Now we are going to implement the different parts of the k-means algorithm by following Algorithm 14.1 in Hastie et al. (2009).
First step: We will implement a function called compute_cluster_means(X, C). X is the n × p original data matrix, where n is the number of observations and p is the number of variables. C is an n × 1 vector of cluster assignments. The output is a K × p matrix of cluster centres, where K is the total number of clusters (in the implementation below, an extra column carrying the cluster label comes along as well).
We could let the number of clusters be a general K and loop over the K clusters. However, to keep the code easy to follow, let us fix the number of clusters at 3:


library(dplyr)

compute_cluster_means = function(X, C){
  X = cbind(X, C)
  # transform to a data frame so that we can use dplyr's filter()
  K1 = colMeans(as.data.frame(X) %>% filter(C == 1))
  K2 = colMeans(as.data.frame(X) %>% filter(C == 2))
  K3 = colMeans(as.data.frame(X) %>% filter(C == 3))
  return(rbind(K1, K2, K3))
}

# set up the X and C
X = as.matrix(iris[,3:4])
# for the first step, assign random clusters, store in C
C = sample(1:3, nrow(X), replace = TRUE)

m = compute_cluster_means(X, C)
m

   Petal.Length Petal.Width C
K1     3.744444    1.244444 1
K2     3.573469    1.095918 2
K3     3.965957    1.255319 3

Second step: Assign each observation in the original data to a cluster based on the Euclidean distance to the cluster centres.
A function called compute_cluster_encoding(X, m) will be implemented, where m is the K × p matrix of cluster means calculated above. An n × 1 vector of cluster assignments will be returned.


compute_cluster_encoding = function(X, m){
  K = 3 # number of clusters
  C = rep(NA, nrow(X)) # store the cluster assignments
  for (i in 1:nrow(X)) {
    c_dist = c() # store the distance to each cluster centre
    for (k in 1:K) {
      # squared Euclidean distance: the dissimilarity used by k-means
      d = sum((X[i, ] - m[k, 1:2])^2)
      c_dist = append(c_dist, d)
    }
    # assign the cluster whose centre is closest
    C[i] = which.min(c_dist)
  }
  return(C)
}

C = compute_cluster_encoding(X, m)
# check the first 10 observations' clusters
C[1:10]

[1] 2 2 2 2 2 2 2 2 2 2

Third step: Iterate the process by alternating the two functions defined above until no cluster assignment changes.
Note: the initial cluster assignment is randomized.


update_cluster = function(C_update, X){
  C = rep(0, length = length(C_update))
  # run until the assignment vector C no longer changes
  while (any(C != C_update)) {
    # let C be the current assignment (the random initialization in the first iteration)
    C = C_update
    # compute the cluster means for the current assignment
    m = compute_cluster_means(X, C)
    # update the assignments
    C_update = compute_cluster_encoding(X, m)
  }
  return(C)
}


set.seed(4711)
C_update <- sample(1:3, nrow(X), replace = TRUE)
head(n = 60, update_cluster(C_update,X))

[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3

Evaluation

Now we have finished the main parts of the k-means algorithm.
However, as you may have noticed, the initial cluster assignment is randomized, so in theory we may get different final clustering results for different initial assignments. To evaluate the performance of the algorithm, we will use the within-point scatter as defined in Eq. 14.31 in Hastie et al. (2009).

The basic idea of the within-point scatter is to first find the mean vector of each cluster and then sum up the squared Euclidean distances between the observations in each cluster and that mean.
This is the criterion used to choose the best k-means clustering.

W(C) = \sum_{k=1}^{K} N_{k} \sum_{C(i) = k} \lVert x_{i} - \bar{x}_{k} \rVert^{2}

where \bar{x}_{k} is the mean vector associated with the kth cluster and N_{k} is the number of observations in cluster k.

k_means_W = function(X, C){
  m = compute_cluster_means(X, C)
  # split the data into clusters according to the assignment C
  X = cbind(X, C)
  G1 = as.data.frame(X) %>% filter(C == 1)
  G2 = as.data.frame(X) %>% filter(C == 2)
  G3 = as.data.frame(X) %>% filter(C == 3)

  # squared Euclidean distance of each observation to its cluster mean:
  # sweep() subtracts the cluster mean m[i, ] from every row of Gi (i = 1, 2, 3),
  # and [, -3] drops the cluster-label column before squaring
  dist_g1 = (sweep(G1, 2, m[1, ])[, -3])^2
  dist_g2 = (sweep(G2, 2, m[2, ])[, -3])^2
  dist_g3 = (sweep(G3, 2, m[3, ])[, -3])^2

  # sum the squared distances over all clusters
  # (the N_k weights in Eq. 14.31 are not applied here, so this is the
  # plain within-cluster sum of squares)
  return(sum(dist_g1, dist_g2, dist_g3))
}
# get the within-point scatter
k_means_W(X, C)

88.72578

Next, we can run the k-means algorithm a few times and see which run gives the smallest within-point scatter:

# number of clusters
K = 3
# the third and fourth columns are Petal.Length and Petal.Width
X = as.matrix(iris[,3:4])
loop = 10
# store the within-point scatter
W_loop_iris = rep(0,length = loop)
# store the final cluster result
C_loop_iris = matrix(NA,nrow = nrow(X),ncol = loop)
for (i in 1:loop) {
  C_update <- sample(1:K, nrow(X), replace = TRUE)
  C_final = update_cluster(C_update, X)
  W_loop_iris[i] = k_means_W(X, C_final)
  C_loop_iris[, i] = C_final
}

W_loop_iris
head(C_loop_iris)

[1] 31.37136 31.37136 31.37136 31.37136 31.37136 31.37136 31.37136 31.37136
 [9] 31.37136 31.37136
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    1    1    1    1    3    3    2     2
[2,]    1    2    1    1    1    1    3    3    2     2
[3,]    1    2    1    1    1    1    3    3    2     2
[4,]    1    2    1    1    1    1    3    3    2     2
[5,]    1    2    1    1    1    1    3    3    2     2
[6,]    1    2    1    1    1    1    3    3    2     2

The cluster labels differ across runs, but the within-point scatters are identical.
I think the main reason is that the iris dataset has a clear structure by itself, as we can see in Figure 1, so the k-means algorithm ends up with the same partition, and hence the same within-point scatter, for every initial assignment; only the cluster labels are permuted. If the structure were less clear, we might end up with different partitions and thus different within-point scatters. Another reason is that the number of clusters is only 3; if we increase the number of clusters, the final within-point scatters may vary across runs as well.
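
One quick way to verify that the partitions really are identical up to a relabelling is to cross-tabulate the assignments from two runs. Assuming C_loop_iris from the loop above, every row and every column of the table should contain exactly one non-zero cell:

# cross-tabulate the cluster labels of the first two runs; a single non-zero
# cell per row and column means the two partitions agree up to a permutation
# of the cluster labels
table(run1 = C_loop_iris[, 1], run2 = C_loop_iris[, 2])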

Finally, let us visualize the final cluster result:

# get the final clustering (first column)
C_final_iris = C_loop_iris[, 1]
iris_visual = cbind(iris[, 3:4], C_final_iris)
iris_visual[, "C_final_iris"] = factor(iris_visual[, "C_final_iris"])

ggplot(data = iris_visual, aes(x = Petal.Width, y = Petal.Length, color = C_final_iris)) +
  geom_point() +
  theme_minimal() +
  ggtitle("First Final K-means Clustering Result of iris") +
  theme(plot.title = element_text(hjust = 0.5))


Fig.2 K-means Clustering Result

It is quite similar to Fig. 1, which indicates that the clustering is reasonable.
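
For a slightly more quantitative check (assuming the objects defined above), we can cross-tabulate the final clusters against the true species labels; if the clustering matches Fig. 1 well, most observations of each species should fall into a single cluster:

# compare the final clusters with the true species labels; counts concentrated
# in one cell per row suggest a reasonable clustering
table(Species = iris$Species, Cluster = C_final_iris)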

Potential improvements

  • Use a dataset that does not have such a clear structure, so that we can see more variation in the clustering results.
  • When writing the algorithm, set K as a parameter and loop over the clusters instead of repeating the code for each cluster (a sketch of this is given below). Try different values of K and see how the results differ.
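
As a rough sketch of the second point, here is one way compute_cluster_means could be generalised to an arbitrary K, assuming the same conventions as above (X is the n × p data matrix, C the n × 1 assignment vector). Unlike the three-cluster version, it returns only the p feature means per cluster, without the extra label column; compute_cluster_encoding and k_means_W could be generalised in the same spirit:

# a sketch of compute_cluster_means for an arbitrary number of clusters K:
# for each cluster k, take the column means of the rows assigned to it
compute_cluster_means_K = function(X, C, K){
  m = matrix(NA, nrow = K, ncol = ncol(X))
  colnames(m) = colnames(X)
  for (k in 1:K) {
    m[k, ] = colMeans(X[C == k, , drop = FALSE])
  }
  return(m)
}

# example usage with K = 3 on the two iris variables used above
compute_cluster_means_K(X, C, K = 3)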

Thank you for reading!

Prudence is a fountain of life to the prudent.