Bolin Wu

The EM Algorithm from Scratch
2021-01-30 · 15 min read
R unsupervised machine learning

The expectation-maximization (EM) algorithm is a powerful unsupervised
machine learning tool. Conceptually, it is quite similar to the k-means
algorithm, which I shared in this post.
However, instead of clustering through estimated means alone, it clusters
by estimating the parameters of the component distributions and then
evaluating how likely each observation is to belong to each distribution.
Another difference is that EM uses soft assignment while k-means uses
hard assignment.

The content includes:

  1. The procedure of the EM algorithm in the two-component mixture model context.
  2. How to apply the different parts of the algorithm step by step on simulated data.
  3. Testing the algorithm on the data of Hastie et al. (2009), as well as on the built-in faithful and iris datasets.
  4. Algorithm evaluation.

Prerequisites for reading this post:

  • Basic knowledge of mixture models, the multivariate normal distribution, and maximum likelihood.
  • R programming.

Main reference: Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd edition. Springer.

Loading R packages and data

Loading the packages and the data that are used in this post.

# add this line for installing the uuml package
# remotes::install_github("MansMeg/IntroML",
#                        subdir = "rpackage")

# load the package and data
library(ggplot2)
library(uuml)
data("mixture_data")
y =  mixture_data
data("iris")
data("faithful")

Algorithm Procedure

In the context of a simple two-component mixture model, we would like to model the density of the given data by two normal distributions:

$$
\begin{aligned}
Y_{1} & \sim \mathrm{N}(\mu_{1},\sigma_{1}^{2}) \\
Y_{2} & \sim \mathrm{N}(\mu_{2},\sigma_{2}^{2}) \\
Y &= (1 - \Delta)Y_{1} + \Delta Y_{2}
\end{aligned}
$$

where $\Delta \in \{0, 1\}$ with $\Pr(\Delta = 1) = \pi$.

Then the density of Y is

$$
g_{Y}(y) = (1 - \pi)\phi_{\theta_{1}}(y) + \pi\phi_{\theta_{2}}(y)
$$

where $\phi_{\theta}(x)$ denotes the normal density with parameters $\theta = (\mu, \sigma^{2})$.
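
As a quick side note, the mixture density above is easy to evaluate directly with dnorm(). The small helper below, g_y, is only an illustrative sketch of the formula (the parameter values in the example call are arbitrary) and is not used later in the post.

# sketch: evaluate the two-component mixture density g_Y(y) at a few points
g_y = function(y, mu_1, sigma_1, mu_2, sigma_2, pi){
  (1 - pi) * dnorm(y, mu_1, sigma_1) + pi * dnorm(y, mu_2, sigma_2)
}
g_y(c(-2, 0, 1.5), mu_1 = -2, sigma_1 = 2, mu_2 = 1.5, sigma_2 = 1, pi = 0.3)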

And the log-likelihood based on the $N$ training cases of the data $\mathbf{Z}$ is given by

$$
\ell(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \log\left[ (1 - \pi)\phi_{\theta_{1}}(y_{i}) + \pi\phi_{\theta_{2}}(y_{i}) \right]
$$

Direct maximization of $\ell(\theta; \mathbf{Z})$ is numerically difficult, so we proceed iteratively. A more detailed derivation is available in the reference book.
In short, we work with the expected value of the latent indicator $\Delta_{i}$ given the data and the current parameters,

$$
\gamma_{i}(\theta) = \Pr(\Delta_{i} = 1 \mid \theta, \mathbf{Z})
$$

$\gamma_{i}(\theta)$ is also called the responsibility of model 2 for observation $i$.

The EM algorithm consists of the following steps:

  1. Take initial guesses for the parameters $\hat{\mu}_{1}$, $\hat{\sigma}_{1}^{2}$, $\hat{\mu}_{2}$, $\hat{\sigma}_{2}^{2}$, $\hat{\pi}$.
  2. Expectation Step: compute the responsibilities

$$
\hat{\gamma}_{i} = \frac{\hat{\pi}\,\phi_{\hat{\theta}_{2}}(y_{i})}{(1 - \hat{\pi})\,\phi_{\hat{\theta}_{1}}(y_{i}) + \hat{\pi}\,\phi_{\hat{\theta}_{2}}(y_{i})}, \quad i = 1, 2, \ldots, N
$$

  3. Maximization Step: update the means, the variances, and the mixing proportion:

$$
\begin{aligned}
\hat{\mu}_{1} &= \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_{i})\,y_{i}}{\sum_{i=1}^{N} (1 - \hat{\gamma}_{i})} \\
\hat{\mu}_{2} &= \frac{\sum_{i=1}^{N} \hat{\gamma}_{i}\,y_{i}}{\sum_{i=1}^{N} \hat{\gamma}_{i}} \\
\hat{\sigma}_{1}^{2} &= \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_{i})\,(y_{i} - \hat{\mu}_{1})^{2}}{\sum_{i=1}^{N} (1 - \hat{\gamma}_{i})} \\
\hat{\sigma}_{2}^{2} &= \frac{\sum_{i=1}^{N} \hat{\gamma}_{i}\,(y_{i} - \hat{\mu}_{2})^{2}}{\sum_{i=1}^{N} \hat{\gamma}_{i}} \\
\hat{\pi} &= \frac{\sum_{i=1}^{N} \hat{\gamma}_{i}}{N}
\end{aligned}
$$

  4. Iterate steps 2 and 3 until convergence.

The Mixture Model from Scratch

Now, let us proceed and grasp the concepts mentioned above more clearly
with code and data.

(1) Implement a function to simulate data from the univariate two-component mixture model described above.

r_uni_two_com = function(n, theta){
  # simulate y1 and y2 according to the component distributions above
  y1 = rnorm(n, theta$mu_1, theta$sigma_1)
  y2 = rnorm(n, theta$mu_2, theta$sigma_2)
  # draw the latent indicator Delta ~ Bernoulli(pi) and mix the components
  delta = rbinom(n, 1, theta$pi)
  y = (1 - delta) * y1 + delta * y2
  return(y)
}

(2) Simulate 200 observations using $\mu_1 = -2$, $\mu_2 = 1.5$, $\sigma_1 = 2$, $\sigma_2 = 1$ and $\pi = 0.3$. Visualize the observations in a histogram.

n = 200
theta_0 = list(mu_1 = -2, mu_2 = 1.5, sigma_1 = 2, sigma_2 = 1, pi = 0.3)
set.seed(2020)
simulation_data = r_uni_two_com(n,theta_0)
hist(x = simulation_data, main = "Histogram of Simulated Data from the Mixture Model", xlab = "Mixed Y")

(3) Compute the density value of each component for a given set of parameters and data with a function d_uni_comp(x, theta).

# get the density by using dnorm() function
d_uni_comp = function(x,theta){
  density_m1 = dnorm(x,theta$mu_1,theta$sigma_1)
  density_m2 = dnorm(x, theta$mu_2,theta$sigma_2)
  return(list(density_m1 = density_m1,density_m2 = density_m2 ))
}

library(reshape2) # for reshaping the densities into long format
# melt the densities into one column so that we can plot them in one figure
data <- melt(d_uni_comp(simulation_data, theta_0))
# combine it with the simulation data
data = cbind(simulation_data, data)
ggplot(data, aes(x = simulation_data, y = value, color = L1)) + geom_line(alpha = 0.7) + theme_minimal() +
  ggtitle("Simulated Data versus Corresponding Density") +
  theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank()) +
  coord_cartesian(xlim = c(-4, 4))

Model 1 (red line) seems to be centered around -2 and model 2 (blue line) around 1.5. This looks reasonable given the data generation process.

(4) Calculate $\gamma$ at the expectation step.

Here the function e_uni_two_comp(X, theta) returns a vector with the value of gamma for each row in X.

# use the E-step equation above to calculate the responsibility
e_uni_two_comp = function(x, theta){
  gamma_i = ( theta$pi * d_uni_comp(x, theta)$density_m2 ) /
    ( (1 - theta$pi) * d_uni_comp(x, theta)$density_m1 +
        theta$pi * d_uni_comp(x, theta)$density_m2 )
  return(gamma_i)
}

# initial guess of parameters
theta_0 = list(mu_1 = 4.12, mu_2 = 0.94, sigma_1 = 2, sigma_2 = 2, pi = 0.5)
gamma = e_uni_two_comp(y, theta_0)
head(gamma)
##           [,1]
## [1,] 0.9106339
## [2,] 0.8716861
## [3,] 0.7797225
## [4,] 0.6645640
## [5,] 0.6484311
## [6,] 0.5178799

Initially, $\mu_1 = 4.12$, $\mu_2 = 0.94$, $\sigma_1 = 2$, $\sigma_2 = 2$. According to the definition of responsibility, the larger the responsibility ($\gamma$) of a given observation, the more likely it belongs to model 2; otherwise it is more likely to belong to model 1.
For example, the responsibilities of the first six observations are all greater than 0.5, so they likely belong to model 2. However, we need to keep iterating the process until it converges.
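
To connect this back to the soft versus hard assignment point from the introduction: the responsibilities themselves are the soft assignments, and a hard, k-means-style clustering can be obtained by thresholding them. A minimal sketch using the gamma computed above:

# soft assignment: the responsibility itself, a weight between 0 and 1
# hard assignment: pick the component with the larger responsibility
hard_label = ifelse(gamma > 0.5, "model 2", "model 1")
table(hard_label)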

(5) Implement a function called max_uni_two_comp(X, gamma) that returns a list with the parameters mu_1, mu_2, sigma_1, sigma_2 and pi at the maximization step.

max_uni_two_comp = function(y, gamma){
  # means and variances weighted by the responsibilities
  mu_1 = sum( (1 - gamma) * y ) / sum( 1 - gamma )
  mu_2 = sum( gamma * y ) / sum( gamma )
  sigma_1 = sum( (1 - gamma) * ((y - mu_1)^2) ) / sum( 1 - gamma )
  sigma_2 = sum( gamma * (y - mu_2)^2 ) / sum( gamma )
  pi = sum(gamma) / length(y)
  # return the standard deviations (square roots of the variances)
  return(list(mu_1 = mu_1, mu_2 = mu_2, sigma_1 = sqrt(sigma_1),
              sigma_2 = sqrt(sigma_2), pi = pi))
}

theta = max_uni_two_comp(y,gamma)

theta
    ## $mu_1
    ## [1] 3.842941
    ##
    ## $mu_2
    ## [1] 1.450413
    ##
    ## $sigma_1
    ## [1] 1.700666
    ##
    ## $sigma_2
    ## [1] 1.47168
    ##
    ## $pi
    ## [1] 0.4883709

These are the estimated parameters after the first iteration.

(6) Now we need to implement the log-likelihood of the model as ll_uni_two_comp(x, theta).
The log-likelihood is important when evaluating the algorithm.

ll_uni_two_comp = function(x, theta){
  # sum of the log mixture density over all observations
  ll_i = sum( log( ( (1 - theta$pi) * d_uni_comp(x, theta)$density_m1 ) +
                     ( theta$pi * d_uni_comp(x, theta)$density_m2 ) ) )
  return(ll_i)
}

ll_uni_two_comp(y, theta_0)
    ## [1] -43.1055

(7) Combine the implemented functions into an EM algorithm em_uni_two_comp(X, theta_0, iter) that takes an $n \times p$ data matrix X, an initialization value theta_0 and the number of iterations iter.

em_uni_two_comp = function(y, theta, iter = 3){
  for (i in 1:iter) {
    # E-step: compute the responsibilities
    gamma = e_uni_two_comp(y, theta)
    # M-step: update the parameters
    theta = max_uni_two_comp(y, gamma)
    ll = ll_uni_two_comp(y, theta)

    # print the log-likelihood
    cat("Log Lik:", ll, "\n")
  }
}

em_uni_two_comp(y,theta_0)
    ## Log Lik: -41.53247
    ## Log Lik: -41.11211
    ## Log Lik: -40.48348

Here the number of iterations is set to 3 and we can see that the log-likelihood increases at every step, which is what we expect from EM.

Test the algorithm on y for up to 20 iterations and look at the estimated $\hat{\pi}$.

for (iter in c(1, 5, 10, 15, 20)) {
  # initial gamma
  gamma = e_uni_two_comp(y, theta_0)
  # start the EM iterations
  for (i in 1:iter) {
    theta = max_uni_two_comp(y, gamma)
    ll = ll_uni_two_comp(y, theta)
    gamma = e_uni_two_comp(y, theta)
  }
  cat("iteration=", iter, ";", "estimated pi = ", theta$pi, "\n")
}
    ## iteration= 1 ; estimated pi =  0.4883709
    ## iteration= 5 ; estimated pi =  0.4981389
    ## iteration= 10 ; estimated pi =  0.5436594
    ## iteration= 15 ; estimated pi =  0.5532677
    ## iteration= 20 ; estimated pi =  0.5544302

Evaluation

Next we can evaluate the EM algorithm by plotting the change of the log-likelihood to see if it converges.

estimated_final_ll = c()
iter = 20
# initial gamma
gamma = e_uni_two_comp(y, theta_0)
# start the EM iterations and store the log-likelihood at each step
for (i in 1:iter) {
  theta = max_uni_two_comp(y, gamma)
  ll = ll_uni_two_comp(y, theta)
  gamma = e_uni_two_comp(y, theta)
  estimated_final_ll = cbind(estimated_final_ll, ll)
}


plot(estimated_final_ll[1,],
     ylab = "Observed Data Log-Likelihood",
     xlab = "Iteration",
     main = "Estimated Log-Likelihood over Iterations")

The log-likelihood appears to have converged after about 10 iterations, so the algorithm works well on this dataset.
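
Instead of fixing the number of iterations beforehand, we can also stop automatically once the log-likelihood barely changes between iterations. The sketch below (the tolerance and maximum number of iterations are arbitrary choices, not part of the original exercise) reuses the functions defined above:

# run EM until the log-likelihood improves by less than a small tolerance
tol = 1e-6
max_iter = 100
gamma = e_uni_two_comp(y, theta_0)
ll_old = -Inf
for (i in 1:max_iter) {
  theta = max_uni_two_comp(y, gamma)
  ll = ll_uni_two_comp(y, theta)
  gamma = e_uni_two_comp(y, theta)
  if (abs(ll - ll_old) < tol) {
    cat("Converged after", i, "iterations, log-lik =", ll, "\n")
    break
  }
  ll_old = ll
}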

Run the EM algorithm on the other datasets.

Here I set the iteration to be 20. The estimated parameters will be
printed under the code.

Eruptions variable of the faithful dataset:

eruptions = faithful$eruptions
# initial guess of parameters
theta_erup = list(mu_1 = 1.75, mu_2 = 4.5, sigma_1 = 1, sigma_2 = 1, pi = 0.55)

iter = 20
# initial gamma
gamma = e_uni_two_comp(eruptions, theta_erup)
# start the EM iterations
for (i in 1:iter) {
  theta_faithful = max_uni_two_comp(eruptions, gamma)
  ll = ll_uni_two_comp(eruptions, theta_faithful)
  gamma = e_uni_two_comp(eruptions, theta_faithful)
}
theta_faithful
    ## $mu_1
    ## [1] 4.256629
    ##
    ## $mu_2
    ## [1] 2.05479
    ##
    ## $sigma_1
    ## [1] 0.4899172
    ##
    ## $sigma_2
    ## [1] 0.3383707
    ##
    ## $pi
    ## [1] 0.3491834

Petal.Length variable of the iris dataset:

pd_length = iris$Petal.Length
# set the initial parameters
theta_iris = list(mu_1 = 1, mu_2 = 4, sigma_1 = 1, sigma_2 = 1, pi = 0.5)
iter = 20
# initial gamma
gamma = e_uni_two_comp(pd_length, theta_iris)
# start the EM iterations
for (i in 1:iter) {
  theta_iris = max_uni_two_comp(pd_length, gamma)
  ll = ll_uni_two_comp(pd_length, theta_iris)
  gamma = e_uni_two_comp(pd_length, theta_iris)
}
theta_iris
## $mu_1
## [1] 4.919392
##
## $mu_2
## [1] 1.502265
##
## $sigma_1
## [1] 0.8159113
##
## $sigma_2
## [1] 0.3266122
##
## $pi
## [1] 0.3398739

Visualize the densities for the two datasets using the parameters estimated with the EM algorithm.

First, visualize the faithful dataset:

data <- melt(d_uni_comp(eruptions, theta_faithful))
# combine it with the eruptions data
data = cbind(eruptions, data)

ggplot(data, aes(x = eruptions, y = value, color = L1)) + geom_line(alpha = 0.7) + theme_minimal() +
  ggtitle("Density of Eruptions Estimated by the EM Algorithm") +
  theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())


*EM estimated density of faithful data*
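
Another informal check, not part of the original exercise, is to overlay the fitted mixture density on a histogram of the eruptions data; if the fit is sensible, the curve should roughly follow the histogram. A sketch:

# overlay the fitted two-component mixture density on the eruptions histogram
hist(eruptions, freq = FALSE, breaks = 20,
     main = "Eruptions with Fitted Mixture Density", xlab = "Eruption time (minutes)")
curve((1 - theta_faithful$pi) * dnorm(x, theta_faithful$mu_1, theta_faithful$sigma_1) +
        theta_faithful$pi * dnorm(x, theta_faithful$mu_2, theta_faithful$sigma_2),
      add = TRUE, lwd = 2)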

Then, visualize the iris dataset:

data <- melt(d_uni_comp(pd_length, theta_iris))
# combine it with the petal length data
data = cbind(pd_length, data)

ggplot(data, aes(x = pd_length, y = value, color = L1)) + geom_line(alpha = 0.7) + theme_minimal() +
  ggtitle("Density of Petal Length Estimated by the EM Algorithm") +
  theme(plot.title = element_text(hjust = 0.5), legend.title = element_blank())


*EM estimated density of iris data*
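
Since the iris data come with known species labels, one extra sanity check (again not part of the original exercise) is to turn the final responsibilities into hard labels and cross-tabulate them against iris$Species; setosa has clearly shorter petals, so it should end up dominating one of the two components. A sketch:

# hard-assign each flower to the component with the larger responsibility
gamma_iris = e_uni_two_comp(pd_length, theta_iris)
hard_label_iris = ifelse(gamma_iris > 0.5, "component 2", "component 1")
table(hard_label_iris, iris$Species)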

These clustering results look reasonable. Great!

Conclusion

This post showed how to derive and implement the basic pieces of the EM algorithm in the two-component mixture model case. We can see that, in the end, the algorithm gives us a fitted mixture distribution for the given dataset rather than directly telling us which observations belong to which cluster.
In the future I will share how to use the EM algorithm in more general settings.

Thank you for reading!

Prudence is a fountain of life to the prudent.