The expectation-maximization (EM) algorithm is a powerful unsupervised
machine learning tool. Conceptually, it is quite similar to the k-means
algorithm, which I shared in this post.
However, instead of clustering through estimated means alone, it clusters
by estimating the parameters of the component distributions and then evaluating how
likely each observation is to belong to each distribution. Another difference
is that EM uses soft assignment while k-means uses hard assignment.
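To make the difference between soft and hard assignment concrete, here is a tiny illustrative snippet (the responsibilities below are made-up numbers, not taken from any dataset): a soft assignment keeps the membership probabilities, while a hard assignment forces every observation into its single most likely cluster.
# made-up responsibilities of three observations for two clusters
soft_assignment = matrix(c(0.9, 0.1,
                           0.4, 0.6,
                           0.5, 0.5), ncol = 2, byrow = TRUE)
# hard assignment (k-means style): keep only the most likely cluster per observation
hard_assignment = apply(soft_assignment, 1, which.max)
soft_assignment   # EM style: probabilities of belonging to each cluster
hard_assignment   # 1 2 1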
The content includes:
Prerequisites for reading this post:
Main reference:
First, load the packages and the data used in this post.
# add this line for installing the uuml package
# remotes::install_github("MansMeg/IntroML",
# subdir = "rpackage")
# load the package and data
library(ggplot2)
library(uuml)
data("mixture_data")
y = mixture_data
data("iris")
data("faithful")
In the context of a simple two-component mixture model, we would like to model the density of the given data with two normal distributions:

$$Y_1 \sim N(\mu_1, \sigma_1^2), \qquad Y_2 \sim N(\mu_2, \sigma_2^2), \qquad Y = (1 - \Delta)\, Y_1 + \Delta\, Y_2,$$

where $\Delta \in \{0, 1\}$ with $\Pr(\Delta = 1) = \pi$.

Then the density of $Y$ is

$$g_Y(y) = (1 - \pi)\, \phi_{\theta_1}(y) + \pi\, \phi_{\theta_2}(y),$$

where $\phi_{\theta}$ denotes the normal density with parameters $\theta = (\mu, \sigma^2)$.

And the log-likelihood on the $N$ training cases of the data $\mathbf{Z}$ is given by

$$\ell(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \log\left[ (1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i) \right].$$

Direct maximization of $\ell(\theta; \mathbf{Z})$ is not numerically feasible, so we proceed in an iterative way. A more detailed derivation is available in the reference book.

In short, what we do here is estimate the unobserved indicator $\Delta_i$ by its expected value

$$\hat{\gamma}_i(\theta) = \operatorname{E}(\Delta_i \mid \theta, \mathbf{Z}) = \Pr(\Delta_i = 1 \mid \theta, \mathbf{Z}) = \frac{\pi\, \phi_{\theta_2}(y_i)}{(1 - \pi)\, \phi_{\theta_1}(y_i) + \pi\, \phi_{\theta_2}(y_i)},$$

which is also called the responsibility of model 2 for observation $i$.
The EM algorithm consists of three steps:

1. Take initial guesses for the parameters $\theta = (\mu_1, \sigma_1^2, \mu_2, \sigma_2^2, \pi)$.
2. Expectation step: compute the responsibilities $\hat{\gamma}_i$ given the current parameters.
3. Maximization step: update the parameter estimates using the responsibilities.

Steps 2 and 3 are repeated until convergence; a rough skeleton of this loop is sketched just below, and the full implementation is built up in steps (1) to (7).
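In pseudocode (a rough sketch only; the helper functions named here are implemented step by step in the rest of this post):
# rough skeleton of the EM loop (all functions are defined later in this post)
# theta = initial guess of (mu_1, mu_2, sigma_1, sigma_2, pi)
# repeat {
#   gamma = e_uni_two_camp(y, theta)    # E-step: compute responsibilities
#   theta = max_uni_two_comp(y, gamma)  # M-step: update parameter estimates
# } until ll_uni_two_camp(y, theta) stops increasing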
Now, let us proceed and grasp the concepts mentioned above more clearly
with code and data.
(1) Implement a function r_uni_two_com(n, theta) to simulate data from the univariate two-component mixture model described above.
r_uni_two_com = function(n, theta){
  # simulate y1 and y2 according to the equations mentioned above
  y1 = rnorm(n, theta$mu_1, theta$sigma_1)
  y2 = rnorm(n, theta$mu_2, theta$sigma_2)
  # draw the latent indicator delta ~ Bernoulli(pi) for each observation
  delta = rbinom(n, size = 1, prob = theta$pi)
  # mix: take y2 where delta = 1 and y1 where delta = 0
  y = (1 - delta) * y1 + delta * y2
  return(y)
}
(2) Simulate 200 observations using $\mu_1 = -2$, $\mu_2 = 1.5$, $\sigma_1 = 2$, $\sigma_2 = 1$, and $\pi = 0.3$. Visualize the observations in a histogram.
n = 200
theta_0 = list(mu_1 = -2, mu_2 = 1.5, sigma_1 = 2, sigma_2 = 1, pi = 0.3)
set.seed(2020)
simulation_data = r_uni_two_com(n,theta_0)
hist(x = simulation_data, main = "Histogram of Simulated Data from the Mixture Model", xlab = "Mixed Y")
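As a quick sanity check on the simulator (not part of the original workflow, just a rough check), the sample mean should be close to the theoretical mixture mean $(1 - \pi)\mu_1 + \pi\mu_2$:
# theoretical mean of the mixture: (1 - 0.3) * (-2) + 0.3 * 1.5 = -0.95
(1 - theta_0$pi) * theta_0$mu_1 + theta_0$pi * theta_0$mu_2
# sample mean of the simulated data; should be close to -0.95
mean(simulation_data)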
(3) Compute the density value for a given set of parameters and data with a function d_uni_comp(x, theta).
# get the density by using dnorm() function
d_uni_comp = function(x,theta){
density_m1 = dnorm(x,theta$mu_1,theta$sigma_1)
density_m2 = dnorm(x, theta$mu_2,theta$sigma_2)
return(list(density_m1 = density_m1,density_m2 = density_m2 ))
}
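As a small usage example of d_uni_comp (a quick check only), the returned values should simply match dnorm() evaluated with each component's parameters:
# densities of both components at a single point, e.g. y = 0
d_uni_comp(0, theta_0)
# the same values computed directly with dnorm()
dnorm(0, mean = theta_0$mu_1, sd = theta_0$sigma_1)
dnorm(0, mean = theta_0$mu_2, sd = theta_0$sigma_2)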
library(reshape2) # for merging the density plots
# melt the density into one column so that we can plot them in one figure
data<- melt(d_uni_comp(simulation_data,theta_0) )
# combine it with the simulation data
data = cbind(simulation_data,data)
ggplot(data,aes(x=simulation_data,y = value, color=L1)) + geom_line(alpha=0.7)+ theme_minimal() + ggtitle("Simulated Data versus Corresponding Density") +
theme(plot.title = element_text(hjust = 0.5),legend.title = element_blank()) +
coord_cartesian(xlim =c(-4,4))
Model 1 (red line) appears to be centered around -2 and model 2 (blue line) around 1.5, which looks reasonable given the data-generating parameters.
(4) Calculate $\hat{\gamma}_i$ at the expectation step.
Here the function e_uni_two_camp returns a vector with one responsibility $\hat{\gamma}_i$ for each observation in y.
# use the equation above to calculate the responsibility
e_uni_two_camp = function(x,theta){
gamma_i =( theta$pi * d_uni_comp(x,theta)$density_m2 ) /
( (1-theta$pi) * d_uni_comp(x,theta)$density_m1 +
theta$pi * d_uni_comp(x,theta)$density_m2 )
return(gamma_i)
}
# initial guess of parameters
theta_0 = list(mu_1 = 4.12, mu_2 = 0.94, sigma_1 = 2, sigma_2 = 2, pi = 0.5)
gamma = e_uni_two_camp(y,theta_0)
head(gamma)
## [,1]
## [1,] 0.9106339
## [2,] 0.8716861
## [3,] 0.7797225
## [4,] 0.6645640
## [5,] 0.6484311
## [6,] 0.5178799
Initially, $\mu_1 = 4.12$, $\mu_2 = 0.94$, $\sigma_1 = 2$, $\sigma_2 = 2$, and $\pi = 0.5$. According to the definition of the responsibility, the larger the responsibility $\hat{\gamma}_i$ of a given observation, the more likely it is to belong to model 2; a small responsibility means it more likely belongs to model 1.
For example, the responsibilities of the first 6 observations are all larger than 0.5, so they likely belong to model 2. However, we need to keep iterating the process until it converges.
(5) Implement a function called max_uni_two_comp that returns a list with the parameters mu_1, mu_2, sigma_1, sigma_2 and pi at the maximization step.
max_uni_two_comp = function(y, gamma){
  # weighted means: each observation weighted by its responsibility
  mu_1 = sum((1 - gamma) * y) / sum(1 - gamma)
  mu_2 = sum(gamma * y) / sum(gamma)
  # weighted variances around the updated means
  sigma_1 = sum((1 - gamma) * (y - mu_1)^2) / sum(1 - gamma)
  sigma_2 = sum(gamma * (y - mu_2)^2) / sum(gamma)
  # mixing proportion: the average responsibility
  pi = sum(gamma) / length(y)
  return(list(mu_1 = mu_1, mu_2 = mu_2, sigma_1 = sqrt(sigma_1),
              sigma_2 = sqrt(sigma_2), pi = pi))
}
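For reference, these updates are the weighted maximum-likelihood estimates given the current responsibilities, with each observation weighted by how strongly it is attributed to each component (see the reference book for the derivation):

$$\hat{\mu}_1 = \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)\, y_i}{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)}, \qquad \hat{\mu}_2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i\, y_i}{\sum_{i=1}^{N} \hat{\gamma}_i},$$

$$\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)(y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{N} (1 - \hat{\gamma}_i)}, \qquad \hat{\sigma}_2^2 = \frac{\sum_{i=1}^{N} \hat{\gamma}_i (y_i - \hat{\mu}_2)^2}{\sum_{i=1}^{N} \hat{\gamma}_i}, \qquad \hat{\pi} = \frac{\sum_{i=1}^{N} \hat{\gamma}_i}{N}.$$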
theta = max_uni_two_comp(y,gamma)
theta
## $mu_1
## [1] 3.842941
##
## $mu_2
## [1] 1.450413
##
## $sigma_1
## [1] 1.700666
##
## $sigma_2
## [1] 1.47168
##
## $pi
## [1] 0.4883709
These are the estimated parameters after the first iteration.
(6) Now we need to implement the log-likelihood of the model as ll_uni_two_camp.
The log-likelihood is important for evaluating the algorithm, since each EM iteration should not decrease it.
ll_uni_two_camp = function(x,theta){
ll_i =sum( log( ( (1 - theta$pi) * d_uni_comp(x,theta)$density_m1 ) +
(theta$pi * d_uni_comp(x,theta)$density_m2) ) )
return(ll_i)
}
ll_uni_two_camp(y,theta_0)
## [1] -43.1055
(7) Combine the implemented functions into an EM algorithm that takes the data y and an initial value $\theta^{(0)}$ for the parameters.
em_uni_two_comp = function(y,theta){
iter = 3
for (i in 1:iter) {
gamma = e_uni_two_camp(y,theta)
theta = max_uni_two_comp(y,gamma)
ll = ll_uni_two_camp(y,theta)
# print the log-likelihood
cat("Log Lik:",ll,"\n")
}
}
em_uni_two_comp(y,theta_0)
## Log Lik: -41.53247
## Log Lik: -41.11211
## Log Lik: -40.48348
Here the number of iterations is set to 3 and we can see that the log-likelihood is increasing, which is as expected: an EM iteration never decreases the observed-data log-likelihood.
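In practice, instead of hardcoding the number of iterations, one might iterate until the log-likelihood stops improving. Below is a minimal sketch of such a variant that reuses the helper functions above; the function name em_uni_two_comp_tol and the tolerance argument tol are my own additions, not part of the original implementation.
em_uni_two_comp_tol = function(y, theta, tol = 1e-6, max_iter = 100){
  ll_old = -Inf
  for (i in 1:max_iter) {
    gamma = e_uni_two_camp(y, theta)      # E-step: responsibilities
    theta = max_uni_two_comp(y, gamma)    # M-step: parameter updates
    ll = ll_uni_two_camp(y, theta)
    # stop once the improvement in log-likelihood falls below the tolerance
    if (ll - ll_old < tol) break
    ll_old = ll
  }
  return(list(theta = theta, log_lik = ll, iterations = i))
}
# e.g. em_uni_two_comp_tol(y, theta_0)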
Next, test the algorithm on y for up to 20 iterations and look at the estimated $\hat{\pi}$.
for (iter in c(1,5,10,15,20)) {
# initial gamma
gamma = e_uni_two_camp(y,theta_0)
# start the iteration of EM
for (i in 1:iter) {
theta = max_uni_two_comp(y,gamma)
ll = ll_uni_two_camp(y,theta)
gamma = e_uni_two_camp(y,theta)
}
cat("iteration=",iter,";","estimated pi = ",theta$pi, "\n")
}
## iteration= 1 ; estimated pi = 0.4883709
## iteration= 5 ; estimated pi = 0.4981389
## iteration= 10 ; estimated pi = 0.5436594
## iteration= 15 ; estimated pi = 0.5532677
## iteration= 20 ; estimated pi = 0.5544302
Next, we can evaluate the EM algorithm by plotting the change of the log-likelihood over iterations to see if it converges.
estimated_final_ll =c()
iter = 20
# intial gamma
gamma = e_uni_two_camp(y,theta_0)
# start the iteration of EM
for (i in 1:iter) {
theta = max_uni_two_comp(y,gamma)
ll = ll_uni_two_camp(y,theta)
gamma = e_uni_two_camp(y,theta)
estimated_final_ll = cbind(estimated_final_ll,ll)
}
plot(estimated_final_ll[1,],
ylab = "observed Data Log-Likelihood",
xlab = "iteration",
main = "Estimated Log-likelihood over Iterations")
The log-likelihood appears to converge after about 10 iterations, so the algorithm works well on this dataset.
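As a numerical complement to the plot (a small check using the estimated_final_ll object created above), the improvement between consecutive iterations should shrink towards zero:
# change of the log-likelihood between consecutive iterations
diff(estimated_final_ll[1, ])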
Finally, let us apply the algorithm to two real datasets: the eruption times in faithful and the petal lengths in iris. Again the number of iterations is set to 20; the estimated parameters are printed under the code.
erruptions = faithful$eruptions
# initial guess of parameters
theta_erup = list(mu_1 = 1.75, mu_2 = 4.5, sigma_1 = 1, sigma_2 = 1, pi = 0.55)
estimated_final_ll =c()
iter = 20
# initial gamma
gamma = e_uni_two_camp(erruptions,theta_erup )
# start the iteration of EM
for (i in 1:iter) {
theta_faithful = max_uni_two_comp(erruptions,gamma)
ll = ll_uni_two_camp(erruptions,theta_faithful)
gamma = e_uni_two_camp(erruptions,theta_faithful)
}
theta_faithful
## $mu_1
## [1] 4.256629
##
## $mu_2
## [1] 2.05479
##
## $sigma_1
## [1] 0.4899172
##
## $sigma_2
## [1] 0.3383707
##
## $pi
## [1] 0.3491834
pd_length = iris$Petal.Length
estimated_final_ll =c()
# set the initial parameters
theta_iris = list(mu_1 =1, mu_2 = 4, sigma_1 = 1, sigma_2 = 1, pi = 0.5)
iter = 20
# initial gamma
gamma = e_uni_two_camp(pd_length,theta_iris )
# start the iteration of EM
for (i in 1:iter) {
theta_iris = max_uni_two_comp(pd_length,gamma)
ll = ll_uni_two_camp(pd_length,theta_iris)
gamma = e_uni_two_camp(pd_length,theta_iris)
}
theta_iris
## $mu_1
## [1] 4.919392
##
## $mu_2
## [1] 1.502265
##
## $sigma_1
## [1] 0.8159113
##
## $sigma_2
## [1] 0.3266122
##
## $pi
## [1] 0.3398739
data<- melt(d_uni_comp(erruptions,theta_faithful))
# combine it with the eruptions data
data = cbind(erruptions,data)
ggplot(data,aes(x=erruptions,y = value, color=L1)) + geom_line(alpha=0.7)+ theme_minimal() + ggtitle("Density of Eruptions Estimated by the EM Algorithm") +theme(plot.title = element_text(hjust = 0.5),legend.title = element_blank())
*EM estimated density of the faithful data*
data<- melt(d_uni_comp(pd_length,theta_iris))
# combine it with the petal length data
data = cbind(pd_length,data)
ggplot(data,aes(x=pd_length,y = value, color=L1)) + geom_line(alpha=0.7)+ theme_minimal() + ggtitle("Density of Petal Length Estimated by the EM Algorithm") +theme(plot.title = element_text(hjust = 0.5),legend.title = element_blank())
*EM estimated density of the iris data*
These clustering results look reasonable. Great!
This post showed how to derive the basic pieces of the EM algorithm in the two-component mixture model case. In the end, the algorithm gives us a fitted mixture distribution for the given dataset rather than directly telling us which observation belongs to which cluster.
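If hard cluster labels are wanted, they can still be derived from the fitted mixture by thresholding the responsibilities. A small sketch on the faithful eruption times (the 0.5 cutoff is an arbitrary choice for illustration):
# responsibilities of component 2 under the fitted parameters
gamma_faithful = e_uni_two_camp(erruptions, theta_faithful)
# hard assignment: component 2 if the responsibility exceeds 0.5, otherwise component 1
cluster = ifelse(gamma_faithful > 0.5, 2, 1)
table(cluster)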
In the future I will share how to use the EM algorithm in more general settings.
Thank you for reading!