This post investigates five factors related to anaemia in children, using data collected from the World Health Organization. The method we will use is LASSO, a classic penalized regression. We will see how LASSO selects variables for us and how its prediction performance compares with our baseline model, linear regression.
To implement LASSO in R, I used the "glmnet" package.
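Besides "glmnet", the data-wrangling steps later in the post use readr and dplyr, so it is convenient to load everything up front (a minimal setup sketch; install the packages first if you do not have them):

```r
# install.packages(c("glmnet", "readr", "dplyr"))  # run once if not installed
library(glmnet)  # LASSO / elastic-net regression
library(readr)   # read_csv()
library(dplyr)   # %>% and rename()
```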
The variables were downloaded from different sections of the website for various countries and then merged together manually. The cleaned data set is available here.
Please note that one would usually do a descriptive analysis of all the variables before starting to model. However, since this post focuses on the LASSO implementation, that step is not shown here.
All variables above are percentages or rates (per 10,000 population), so no unit of measurement is attached to them.
LASSO is one of the most popular regression methods that perform variable selection in statistics and machine learning. Unlike linear regression, LASSO regression is biased, but it yields a model with smaller variance, which can make it more accurate for prediction.
The following equation is the expression of LASSO regression:

$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\operatorname{arg\,min}} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

From the equation we can see that the objective can be divided into two parts. The left part is the ordinary least squares (OLS) term from linear regression. The right part, $\lambda \sum_{j=1}^{p} |\beta_j|$, is what we call the penalty term. The bigger $\lambda$ is, the stronger the restriction on the coefficients $\beta_j$ will be.
Fig.1 - Visualization of LASSO in the two-dimension situation
Figure 1 above helps us better understand how LASSO works in the two-dimension scenario. Notice that at the point where the red circle (the contour of the residual sum of squares) touches the blue square (the constraint region $|\beta_1| + |\beta_2| \le t$ implied by the penalty), one coefficient equals zero, which means the corresponding variable is removed from our model. At the same time, that point gives the LASSO solution with the smallest mean squared error attainable within the constraint.
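To see this shrinkage in action before touching the real data, here is a minimal sketch on simulated data (the coefficients 3 and 0.2 below are made up purely for illustration): as $\lambda$ grows, glmnet shrinks the coefficients and sets the weaker one exactly to zero.

```r
set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2)   # two simulated predictors
y <- X %*% c(3, 0.2) + rnorm(100)       # the second predictor has a weak effect
toy_fit <- glmnet(X, y, alpha = 1)      # alpha = 1 selects the LASSO penalty
coef(toy_fit, s = c(0.01, 0.5, 1))      # one column per lambda; the weak
                                        # coefficient is zeroed out first
```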
With the help of the "glmnet" package, we can now perform LASSO regression on our data.
# helper to calculate mean squared error (MSE)
MSE <- function(y, yhat) {
  # y: one-column data frame of observed values; yhat: predicted values
  mse <- sum((y - yhat)^2) / nrow(y)
  return(mse)
}
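As a quick sanity check of the helper (toy numbers, purely illustrative), a perfect prediction should return an MSE of 0:

```r
MSE(data.frame(y = c(1, 2, 3)), c(1, 2, 3))  # [1] 0
```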
# data for following analysis
datain <- read_csv("Clean_dataset.csv", col_names = TRUE, na = c("NA")) %>%
  rename(child_anaemia = "Prevalence of anaemia in children",
         pregnant_anaemia = "Prevalence of anaemia in pregnant women",
         low_weight = "Low birth weight prevalence",
         health_expdit = "Health expenditure in GDP",
         nurse_midwf = "Nursing and midwifery personnel (per 10 000 population)",
         antenatal = "Antenatal care coverage",
         breastfed = "Infants breastfed for the first six months")
set.seed(2019)
# split the data: 2/3 for training, 1/3 for testing
train_ids <- sample(nrow(datain), size = 2/3 * nrow(datain), replace = FALSE)
train <- datain[train_ids, 3:8]   # columns 3:8: the response and five predictors
test  <- datain[-train_ids, 3:8]
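Before fitting, it is worth confirming that the split looks right and that no missing values remain, since glmnet() will error on NAs (a quick check, not part of the original output):

```r
dim(train); dim(test)    # row counts of the 2/3 / 1/3 split
colSums(is.na(train))    # glmnet() cannot handle missing values
```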
#---------------------------------------------------------#
# LASSO with glmnet #
#---------------------------------------------------------#
lasso <- glmnet(y = as.matrix(train[, 1]),    # response: child_anaemia
                x = as.matrix(train[, 2:6]),  # the five predictors
                alpha = 1,                    # alpha = 1 selects the LASSO penalty
                standardize = TRUE,
                family = "gaussian")
lasso
plot(lasso, xvar = "lambda", label = TRUE)
# cross-validate to choose the penalty lambda
lasso_cv <- cv.glmnet(y = as.matrix(train[, 1]),
                      x = as.matrix(train[, 2:6]),
                      alpha = 1, standardize = TRUE,
                      family = "gaussian")
plot(lasso_cv)
# check the coefficients at the two candidate penalties
coef1 <- coef(lasso, s = lasso_cv$lambda.1se)
coef2 <- coef(lasso, s = lasso_cv$lambda.min)
# predict on the test set, so the MSE is comparable with the linear model below
lasso_est <- predict(lasso, newx = as.matrix(test[, 2:6]),
                     s = lasso_cv$lambda.min)
lasso_mse <- MSE(test['child_anaemia'], lasso_est)
lasso_mse
# get the best lambda
lasso_cv$lambda.min
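Note that cv.glmnet() reports two candidate penalties, which is why two coefficient sets were extracted above: lambda.min minimizes the cross-validated error, while lambda.1se is the largest $\lambda$ whose error is within one standard error of that minimum and typically gives a sparser model.

```r
# lambda.min: smallest CV error; lambda.1se: sparser model within one SE
c(min = lasso_cv$lambda.min, one_se = lasso_cv$lambda.1se)
```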
Fig.2 - Change of MSE with different lambda
Fig.3 - Variable exclusion process of LASSO
From Figure 2, we can see that when $\log(\lambda)$ lies between -1.4 and 0, the model has the smallest cross-validated mean squared error. Besides, Figure 3 shows the order in which the variables are removed as $\log(\lambda)$ changes.
Furthermore, cross-validation finds the best $\lambda = 0.255$ for our model. The optimal model's coefficients are listed in the table below.
Variable | Coefficient |
---|---|
Intercept | 8.902 |
Anaemia in pregnant women | 1.112 |
Low birth weight | 0 |
Health expenditure in GDP | 0.680 |
Nursing and midwifery personnel | -0.081 |
Antenatal care | -0.164 |
In the optimal model, the variable "Low birth weight" is removed. The model's mean squared error on the test data is 69.345. "Nursing and midwifery personnel" and "Antenatal care" are negatively related to the occurrence of anaemia in children, which is consistent with our expectation: the more health care personnel and antenatal care pregnant women have, the less likely anaemia in children becomes. However, the positive relationship between "Health expenditure in GDP" and the occurrence of anaemia in children in this model is unexpected.
By using the lm() function in R, we can easily fit the baseline linear regression:
#---------------------------------------------------------#
# Linear Regression #
#---------------------------------------------------------#
lm_model <- lm(formula = child_anaemia ~ ., data = train)
summary(lm_model)
# predict and estimate the MSE on the test dataset
lm_est <- predict(lm_model, test)
lm_mse <- MSE(test['child_anaemia'], lm_est)
cat("MSE of final linear model:", lm_mse)
# MSE of final linear model: 61.5218
Fig.4 - Linear Regression Results
From Figure 4, we can see that not all of the variables are statistically significant.
In terms of explanatory power, linear regression and LASSO have similar R-square values. In terms of prediction accuracy, linear regression does better on this dataset.
| | Linear Regression | LASSO |
|---|---|---|
| R-square on training data | 0.8216 | 0.8229 |
| MSE on test data | 61.5218 | 69.345 |
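For reference, the R-square values in the table can be reproduced as follows: summary() reports it directly for the linear model, while for LASSO we can compute $1 - \text{RSS}/\text{TSS}$ on the training data ourselves (a sketch):

```r
lm_r2 <- summary(lm_model)$r.squared
lasso_fit_train <- predict(lasso, newx = as.matrix(train[, 2:6]),
                           s = lasso_cv$lambda.min)
lasso_r2 <- 1 - sum((train$child_anaemia - lasso_fit_train)^2) /
                sum((train$child_anaemia - mean(train$child_anaemia))^2)
c(linear = lm_r2, lasso = lasso_r2)
```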
In our case, linear regression gives better predictions than LASSO. A likely reason is that the number of independent variables is small: with only five predictors, there is little for the penalty to prune, so the bias LASSO introduces is not repaid by a meaningful reduction in variance.
It is not uncommon for a simple regression to give better results than "fancier" models; likewise, a logistic regression can sometimes outperform a neural network. Performance depends largely on the data, so it is worthwhile to compare several models on a given dataset.
Thanks for reading!