Bolin Wu

Prediction of Children Anaemia Rate by LASSO

2021-03-06 · 8 min read
R penalized model

This post investigates five factors related to anaemia in children, using data collected from the World Health Organization. The method we use is LASSO, a classic penalized regression. We will see how LASSO filters out variables for us, and how its prediction performance compares with that of our baseline model, linear regression.
To implement LASSO in R, the package I used is "glmnet".

Data Description

The variables were downloaded from different sections and countries on the website and then merged manually. The cleaned data set is available here.
Please note that one would usually do a descriptive analysis of all the variables before starting to model. However, since this post focuses on the LASSO implementation, that step is not shown here.

Independent Variables:

  • Health Expenditure as percentage of GDP (%): This factor measures the government's health expenditure as a percentage of GDP. It is a main factor for evaluating the government's influence on the prevalence of anaemia.
  • Antenatal care coverage - at least four visits (%): The percentage of women aged 15–49 with a live birth in a given time period that received antenatal care provided by skilled health personnel at least four times during their pregnancy.
  • Prevalence of low birth weight (%): Low birth weight is defined as a weight at birth of less than 2500 g (5.5 lb).
  • Prevalence of anaemia in pregnant women (%): It measures the prevalence of anaemia among pregnant women.
  • Nursing and midwifery personnel (per 10,000 population): Nurses and midwives include professional nurses, professional midwives, auxiliary nurses, auxiliary midwives, enrolled nurses, enrolled midwives and other associated personnel, such as dental nurses and primary care nurses.

Dependent Variable

  • Prevalence of anaemia in children (%): It measures the prevalence of anaemia among children who are under 6 months old.

All variables above are percentages or rates (per 10,000 population), so no unit of measure is attached to each variable.

LASSO Intuition

LASSO is one of the most popular regression methods that perform variable selection in statistics and machine learning. Unlike linear regression, LASSO produces biased estimates, but it yields a model with smaller variance, which can make it more accurate for prediction.

The following equation is the expression of LASSO regression.

$$\min_{\beta_{j}} \left( \frac{1}{N} \sum_{i=1}^{N} \Big( y_{i} - \sum_{j=1}^{p} x_{ij} \beta_{j} \Big)^{2} + \lambda \sum_{j=1}^{p} |\beta_{j}| \right)$$

From the equation we can see that the objective can be divided into two parts. The first part is the OLS linear regression objective. The part on the right is what we call the penalty term. The bigger λ is, the stronger the restriction on the β coefficients.
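The effect of the penalty term can be made concrete in the special case of a single standardized predictor, where the LASSO solution is obtained by soft-thresholding the OLS estimate. A minimal sketch (the `soft_threshold` helper is illustrative, not part of "glmnet"):

```r
# Soft-thresholding: shrink the OLS coefficient towards zero by lambda,
# and set it exactly to zero once lambda exceeds its absolute value.
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

soft_threshold(1.5, 0.5)  # 1.0: shrunk towards zero
soft_threshold(0.3, 0.5)  # 0.0: excluded from the model
```

This is exactly why LASSO performs variable selection: unlike ridge regression, the absolute-value penalty can push coefficients all the way to zero.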


Fig.1 - Visualization of LASSO in the two-dimension situation

Figure 1 above helps us better understand how LASSO works in the two-dimension scenario. Notice that at the point where the red circle touches the blue square, β1 equals zero, meaning β1 is removed from our model. At the same time, that point gives us the LASSO model with the smallest mean squared error.

Implementation in R

With the help of "glmnet" package in R, we can perform LASSO regression on our data.

# function to calculate mean squared error
# (y is expected to be a one-column data frame, so nrow(y) gives n)
MSE <- function(y, yhat){
  mse <- sum((y - yhat)^2) / nrow(y)
  return(mse)
}
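As a quick sanity check of this helper (repeating its definition so the snippet runs on its own), with y stored as a one-column data frame, the way it is used later:

```r
# Same MSE helper as above; y is a one-column data frame.
MSE <- function(y, yhat){
  mse <- sum((y - yhat)^2) / nrow(y)
  return(mse)
}

y_toy <- data.frame(child_anaemia = c(10, 20, 30))
yhat_toy <- c(12, 18, 30)
MSE(y_toy, yhat_toy)  # (4 + 4 + 0) / 3 = 2.667
```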

# data for following analysis
library(tidyverse)  # for read_csv(), %>% and rename()
library(glmnet)     # for glmnet() and cv.glmnet()

datain <- read_csv("Clean_dataset.csv", col_names = TRUE, na = c("NA")) %>%
  rename(child_anaemia = "Prevalence of anaemia in children",
         pregnant_anaemia = "Prevalence of anaemia in pregnant women",
         low_weight = "Low birth weight prevalence",
         health_expdit = "Health expenditure in GDP",
         nurse_midwf = "Nursing and midwifery personnel (per 10 000 population)",
         antenatal = "Antenatal care coverage",
         breastfed = "Infants breastfed for the first six months"
         )

set.seed(2019)
train_ids = sample(nrow(datain), size = 2/3 * nrow(datain), replace = FALSE)
train = datain[train_ids, 3:8]
test = datain[-train_ids, 3:8]


#---------------------------------------------------------#
#                    LASSO with glmnet                     #
#---------------------------------------------------------#
lasso = glmnet(y = as.matrix(train[, 1]),
               x = as.matrix(train[, 2:6]),
               alpha=1, standardize=T,
               family='gaussian')
lasso
plot(lasso, xvar = "lambda", label = TRUE)

lasso_cv <- cv.glmnet(y = as.matrix(train[,1]),
                      x = as.matrix(train[, 2:6]),
                      alpha=1, standardize=T,
                      family='gaussian')
plot(lasso_cv)

# check coefficient
coef1 <- coef(lasso, s = lasso_cv$lambda.1se)
coef2 <- coef(lasso, s = lasso_cv$lambda.min)

# predict and estimate the MSE on the test dataset
lasso_est <- predict(lasso, newx = as.matrix(test[, 2:6]),
                     s = lasso_cv$lambda.min)
lasso_mse <- MSE(test['child_anaemia'], lasso_est)
lasso_mse

# get the best lambda

lasso_cv$lambda.min
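The code above extracts coefficients at two candidate penalties: lambda.min, which minimizes the cross-validated error, and lambda.1se, the largest λ within one standard error of that minimum, which yields a sparser model. A self-contained illustration on simulated data (not the WHO data):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)  # five predictors
y <- 2 * x[, 1] + rnorm(100)           # only the first one matters

cv <- cv.glmnet(x, y, alpha = 1)       # LASSO with cross-validation
coef(cv, s = "lambda.min")             # typically keeps more variables
coef(cv, s = "lambda.1se")             # typically keeps fewer variables
```

Choosing lambda.1se trades a little cross-validated accuracy for a simpler, more interpretable model; the post uses lambda.min for prediction.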


Fig.2 - Change of MSE with different lambda


Fig.3 - Variable exclusion process of LASSO

From Figure 2, we can see that when log(λ) lies between -1.4 and 0, the model has the smallest mean squared error. Besides, Figure 3 shows the process of variables being removed as log(λ) changes.

Furthermore, cross-validation gives the best λ = 0.255 for our model. The optimal output is listed in the table below.

Variable                           Coefficient
Intercept                          8.902
Anaemia in pregnant women          1.112
Low birth weight                   0
Health expenditure in GDP          0.680
Nursing and midwifery personnel    -0.081
Antenatal care                     -0.164

In the optimal model, the variable "Low birth weight" is removed. The model's mean squared error is 69.345. "Nursing and midwifery personnel" and "Antenatal care" are negatively related to the prevalence of anaemia in children, which is consistent with our expectation: the more health care personnel and antenatal care pregnant women have, the less likely anaemia in children is to occur. However, the positive relationship in this model between "Health expenditure in GDP" and the prevalence of anaemia in children is not what we expected.

Compare with Linear Regression

By using the lm() function in R we can easily fit the linear regression:


#---------------------------------------------------------#
#                    Linear Regression                     #
#---------------------------------------------------------#

lm_model <- lm(formula = child_anaemia ~ ., data = train)
summary(lm_model)

# Predict and estimate the MSE on the test dataset
lm_est <- predict(lm_model, test)
lm_mse <- MSE(test['child_anaemia'], lm_est)
cat("MSE of final linear model:", lm_mse)

# MSE of final linear model: 61.5218


Fig.4 - Linear Regression Results

From Figure 4, we can see that not all the variables are significant.

In terms of explanatory power, linear regression and LASSO have similar R-square values. In terms of prediction accuracy, linear regression achieves the better accuracy on this dataset.

                            Linear Regression    LASSO
R-square on training data   0.8216               0.8229
MSE on test data            61.5218              69.345
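The R-square values in the table can be reproduced with a small helper based on 1 − SSE/SST, applied to each model's predictions on the training data (the `r2` function below is illustrative, not from any package):

```r
# R-squared = 1 - residual sum of squares / total sum of squares
r2 <- function(y, yhat) {
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}

# Toy check: a near-perfect fit gives an R-squared close to 1.
r2(c(1, 2, 3, 4), c(1.1, 1.9, 3.2, 3.8))  # 0.98
```

For the linear model, `summary(lm_model)$r.squared` reports the same quantity directly; glmnet has no such summary, so computing it by hand keeps the two models comparable.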

Conclusion

In our case, linear regression gives better predictions than LASSO regression. The reason could be that the number of independent variables is not large enough for the penalty to pay off.
It is not uncommon for a simple regression to give better results than "fancier" models. Likewise, sometimes a logistic regression can outperform a neural network. The performance largely depends on the data, so it is useful to compare different models on a given dataset.

Thanks for reading!

Prudence is a fountain of life to the prudent.