Bolin Wu

Regression Tree, Random Forest and XGBoost Algorithm

2021-02-06 · 21 min read
R tree-based methods

Tree-based methods are conceptually easy to grasp and offer practical advantages such as easy visualization and light data pre-processing. They are a powerful tool for both numeric and categorical prediction. In this post I will show how to predict baseball players' salaries with decision trees and random forests, from coding the algorithm by hand to using the packages.

Content:

  1. Showing the algorithm of a decision tree in R, including tree splitting, tree growing, bagging, prediction, etc.
  2. Running Random Forest and XGBoost with the help of the randomForest and xgboost packages

Reference: Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL), Springer.

Regression Tree

Data Pre-processing

First, we need to load the dataset Hitters provided in the referenced uuml package. This dataset contains the salaries of many baseball players together with their performance statistics.
Here we assume that we only care about three columns: "Years", "Hits" and "Salary". Missing values are handled by listwise deletion, and we set aside the first 30 observations as the test set and use the rest as the training set.

# install relevant packages
# remotes::install_github("MansMeg/IntroML",subdir = "rpackage")
# install.packages("randomForest")
# install.packages("xgboost")

# load the packages
library(xgboost)
library(randomForest)
# load the data
library(uuml)
data("Hitters")
# In this task we only care about three columns: "Years", "Hits" and "Salary",
# and we need to exclude the NA values,
# so we pre-process the data accordingly
Hitters = Hitters[,c("Years", "Hits","Salary")]

# get rid of NA
Hitters <- Hitters[complete.cases(Hitters),]

# set aside test set and training set
X_test <- Hitters[1:30, c("Years", "Hits")]
y_test <- Hitters[1:30, c("Salary")]
X_train <- Hitters[31:nrow(Hitters), c("Years", "Hits")]
y_train <- Hitters[31:nrow(Hitters), c("Salary")]

Tree Splitting

Now let us see how to do the first split. Note that this step does not yet involve growing the full tree.

Concept

The algorithm we will use is from the referenced book The Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL), p. 307. We are seeking the splitting variable j and split point s that solve:

\begin{aligned} \min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big] \end{aligned}

where, for any choice of j and s, the inner minimization is solved by:

\begin{aligned} \hat{c}_1 &= \text{ave}(y_i \mid x_i \in R_1(j,s)) \\ \hat{c}_2 &= \text{ave}(y_i \mid x_i \in R_2(j,s)) \end{aligned}
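
Once the regions are fixed, the tree simply predicts with the mean response of the region a new observation falls into; this region mean is exactly the gamma that appears later in the tree-growing code:

\begin{aligned} \hat{f}(x) = \sum_{m=1}^{M} \hat{c}_m \, \mathbf{1}(x \in R_m) \end{aligned}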

At first glance, these equations may look confusing, so let us use a part of the data to illustrate what they mean:

# use a size of 20
X_check <- Hitters[31:50, c("Years", "Hits")]
y_check <- Hitters[31:50, c("Salary")]

head(X_check)

##                   Years Hits
## -Bob Melvin           2   60
## -BillyJo Robidoux     2   41
## -Bill Schroeder       4   46
## -Chris Bando          6   68
## -Chris Brown          3  132
## -Carmen Castillo      5   57

Essentially, what they do can be explained by the following three steps:

  1. Let j run over all the variables in the dataset, which in our case means the 2 variables Years and Hits. Let s run over all observed values of the j-th variable. For example, with the sample data above and j = 1 (Years), s runs from Years = 2 (Bob Melvin) through the Years value of the last player.
  2. For the given j and s, allocate each observation into one of two groups, then compute the mean of each group, \hat{c}_1 and \hat{c}_2. Compute the within-group sum of squares using the expression inside the brackets of the first equation above.
  3. Return the j and s that give the smallest within-group sum of squares.

This procedure can also be called a greedy method, because we scan over all possible splits and return the best one. A small worked example of evaluating a single candidate split follows below.
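
To make this concrete, here is a minimal hand computation for one candidate split on the sample data above (j = Years, s = 6, which will turn out to be the winning split); its sum of squares should match the SS returned by tree_split() below:

# evaluate a single candidate split by hand: j = Years, s = 6
s  <- 6
R1 <- which(X_check$Years <  s)     # observations going to region R1
R2 <- which(X_check$Years >= s)     # observations going to region R2
c1_hat <- mean(y_check[R1])         # average salary in R1
c2_hat <- mean(y_check[R2])         # average salary in R2
# within-group sum of squares for this particular (j, s)
sum((y_check[R1] - c1_hat)^2) + sum((y_check[R2] - c2_hat)^2)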

Code

To illustrate this with R code, we will implement a function that takes the data X, the label y, and a minimal leaf size l. The function returns the observations belonging to each region (R1 and R2), the splitting variable j, the split point s, and the corresponding minimal sum of squares SS.

tree_split = function(X,y,l){
  # store the split point
  S = matrix(NA,nrow = nrow(X), ncol = ncol(X))
  # store the sum of squares
  SS = matrix(NA,nrow = nrow(X), ncol = ncol(X))
  for (j in 1:ncol(X)) {
    for (k in 1:nrow(X)) {
      # use the data point as split point
      s = X[k,j]
      # get the size in each leaf
      R1_size = length(  which(X[,j] < s) )
      R2_size = length(  which(X[,j] >= s) )
      # proceed only if each leaf would have at least the minimum size l
      if (R1_size >= l & R2_size >=l) {
        # mean response in the left region R1
         c1 = mean( y[which(X[,j] < s)] )
         # mean response in the right region R2
         c2 = mean( y[which(X[,j] >= s)] )
         # within-group sum of squares for this candidate split
         SS[k,j] =  sum( (y[which(X[,j] < s)] - c1)^2 ) +
           sum((y[which(X[,j] >= s)] - c2)^2)
      } else{
        # if a leaf would be smaller than the minimum size, rule this split out
        SS[k,j] = Inf
      }
      S[k,j] = s
    }
  }
  # find the (row, column) index of the matrix entry with the smallest sum of squares
  j = which(SS == min(SS), arr.ind = TRUE)[1,2];
  s = X[which(SS == min(SS), arr.ind = TRUE)[1,1],j];
  R1 = which(X[,j] < s);
  R2 = which(X[,j] >= s)
  return(list(j = j,
               s = s,
               R1 = R1,
               R2 = R2
              , SS = min(SS)
              )
              )
  }

Then check with the sample data, setting the minimal leaf size to 5:


tree_split(X_check, y_check, l = 5)

$j
col
  1

$s
[1] 6

$R1
 [1]  1  2  3  5  6  9 13 16 18 19

$R2
 [1]  4  7  8 10 11 12 14 15 17 20

$SS
[1] 1346633

The results seem to be reasonable. What about the first split for the whole training data?

tree_split(X_train , y_train , l = 5 )

$j
col
  1

$s
[1] 5

$R1
 [1]   1   2   3   5   9  13  16  18  19  21  22
[12]  28  30  36  38  40  41  42  45  51  55  57
[23]  73  75  78  79  80  88  89  94  96  99 100
[34] 102 104 107 113 114 118 119 121 128 130 133
[45] 134 138 139 141 142 143 145 146 147 149 151
[56] 152 155 157 167 178 180 183 184 185 187 191
[67] 192 193 194 197 198 199 204 206 210 211 216
[78] 220 222 227 228

$R2
  [1]   4   6   7   8  10  11  12  14  15  17  20
 [12]  23  24  25  26  27  29  31  32  33  34  35
 [23]  37  39  43  44  46  47  48  49  50  52  53
 [34]  54  56  58  59  60  61  62  63  64  65  66
 [45]  67  68  69  70  71  72  74  76  77  81  82
 [56]  83  84  85  86  87  90  91  92  93  95  97
 [67]  98 101 103 105 106 108 109 110 111 112 115
 [78] 116 117 120 122 123 124 125 126 127 129 131
 [89] 132 135 136 137 140 144 148 150 153 154 156
[100] 158 159 160 161 162 163 164 165 166 168 169
[111] 170 171 172 173 174 175 176 177 179 181 182
[122] 186 188 189 190 195 196 200 201 202 203 205
[133] 207 208 209 212 213 214 215 217 218 219 221
[144] 223 224 225 226 229 230 231 232 233

$SS
[1] 38464163


The first splitting variable is j = 1, which is Years, and the split point is 5. If Years is smaller than 5, the observation goes to R1; otherwise it goes to R2.

Tree Growing

In this part, I will first show the code and then illustrate it.

Code

Conceptually, tree growing is easy to understand: we keep applying the tree-splitting step defined above until the resulting leaves are too small to be split any further (leaf size < 2l).
However, it is a bit harder to implement in code. Here I will write a function that takes the same X, y, and l. It returns a data frame with columns j, s, gamma, R1_i and R2_i. Gamma is the mean response in a terminal leaf, i.e. the value the tree predicts for observations that end up there. R1_i and R2_i indicate which row of the data frame to go to next.

max_num_leaf = 7
# this only caps the number of tree nodes that can be stored; as long as it is
# large enough it does not affect the fitted tree, and unused rows will show NA
grow_tree = function(X,y,l){
  # make the matrix to store the data
  S = matrix(NA,nrow = max_num_leaf, ncol = nrow(X))
  j = matrix(NA,nrow = max_num_leaf, ncol = 1)
  s = matrix(NA,nrow = max_num_leaf, ncol = 1)
  gamma_m = matrix(NA,nrow = max_num_leaf, ncol = 1)
  R1_i = rep(NA, length = max_num_leaf);
  R2_i = rep(NA, length = max_num_leaf)
  # the initial value
  S[1,] = c(1 : nrow(X))
  M = 1
  m = 1

  while (m <= M ) {
      # split this node only if it is large enough to yield two leaves of size >= l
     if ( length(S[m,][!is.na(S[m,])]) >= (2*l) ){
         # split node m once and reuse the result
       split_m = tree_split(X[S[m,],], y[S[m,]], l = l)
         # map the positions returned by tree_split back to row indices of the
         # original data before storing them, so that later subsets stay correct
       S[M+1, 1:length(split_m$R1)] = S[m, split_m$R1]
       S[M+2, 1:length(split_m$R2)] = S[m, split_m$R2]
          # store the split variable and split point
       j[m] = split_m$j
       s[m] = split_m$s
          # the two new leaves live in rows M+1 and M+2 of the output
       R1_i[m] = M+1 ; R2_i[m] = M + 2
       M = M + 2
     } else {
         # the node is too small to split: record its mean response as gamma
       gamma_m[m] = mean( y[S[m,]], na.rm = T )
     }
    m = m + 1

  }
  return(data.frame( j = j, s = s , R1_i=R1_i, R2_i = R2_i, gamma = gamma_m ))
}

Explanation

My code here may look a bit long-winded. I believe different people will take different approaches to building this algorithm, and if you who are reading this post have a neater way, please let me know, thank you 😃
Instead of explaining my code line by line, I would like to explain the general concept with the help of my sketch. Sorry if it is a bit ugly.


An ugly sketch

The key is the interplay of m (green) and M (yellow). Here m denotes the index of the node currently being processed, and M denotes the total number of nodes created so far. As long as node m holds at least 2l observations it is split, so M jumps ahead by 2 (each split creates two new leaves) while m advances by 1 per iteration.
It is kind of like m chasing M: M stops growing once no node can be split any further, and the loop ends when m catches up with M. Every time m advances one step, it calls the tree-splitting function and gathers the useful information. A minimal sketch of this bookkeeping follows below.
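
To make that bookkeeping concrete, here is a stripped-down sketch of the loop with the actual splitting replaced by a stand-in condition; it only illustrates how m chases M and is not part of the real algorithm:

# m chases M: every split pushes M two steps ahead, while m advances by one
M <- 1   # total number of nodes created so far
m <- 1   # node currently being processed
while (m <= M) {
  can_split <- M < 7          # stand-in for "node m holds at least 2*l observations"
  if (can_split) {
    M <- M + 2                # a split creates two child nodes
  }
  m <- m + 1                  # move on to the next node
}
c(m = m, M = M)               # the loop ends once m has caught up with (and passed) M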

Let us try out with the sample data:

tr = grow_tree(X_check,y_check,l = 5)
tr

##    j   s R1_i R2_i gamma
## 1  1   6    2    3    NA
## 2  1   4    4    5    NA
## 3  2 102    6    7    NA
## 4 NA  NA   NA   NA   317
## 5 NA  NA   NA   NA   496
## 6 NA  NA   NA   NA   274
## 7 NA  NA   NA   NA   539

This data frame can be regarded as a "map". For example, take the observation Bob Melvin with Years = 2 and Hits = 60. First, since Years < 6, he follows R1_i = 2 and goes to the 2nd row of the data frame. Second, since Years < 4, he follows R1_i = 4 and goes to the 4th row. Third, since that row only shows NA in the columns indicating the next step, the 4th row is his destination and his leaf value is gamma = 317.
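
We can trace the same path directly on tr; the rows referenced below are the ones printed above:

# follow Bob Melvin (Years = 2, Hits = 60) through the map by hand
tr[1, ]          # Years < 6  -> go to row tr[1, "R1_i"], i.e. row 2
tr[2, ]          # Years < 4  -> go to row tr[2, "R1_i"], i.e. row 4
tr[4, "gamma"]   # terminal row: the predicted value is 317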

Prediction

Finally, we are at an exciting part, predicting a new observation given our pre-trained decision tree!

The basic idea of prediction is to follow the output data frame tr. As mentioned above, R1_i and R2_i indicate which row of the data frame to go to next, like a map. The walk stops once the current row shows NA in its first 4 columns.

The function below predicts the leaf value gamma for new observations.

predict_with_tree = function(newdata, tree){

 pred = matrix(NA, nrow = 1, ncol = nrow(newdata)) # one prediction per new observation
 for (i in 1: nrow(newdata)) {
   # start with m =1, the first row
   m =1 ; s = tree[m,2] ; j = tree[m,1]
   while (!is.na(tree[m,1])) {
     if (newdata[i,j] <s) { # follow R1_i
       m = tree[m,3];s = tree[m,2] ; j = tree[m,1]

     }else{# follow R2_i
       m = tree[m,4];s = tree[m,2] ; j = tree[m,1]
     }
     pred[i] = tree[m,5]
   }
 }
 return(pred)
}

Let us predict two new observations:

X_new <- Hitters[51:52, c("Years", "Hits")]
y_new <- Hitters[51:52, c("Salary")]
predict_with_tree(newdata = X_new, tree = tr)

##     [,1] [,2]
## [1,]  317  496

The gamma of the first observation is 317 and of the second is 496.

What is the mean square error on the test set for a tree trained on the whole training data?

# allow a large maximum number of tree nodes
max_num_leaf = 50

# since we have more observations than the check data
# set the minimum leaf size = 10
tr = grow_tree(X_train,y_train,l = 10)
pred_tr = predict_with_tree(newdata = X_test, tree = tr)

MSE = mean((pred_tr - y_test) ^2 )
cat("MSE =",MSE)

## MSE = 78395.21

Bagging

Concept

The basic idea of bagged tree regression is that we draw, with replacement, a random sample of N units from the original sample and fit a tree to it, and we repeat this B times. In the end we average the B predictions to obtain the final prediction. The figure on p. 285 of ESL shows that as the number of bootstrap samples grows, the test error decreases and then levels off at a value below that of the original tree.
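
In formula form, if \hat{f}^{*b}(x) denotes the prediction of the tree fit to the b-th bootstrap sample, the bagged prediction is simply their average:

\begin{aligned} \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x) \end{aligned}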


Figure from P285, ESL

Code

bagged_tree = function(Xtrain, Ytrain, l, B,Xtest){
  sizeN = nrow(Xtrain)
  # store the predictions
  bagged_pred = matrix(NA, nrow = B, ncol = nrow(Xtest))
  for (i in (1:B)) {
    # bootstrap sample
    random_draw = sample( c(1:sizeN ), size = sizeN ,replace = T )
    bagged_tr = grow_tree(Xtrain[random_draw,],Ytrain[random_draw],l)
    bagged_pred_tr = predict_with_tree(newdata = Xtest, tree = bagged_tr)
    bagged_pred[i,] = bagged_pred_tr
  }
  # the final prediction is the mean of the B trees' predictions
  return(colMeans(bagged_pred) )
}

Let us see whether the RMSE gets smaller as B gets bigger:


set.seed(100)
cat("B = 10, bagged tree RMSE = ",sqrt (mean(( bagged_tree(X_train,y_train,l = 10,
                                               B=10,Xtest = X_test) - y_test) ^2 ) ), "\n",
    "B = 20, bagged tree  RMSE = ",sqrt (mean(( bagged_tree(X_train,y_train,l = 10,
                                               B=20,Xtest = X_test) - y_test) ^2 ) ), "\n",
    "B = 30, bagged tree  RMSE = ",sqrt (mean(( bagged_tree(X_train,y_train,l = 10,
                                               B=30,Xtest = X_test) - y_test) ^2 ) ), "\n"
    )

## B = 10, bagged tree RMSE =  312.8866
##  B = 20, bagged tree  RMSE =  309.4088
##  B = 30, bagged tree  RMSE =  295.2994

The RMSE does indeed get smaller.

Random Forest and Boosting

Concept

The idea of random forest is very similar to the bagged tree model. The one difference is that in the bagged tree model all features are candidates at every split, whereas in a random forest only a random subset of features (drawn without replacement) is considered at each split. Random forests are expected to outperform bagging when the bagged trees are highly correlated, because restricting the candidate features decorrelates the trees.
The intuition behind boosting is that each new tree learns from the errors of the trees trained so far, so the ensemble improves with every added tree; a toy sketch of this idea follows below.
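
Before turning to the packages, here is a toy two-stage sketch of that residual-fitting idea, reusing the hand-rolled grow_tree() and predict_with_tree() from above. It is only meant to illustrate the intuition (there is no learning rate and no regularization), not to reproduce what xgboost actually does:

# stage 1: fit one tree to the salaries
tree_1  <- grow_tree(X_train, y_train, l = 10)
resid_1 <- y_train - as.vector(predict_with_tree(newdata = X_train, tree = tree_1))

# stage 2: fit a second tree to the residuals left over by the first one
tree_2 <- grow_tree(X_train, resid_1, l = 10)

# the "boosted" prediction is the sum of the two stages
boost_pred <- as.vector(predict_with_tree(newdata = X_test, tree = tree_1)) +
              as.vector(predict_with_tree(newdata = X_test, tree = tree_2))
sqrt(mean((boost_pred - y_test)^2))  # test-set RMSE of the two-stage prediction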

For this part, we just need to feed the data to the randomForest() function. The argument ntree controls the number of bootstrap samples (trees). To make it comparable, I set ntree to 10, the same as in the previous bagged tree regression.

Code

train_df = Hitters[31:nrow(Hitters),]

rf_mod = randomForest(Salary~ . , data = train_df, ntree = 10)
rf_mod

##
## Call:
##  randomForest(formula = Salary ~ ., data = train_df, ntree = 10)
##                Type of random forest: regression
##                      Number of trees: 10
## No. of variables tried at each split: 1
##
##           Mean of squared residuals: 182725.5
##                     % Var explained: 15.89

Only one variable is tried at each split. I suppose the reason is that there are only 2 variables in the X training data. The rule of thumb is to try K/3 variables at each split for a regression model; in our case that is 2/3, which ends up as 1 since the package uses at least one variable. The sketch below shows how to set this number explicitly.
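
If you want to control this number yourself, randomForest exposes it through its mtry argument; the call below is just a sketch with mtry set explicitly to the same value of 1:

# same model, but with the number of variables tried at each split set explicitly
rf_mod_mtry <- randomForest(Salary ~ ., data = train_df, ntree = 10, mtry = 1)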

Now we can also pass xtest and ytest to the randomForest function so that we get the MSE on the test set.

set.seed(100)
rf_mod = randomForest(Salary~ . , data = train_df,xtest = X_test ,ytest = y_test, ntree = 10)
rf_mod

## Call:
## randomForest(formula = Salary ~ ., data = train_df, xtest = X_test,      ytest = y_test, ntree = 10)
##               Type of random forest: regression
##                     Number of trees: 10
## No. of variables tried at each split: 1
##
##          Mean of squared residuals: 193201.2
##                    % Var explained: 11.06
##                       Test set MSE: 67380.79
##                    % Var explained: 24.66



cat("random forest RMSE of test set =", sqrt(67380.79))

## random forest RMSE of test set = 259.5781

After reading the xgboost documentation, I assume the parameter nrounds controls the number of trees (boosting rounds), so I set it to 10 to make it comparable with the previous results.

xgb = xgboost(data = data.matrix(X_train), label = y_train,
              max.depth =5, eta = 1,
              nthread = 2, nrounds =10
       )

And the RMSE of the predictions can be calculated as follows:


y_pred <- predict(xgb, data.matrix(X_test))

cat("RMSE of xgboost = ", sqrt(mean((y_pred - y_test)^2)) )
## RMSE of xgboost =  285.6745

The RMSE is bigger than that of the random forest model. The reason could be that the sample size is not big enough, or that I did not tune the parameters of the function well; a cross-validation sketch for such tuning follows below.
However, it is still better than the bagged tree model.
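
If you would like to explore the tuning yourself, the xgboost package provides a cross-validation helper, xgb.cv(). A minimal sketch is below; the parameter values are arbitrary placeholders, not tuned recommendations:

# 5-fold cross-validation with a smaller learning rate and more boosting rounds;
# the train and test RMSE are reported for every round
cv <- xgb.cv(data = data.matrix(X_train), label = y_train,
             max.depth = 3, eta = 0.3, nthread = 2,
             nrounds = 50, nfold = 5)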

Ending

Challenges

Conceptually, tree-based methods are not difficult to understand. However, depending on your background, they can be tricky to implement from scratch. For example, when I was coding the tree-growing algorithm, I struggled with keeping track of m and M. It is also easy to code the tree-growing process for a fixed depth of 2 or 3, but harder to generalize it, which requires a good grasp of looping.
Nevertheless, the struggle did help me understand the algorithm better. I would encourage readers to get their hands dirty and start from scratch, despite the fact that there are packages which can make it work easily.

Tips

  • The graphic illustration of how tree-based methods partition the feature space into a set of rectangles is very good. Please check it out on ESL p. 306.