Bolin Wu
# Classification of Image Data by Simple NN

Contents:

- Neural Network Basics
- Feed-Forward Neural Network using Keras and TensorFlow
  - Have a grasp of MNIST
  - Implementation
  - Parameter Tuning
    - (a) Increase the number of hidden units to 128
    - (b) Change the activation function to ReLU
    - (c) Change the optimizer to RMSprop
    - (d) Try to run the net for ten epochs and use early stopping for regularization
    - (e) Add a second layer with 128 hidden units
    - (f) Add dropout
  - Try to improve the network architecture
  - Compute the accuracy, precision, and recall of the designed model
- Ending
  - Challenges
  - Tips

2021-02-07 · 16 min read

supervised learning
Intuitively, a simple neural network is a combination of many (linear) transformations, which is similar to a mixture model in some ways. It can transform the input data in more sophisticated ways than a single linear model can. The simple neural network is the foundation for many more advanced neural network models, e.g., the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM). By the way, I posted an LSTM project here; please feel free to check it out if you are interested.

The content of this post includes:

- The basics of the **feedforward neural network**.
- Its application with the help of TensorFlow and Keras.
- Several useful parameter tunings.

The main reference is *Deep Learning*, Goodfellow et al., Chapter 6.

If you have not heard of neural networks before, this video may help you grasp the idea easily:

# Neural Network Basics

First, let us implement a straightforward neural network mathematically. We assume our single-hidden-layer network to be:

$f(x; W, c, w, b) = w^{T} \max\{0, W^{T}x + c\} + b$

where $x$ is an input vector (one row of the data matrix $X$ below), $W$ and $c$ are the weight matrix and intercept of the hidden layer, and $w$ and $b$ are the weight vector and intercept of the output layer. These parameters are usually estimated by gradient-based optimization, with the gradients computed by the backpropagation algorithm. Here, for illustration, we just assume:

$W = \begin{bmatrix}1 & 1\\ 1 & 1 \end{bmatrix}$

$c = \begin{bmatrix}0 \\ -1 \end{bmatrix}$

$w = \begin{bmatrix}1 \\ -2 \end{bmatrix}$

and $b = 0$.

The input data:

$X = \begin{bmatrix}0 & 0\\ 0 & 1\\ 1 & 0\\ 1 & 1 \end{bmatrix}$

The basic steps of implementing the network above are:

- Transform the original data X using the weight W and the intercept c.
- Send the transformed data through the activation function to get the output X'.
- Transform X' using the weight w and the intercept b to get the final output.

The max() part is actually the ReLU activation function, and this example is a realization of the XOR logic gate.
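
As a hand check of these steps, here is a worked calculation that uses only the matrices defined above. The notation $\mathbf{1}c^{T}$ is my own shorthand for adding $c^{T}$ to every row; it is not taken from the reference.

$XW + \mathbf{1}c^{T} = \begin{bmatrix}0 & 0\\ 1 & 1\\ 1 & 1\\ 2 & 2 \end{bmatrix} + \begin{bmatrix}0 & -1\\ 0 & -1\\ 0 & -1\\ 0 & -1 \end{bmatrix} = \begin{bmatrix}0 & -1\\ 1 & 0\\ 1 & 0\\ 2 & 1 \end{bmatrix}$

$X' = \max\{0, XW + \mathbf{1}c^{T}\} = \begin{bmatrix}0 & 0\\ 1 & 0\\ 1 & 0\\ 2 & 1 \end{bmatrix}$

$X'w + b = \begin{bmatrix}0\\ 1\\ 1\\ 0 \end{bmatrix}$

The output is 1 exactly for the rows of X in which one of the two inputs is 1, which is the XOR truth table.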

R Code:

```
# load the packages; sigmoid provides relu(), the others are used later in the post
library(sigmoid)
library(keras)
library(kerasR)
library(tensorflow)

W = matrix(rep(1, 4), nrow = 2)
c = matrix(c(0, -1), nrow = 2)
w = matrix(c(1, -2), nrow = 2)
X = matrix(c(0, 0, 0, 1, 1, 0, 1, 1), byrow = TRUE, nrow = 4, ncol = 2)

network = function(x_input, W_input, c_input, w_input, b_input) {
  xw = x_input %*% W_input
  # repeat c_input across the rows so that it can be added in the next step
  c_trans = matrix(NA, nrow = nrow(xw), ncol = nrow(c_input))
  for (i in 1:nrow(c_input)) {
    c_trans[, i] = rep(c_input[i], nrow(xw))
  }
  # affine transformation of the hidden layer, (6.8) and (6.9)
  layer_trans1 = xw + c_trans
  # put into the activation function, (6.10)
  activation_1 = relu(layer_trans1)
  # output, (6.11); use the argument w_input rather than the global w
  out_1 = activation_1 %*% w_input + b_input
  return(out_1)
}

network(X, W_input = W, c_input = c, w_input = w, b_input = 0)
     [,1]
[1,]    0
[2,]    1
[3,]    1
[4,]    0
```

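# Feed-Forward Neural Network using Keras and TensorFlow

Now let us step into the application part. You can find out how to install keras and tensorflow in R here.

## 1. Have a grasp of MNIST

The data that we will use is the classic MNIST dataset from keras, which contains 70,000 images of handwritten digits (60,000 for training and 10,000 for testing), each 28 × 28 pixels. The data can easily be loaded as follows: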

```
library(keras)
mnist <- dataset_mnist()
# scale the dataset
mnist$train$x <- mnist$train$x/255
mnist$test$x <- mnist$test$x/255
```

Let us first get a feel for the dataset by visualizing a digit. You can play around with the code and explore the dataset in your own way.

```
idx <- 3
im <- mnist$train$x[idx,,]
# reorient the matrix so that image() draws the digit upright
im <- t(apply(im, 2, rev))
image(1:28, 1:28, im, col = gray((0:255) / 255), xlab = "", ylab = "",
      xaxt = "n", yaxt = "n", main = paste(mnist$train$y[idx]))
```

*Fig.1 Digit Visualization*

We can also use the object.size() function to see the size of the MNIST dataset:

```
cat("The training data set is",object.size(mnist$train),"bytes;","\n",
"The test data set is",object.size(mnist$test),"bytes.")
The training data set is 376560792 bytes;
The test data set is 62760792 bytes.
```

## 2. Implementation

Next, we start with one hidden layer with 16 units and the sigmoid as the activation function, without any regularization.

```
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 16, activation = "sigmoid") %>%
  # output layer
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "adam",
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 5,
    validation_split = 0.3,
    verbose = 2
  )
Epoch 1/5
1313/1313 - 1s - loss: 0.5673 - accuracy: 0.8667
1313/1313 - 2s - loss: 0.5673 - accuracy: 0.8667 - val_loss: 0.3418 - val_accuracy: 0.9087
Epoch 2/5
1313/1313 - 1s - loss: 0.3237 - accuracy: 0.9125
1313/1313 - 2s - loss: 0.3237 - accuracy: 0.9125 - val_loss: 0.3124 - val_accuracy: 0.9139
Epoch 3/5
1313/1313 - 1s - loss: 0.2981 - accuracy: 0.9174
1313/1313 - 2s - loss: 0.2981 - accuracy: 0.9174 - val_loss: 0.2998 - val_accuracy: 0.9169
Epoch 4/5
1313/1313 - 1s - loss: 0.2841 - accuracy: 0.9212
1313/1313 - 2s - loss: 0.2841 - accuracy: 0.9212 - val_loss: 0.2951 - val_accuracy: 0.9199
Epoch 5/5
1313/1313 - 1s - loss: 0.2753 - accuracy: 0.9235
1313/1313 - 2s - loss: 0.2753 - accuracy: 0.9235 - val_loss: 0.2874 - val_accuracy: 0.9236
```

*Training Process*

After 5 epochs, the training accuracy is 0.9235 and the validation accuracy is 0.9236.

Please note that in the setup part, the last dense layer is the output layer. It has to have 10 units because there are ten digits and we want to classify the images into these 10 classes. If in the future we would like to do a yes/no classification, a single output unit (typically with a sigmoid activation) is enough.
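
To make the role of the 10-unit softmax output layer more concrete, here is a minimal base R sketch. It is not how Keras implements these functions internally; the function names `softmax` and `sparse_cce` and the scores in `logits` are made up for illustration.

```r
# Softmax turns a vector of raw scores into probabilities that sum to 1.
softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the max for numerical stability
  e / sum(e)
}

# Sparse categorical cross-entropy for one example:
# the label is an integer digit 0..9, not a one-hot vector.
sparse_cce <- function(probs, label) {
  -log(probs[label + 1])  # +1 because R vectors are 1-indexed
}

# made-up scores of the 10 output units for one image
logits <- c(0.1, 2.0, -1.0, 0.3, 0.0, 0.5, -0.2, 1.0, 0.0, 0.4)
probs <- softmax(logits)

# the predicted digit is the class with the highest probability
predicted_digit <- which.max(probs) - 1

# loss if the true digit is 1
loss <- sparse_cce(probs, label = 1)
```

The "sparse" variant of the loss is what allows `mnist$train$y` to be passed to the model as plain integer labels.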

## 3. Parameter Tuning

The baseline above uses 5 epochs, one hidden layer with 16 units, and the sigmoid activation function.

### (a) Increase the number of hidden units to 128

```
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "sigmoid") %>% # increase units to 128
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "adam",
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 5,
    validation_split = 0.3,
    verbose = 2
  )
```

*Training Process*

The validation accuracy after 5 epochs is around 0.9607.

### (b) Change the activation function to ReLU

```
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "relu") %>% # change to ReLU
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "adam",
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 5,
    validation_split = 0.3,
    verbose = 2
  )
```

*Training Process*

The validation accuracy after 5 epochs is around 0.9708.

### (c) Change the optimizer to RMSprop

Sebastian Ruder has an excellent paper, An overview of gradient descent optimization algorithms, which explains the different optimizers. If you would like to know what optimizers are and how they differ, please take a look.

```
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "RMSprop", # change to RMSprop
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 5,
    validation_split = 0.3,
    verbose = 2
  )
```

*Training Process*

The validation accuracy after 5 epochs is around 0.9678, which is not much of a change.

### (d) Try to run the net for ten epochs and use early stopping for regularization

Early stopping means that training stops when a monitored metric has stopped improving. We can do so by adding *"callbacks = callback_early_stopping(monitor = "val_loss", patience = 3)"* to the model fitting call, as shown below. The *"patience"* parameter is the number of epochs with no improvement after which training will be stopped.

Early stopping is one of the remedies for overfitting.
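
To make the patience rule concrete, here is a minimal base R sketch of the stopping logic. It is a simplified illustration, not the Keras implementation: the function `early_stopping_epoch` and the validation losses are made up, and any decrease counts as an improvement.

```r
# Return the epoch at which training would stop under a patience rule.
early_stopping_epoch <- function(val_loss, patience = 3) {
  best <- Inf
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      # improvement: remember the new best value and reset the counter
      best <- val_loss[epoch]
      wait <- 0
    } else {
      # no improvement: count it and stop once patience is used up
      wait <- wait + 1
      if (wait >= patience) return(epoch)
    }
  }
  length(val_loss)  # the rule never triggered, so all epochs are run
}

# made-up validation losses for seven epochs
val_loss <- c(0.30, 0.25, 0.24, 0.26, 0.25, 0.27, 0.23)
stop_epoch <- early_stopping_epoch(val_loss, patience = 3)
```

In this toy run the best loss is reached at epoch 3 and the next three epochs bring no improvement, so training stops at epoch 6 and the lower loss at epoch 7 is never seen. A larger patience trades extra training time for a smaller chance of stopping too soon.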

```
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "RMSprop",
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 10, # run for up to ten epochs
    validation_split = 0.3,
    verbose = 2,
    # early stopping
    callbacks = callback_early_stopping(monitor = "val_loss", patience = 3)
  )
```

*Training Process*

The validation accuracy after 5 epochs is around 0.9716, which is a slight improvement.

### (e) Add a second layer with 128 hidden units

```
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dense(units = 128, activation = "relu") %>% # add a second layer
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "RMSprop",
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 10,
    validation_split = 0.3,
    verbose = 2
  )
# print the layers and the number of parameters
summary(model)
Model: "sequential_15"
_________________________________________________________________________________
Layer (type) Output Shape Param #
=================================================================================
flatten_15 (Flatten) (None, 784) 0
_________________________________________________________________________________
dense_37 (Dense) (None, 128) 100480
_________________________________________________________________________________
dense_38 (Dense) (None, 128) 16512
_________________________________________________________________________________
dense_39 (Dense) (None, 10) 1290
=================================================================================
Total params: 118,282
Trainable params: 118,282
Non-trainable params: 0
_________________________________________________________________________________
```

*Training Process*

The validation accuracy after 10 epochs is around 0.9750. The summary() function shows that the network with two hidden layers has 118,282 parameters in total, which is far more than typical statistical models have. The benefit is a high classification accuracy; the disadvantages are the long training time and the potential for overfitting. Next I will introduce **dropout**, which is a common remedy for overfitting.
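
The parameter counts in the summary can be reproduced by hand. The following base R sketch assumes only that a dense layer has one weight per input-unit pair plus one bias per unit; the helper name `dense_params` is made up for illustration.

```r
# Parameters of a dense layer: n_inputs * n_units weights plus n_units biases.
dense_params <- function(n_inputs, n_units) n_inputs * n_units + n_units

# flattened 28 x 28 input, two hidden layers with 128 units, 10 output units
layer_sizes <- c(28 * 28, 128, 128, 10)

# pair each layer's input size with its number of units
params <- mapply(dense_params, head(layer_sizes, -1), tail(layer_sizes, -1))
total <- sum(params)

print(params)  # 100480 16512 1290
print(total)   # 118282
```

The flatten layer only reshapes the input, so it contributes no parameters, and the three dense layers account for all 118,282.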

### (f) Add dropout

In practice you can choose which layers to apply dropout to. Here I apply dropout with p = 0.2 after the first hidden layer and with p = 0.5 after the second.

Without dropout, **every** node in a hidden layer is connected with **every** node in the next layer. With dropout, each node in a hidden layer is **excluded** with a given probability at every training step, which helps prevent overfitting to some extent.
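
As a rough illustration of what a dropout layer does during training, here is a base R sketch of "inverted" dropout, where the kept activations are scaled up by 1 / (1 - rate) so that their expected value is unchanged. The function name `dropout` and the all-ones activations are made up for the example; at prediction time no nodes are dropped.

```r
set.seed(1)  # for reproducibility of the random mask

# Zero each activation with probability `rate` and rescale the survivors.
dropout <- function(x, rate) {
  keep <- rbinom(length(x), size = 1, prob = 1 - rate)  # 1 = keep, 0 = drop
  x * keep / (1 - rate)
}

activations <- rep(1, 10000)  # made-up activations of a hidden layer
dropped <- dropout(activations, rate = 0.5)

mean(dropped == 0)  # share of excluded nodes, close to 0.5
mean(dropped)       # average activation, close to the original 1
```

Because a different random subset of nodes is excluded at every step, the network cannot rely on any single node, which is where the regularizing effect comes from.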

```
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(0.2) %>% # introduce dropout
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(0.5) %>%
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "RMSprop",
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 10,
    validation_split = 0.3,
    verbose = 2
  )
```

*Training Process*

The validation accuracy after 10 epochs is around 0.9701.

## 4. Try to improve the network architecture

Now, with all the tuning methods given above, you can play around and build a network that gives the best accuracy.

After many trials with different parameters, the best network I found has a validation accuracy of 0.9725.

The architecture is:

- Two hidden layers with 128 units each; the first uses the sigmoid activation function and the second uses softmax. The output layer has 10 units with the softmax activation function.
- Dropout with probability 0.3 after each hidden layer. This can help prevent overfitting to some extent.
- The optimizer is Adam.
- Set epochs = 50 and use early stopping (patience = 3). I increased the number of epochs in case the network fails to reach its optimum for lack of iterations. Early stopping ends the training once the validation loss has not improved for three consecutive epochs.

The code is listed below:

```
# change the number of epochs, add dropout layers, add early stopping
# add another layer
# library(kerasR)
# set up the model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(28, 28)) %>%
  layer_dense(units = 128, activation = "sigmoid") %>%
  layer_dropout(0.3) %>%
  layer_dense(units = 128, activation = "softmax") %>%
  layer_dropout(0.3) %>%
  layer_dense(10, activation = "softmax")
# compile the model
model %>%
  compile(
    loss = "sparse_categorical_crossentropy",
    optimizer = "adam",
    metrics = "accuracy"
  )
# fit the model
model %>%
  fit(
    x = mnist$train$x, y = mnist$train$y,
    epochs = 50,
    validation_split = 0.3,
    verbose = 2,
    callbacks = callback_early_stopping(monitor = "val_loss", patience = 3)
  )
```

## Compute the accuracy, precision, and recall of the designed model

Here we can use the evaluate() function to get the accuracy directly.

To get the precision and recall, I first use the predict_classes() function to get the predictions as integers, and then use confusionMatrix() from the caret package.

The accuracy is 0.9741.

| Class | Precision | Recall |
|---|---|---|
| 0 | 0.9807 | 0.9867 |
| 1 | 0.9852 | 0.9938 |
| 2 | 0.9748 | 0.9748 |
| 3 | 0.9546 | 0.9792 |
| 4 | 0.9845 | 0.9715 |
| 5 | 0.9664 | 0.9686 |
| 6 | 0.9792 | 0.9812 |
| 7 | 0.9718 | 0.9718 |
| 8 | 0.9733 | 0.9713 |
| 9 | 0.9806 | 0.9504 |
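
For readers who want to see how such per-class numbers arise, here is a base R sketch that computes accuracy, precision, and recall from a confusion table. It does not use caret, and the three-class `truth` and `pred` labels are made up; they are not the MNIST predictions.

```r
# made-up true and predicted labels for a three-class problem
truth <- factor(c(0, 0, 1, 1, 1, 2, 2, 2, 2), levels = 0:2)
pred  <- factor(c(0, 1, 1, 1, 2, 2, 2, 2, 0), levels = 0:2)

# confusion table: rows are predictions, columns are true classes
cm <- table(Predicted = pred, Actual = truth)

tp <- diag(cm)                    # correctly classified cases per class
precision <- tp / rowSums(cm)     # of all cases predicted as k, the share that is k
recall    <- tp / colSums(cm)     # of all true cases of k, the share predicted as k
accuracy  <- sum(tp) / sum(cm)    # overall share of correct predictions

print(round(cbind(precision, recall), 4))
```

Precision is computed along the rows (predictions) and recall along the columns (true classes), so the two differ whenever a class is over- or under-predicted.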

# Ending

In the end, I would like to share some challenges that I met when implementing the neural network (NN).

## Challenges

First, understanding the feedforward NN structure in tensorflow. At first I overlooked the fact that the last dense layer has to be the output layer and thought that it was set up by the API by default. It is important to read the documentation carefully.

Second, we only need the pipe when compiling and fitting the model; we do not need to store the result. At first I did not understand why. The reason is that the R interface wraps the Python implementation of Keras, whose models are modified in place by compile() and fit(), so it does not follow the usual R intuition that a modified object has to be assigned again.

```
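# the model is updated in place, so this is enough
model %>% compile()
model %>% fit()
# there is no need to reassign the result
model = model %>% compile()
# or (fit() returns the training history, so this would overwrite the model)
model = model %>% fit()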
```
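
The in-place behaviour can be mimicked in base R with an environment, which is passed by reference, in contrast to a list, which is copied. This is only an analogy under that assumption: the functions `make_model`, `toy_compile`, and `toy_compile_list` are made up and are not part of keras.

```r
# An environment is passed by reference, so a function can update it in place.
make_model <- function() {
  m <- new.env()
  m$compiled <- FALSE
  m
}
toy_compile <- function(m) {
  m$compiled <- TRUE  # modifies the caller's object
  invisible(m)
}

# A list is copied on modification, so the caller's object stays unchanged.
toy_compile_list <- function(m) {
  m$compiled <- TRUE  # modifies a local copy only
  invisible(m)
}

model_env <- make_model()
toy_compile(model_env)        # no assignment needed
model_env$compiled            # TRUE

model_list <- list(compiled = FALSE)
toy_compile_list(model_list)  # result is discarded
model_list$compiled           # still FALSE
```

The environment behaves like the Keras model: calling the function is enough, and nothing has to be assigned back.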

## Tips

- This post does not discuss some important concepts, e.g., backpropagation and gradient descent, but they are worth checking out.
- It is beneficial to read the summary of the compiled model and to calculate the number of parameters again by hand. This can help you understand the setup better.

Hopefully this post can be helpful to you. Thank you for reading.