Bolin Wu

Evaluate Wine by LSTM and Simple NN

2021-01-10 · 13 min read
R supervised learning

This project is focused on solving the question: Is it possible to let the machine evaluate a wine like a sommelier?
The answer is yes! With the help of a simple neural network and long short-term memory (LSTM), we can make it possible.

Prerequisites for reading this post:

  • Basic knowledge of neural networks and LSTM.
  • Basic knowledge of R programming, TensorFlow and the functional API.

This project was done together with my teammate Zhenyu Zhao. It took us a lot of time to finish, but we really enjoyed the process, so I am writing this post to share what we have learnt. It is published with Zhenyu's permission. If you have any questions, please let me know. My contact details are available on the front page.



The data come from Kaggle. The original data set has 150,000 observations; after variable selection we keep the 6 variables listed below.

  • Points: the score given to the wine on a scale of 80-100 (a review is only posted when the wine scores at least 80)
  • Description: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
  • Variety: the type of grapes used to make the wine
  • Country: the country that the wine is from
  • Province: the province or state that the wine is from
  • Price: the cost for a bottle of the wine


After downloading the data, we have to do some data cleaning:

  1. Select the 6 variables of interest, as shown above.
  2. Remove the observations that have NA values.
  3. Remove the observations that belong to a variety, country and province group with fewer than 50 observations.
  4. Filter out the descriptions that are longer than 100 words.

In the end, the dataset has roughly 120,000 observations.

The reason for step 3 is that if there are too few observations in a group, we cannot train the model well. The reason for step 4 is that the maximum sequence length we train on is 100 words, since 99% of the descriptions are shorter than 100 words. Another reason for filtering out overly long descriptions is that the real intention is sometimes expressed in the latter part of a long description, but the model only looks at the first 100 words, so long descriptions may “mislead” the model.

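# load the packages
library(tidyverse)   # read_csv(), dplyr verbs and %>%
library(keras)       # layers, models and training
library(tfdatasets)  # feature_spec() and the step_*() helpers

df = read_csv("wine_150k_data.csv", col_names = TRUE)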

#---------------- Description column cleaning ---------------#
# select the variables that we need              
df = df %>% select(country,description, points, province, variety, price)
df = na.omit(df)

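# find the variety/country/province groups with fewer than 50 observations
unique_obs <- df %>%
  group_by(variety, country, province) %>%
  summarize(n = n()) %>%
  filter(n < 50)

sum(unique_obs$n) # check obs number to be deleted

# drop the observations in those small groups
# (assumed reconstruction: anti_join removes the rows that match a small group)
df <- df %>%
  anti_join(unique_obs, by = c("variety", "country", "province"))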
# find the length of each description in words
desc_len <- df$description %>%
  strsplit(" ") %>%
  sapply(length)

# filter out the descriptions that are too long
# (logical indexing also works when no description exceeds the limit)
df <- df[desc_len <= 100, ]

# split training and testing sets (80% / 20%).
training_id <- sample.int(nrow(df), size = nrow(df)*0.8)
training <- df[training_id,]
testing <- df[-training_id,]
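
The choice of 100 words as the cut-off can be checked directly from the word counts. The snippet below is a minimal base R sketch on a few made-up descriptions (the `descriptions` vector is hypothetical and stands in for `df$description`); it is not the exact check we ran.

```r
# Toy descriptions standing in for df$description (hypothetical data)
descriptions <- c(
  "Ripe cherry and vanilla with a long finish",
  "Crisp and dry",
  "Aromas of blackberry, tobacco and leather lead to a firm, tannic palate"
)

# Count whitespace-separated words in each description
word_counts <- lengths(strsplit(descriptions, "\\s+"))

# Length below which 99% of the descriptions fall
cutoff <- quantile(word_counts, probs = 0.99)

# Share of descriptions that would be kept for a candidate maximum length
max_length <- 100
share_kept <- mean(word_counts <= max_length)
```

On the real data, a `cutoff` close to 100 and a `share_kept` close to 0.99 would support the choice of `max_length`.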

Chosen models

Before we introduce the models, let’s review the variables that we have. The data consist of 3 types of variables: numerical, categorical and textual. A recurrent neural network (RNN) or LSTM can be used to deal with textual data consisting of sentences. LSTM is a variant of RNN; one difference is that its cell state stores information for later time steps, which mitigates the vanishing gradient problem to some extent, so we use LSTM instead of a plain RNN. A simple neural network, on the other hand, is good at dealing with numeric data. Therefore we would like to combine the two to predict the points of a wine. Hopefully the figure below can help you understand the structure.

The numeric data is price. The categorical data is country, province and variety. The textual data is description. We treat the textual data as main input, numeric and categorical data as auxiliary input. These two branches are separately set up and then concatenated together. The concatenated layer is fed to a final simple Neural Network to make prediction of points.

One thing worth noticing is how TensorFlow handles categorical and numeric data: it creates a feature space based on the available data set, in which we specify which columns are numeric and which are categorical. Categorical data are handled by mapping every category value to a one-hot encoded vector.
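
To make the one-hot idea concrete, here is a small base R sketch. It only illustrates the encoding itself on a made-up `country` vector; it does not reproduce what the TensorFlow feature columns do internally.

```r
# Hypothetical country values
country <- c("France", "Italy", "US", "Italy")

# Vocabulary: the distinct levels seen in the data
vocab <- sort(unique(country))

# One row per observation, one column per level; 1 marks the matching level
one_hot <- outer(country, vocab, "==") * 1L
colnames(one_hot) <- vocab
```

Every row of `one_hot` contains exactly one 1, so a variable with many levels (such as variety) becomes a wide and sparse input.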

It is interesting to see how the combined model performs. Would it outperform or underperform each single neural network? We discuss this in the following sections.

Model setup code

As mentioned above, there are two branches. The functions used below mainly come from the functional API. Let us see how the branches are set up separately.

The textual part

#------------------ textual part --------------#

# Define the number of tokens and max length of each
# description
num_words <- 10000
max_length <- 100
text_vectorization <- layer_text_vectorization(
  max_tokens = num_words,
  output_sequence_length = max_length
)

# adapt() is a built-in function that learns the vocabulary;
# which data it was fitted on is not shown, training$description is assumed
text_vectorization %>%
  adapt(training$description)


input <- layer_input(shape = c(1), dtype = "string")

output <- input %>%
  text_vectorization() %>%
  layer_embedding(input_dim = num_words + 1, output_dim = 16
                  ,input_length = max_length) %>%
  # layer_global_average_pooling_1d() %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 16, activation = "relu") %>%
  # layer_dropout(0.5) %>%
  layer_dense(units = 1, activation = "linear")

model <- keras_model(input, output)

model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = 'mse',
  metrics = list('mean_squared_error')
)

model_tex <- model %>% fit(
  training$description, training$points,
  epochs = 5,
  batch_size = 128,
  validation_split = 0.2
)

# prediction
pred_tex <- predict(model,  testing$description )

# MSE for model comparison
mse_text<- sum( (pred_tex - testing$points)^2 ) / nrow(testing)

The LSTM model consists of an input layer, a text vectorization layer, an embedding layer whose output has dimension (100, 16), an LSTM layer with 32 units (as in the code above), a hidden layer of 16 units and an output layer with 1 unit. The dimension of the embedding output is (100, 16) because 99% of descriptions are within 100 words and we want to map every word to a space with 16 abstract features.
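
The path from a raw description to a (max_length, 16) matrix can be sketched in base R. The helper `vectorize_text()`, the tiny vocabulary and the random embedding matrix below are all hypothetical; the sketch assumes the usual convention that id 0 is reserved for padding and id 1 for out-of-vocabulary words, and it only imitates what the vectorization and embedding layers do.

```r
# Hypothetical helper: turn one description into a fixed-length vector of word ids
vectorize_text <- function(text, vocab, max_length) {
  # lower-case, drop punctuation, split on whitespace
  clean <- gsub("[[:punct:]]", "", tolower(text))
  tokens <- strsplit(clean, "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # ids: 0 is reserved for padding, 1 for out-of-vocabulary words
  ids <- match(tokens, vocab) + 1L
  ids[is.na(ids)] <- 1L
  # truncate to max_length, then pad with zeros
  ids <- head(ids, max_length)
  c(ids, rep(0L, max_length - length(ids)))
}

vocab <- c("cherry", "oak", "dry")
vec <- vectorize_text("Dry, with CHERRY and oak.", vocab, max_length = 8)

# Embedding lookup: one 16-dimensional row per id (random weights for illustration)
set.seed(1)
embedding <- matrix(rnorm((length(vocab) + 2) * 16), ncol = 16)
embedded <- embedding[vec + 1L, ]  # + 1 because R indexes from 1
```

Here `embedded` has 8 rows and 16 columns; with `max_length <- 100` it would have the (100, 16) shape described above.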

The categorical and numeric part

#### --------------- categorical part ---------------####

# set up the feature space
spec <- feature_spec(training, points ~ variety + country + province + price) %>%
  # standardize the numeric column (price is the only one)
  step_numeric_column(
    price,
    normalizer_fn = scaler_standard()
  ) %>%
  step_categorical_column_with_vocabulary_list(country, province, variety) %>%
  step_indicator_column(country, province, variety) %>%
  step_embedding_column(country, province, variety, dimension = 16)

spec_prep <- fit(spec)

input <- layer_input_from_dataset(training %>% select(variety, country, province, price))
output <- input %>%
  layer_dense_features(dense_features(spec_prep)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "linear")

model <- keras_model(input, output)



model %>% compile(
  optimizer = optimizer_rmsprop(),
  loss = 'mse',
  metrics = list('mean_squared_error')
)

history <- model %>%
  fit(
    x = training %>% select(variety, country, province, price),
    y = training$points,
    epochs = 5,
    validation_split = 0.2
  )

# prediction
pred_num_cat <- predict(model,  testing%>% select(variety, country, province, price) )

# MSE for model comparison
mse_num_cat  <- sum( (pred_num_cat - testing$points)^2 ) / nrow(testing)

The simple neural network for categorical and numeric data consists of 4 input layers (one for each of the 4 variables), a feature layer, a hidden layer with 32 units and an output layer with 1 unit. The feature layer maps the categorical and numeric data to a feature space in a form that the network can process.
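
For the numeric column, the feature layer standardizes price. A minimal base R sketch of that idea follows, on made-up prices. It assumes the mean and standard deviation are estimated on the training data only and uses the sample standard deviation from `sd()`; the exact formula used by `scaler_standard()` may differ.

```r
# Hypothetical prices
train_price <- c(10, 20, 30, 40)
test_price <- c(25, 50)

# Statistics are estimated on the training set only
price_mean <- mean(train_price)
price_sd <- sd(train_price)

# The same transformation is applied to both sets
train_scaled <- (train_price - price_mean) / price_sd
test_scaled <- (test_price - price_mean) / price_sd
```

Reusing the training statistics on the test set keeps information from the test data out of the model inputs.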

The concatenation part

# similar to the previous
main_input <- layer_input(shape = c(1), dtype = "string", name = 'main_input')

lstm_out <- main_input %>%
   text_vectorization() %>%
  layer_embedding(input_dim = num_words + 1, output_dim = 16
                  ,input_length = max_length) %>%
  layer_lstm(units = 16 )
## ---------------

## ---------------

# cate and num
auxiliary_input <- layer_input_from_dataset(training %>%
                                              select(variety, country, province, price))

auxiliary_output <- auxiliary_input %>%
  layer_dense_features(dense_features(spec_prep)) %>%
  layer_dense(units = 32, activation = "relu")

main_output <- layer_concatenate(c(lstm_out, auxiliary_output)) %>%
  # the final simple NN
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "linear", name = 'main_output')

model_comv <- keras_model(
  inputs = c(main_input, auxiliary_input),
  outputs = c(main_output)
)

# compile the combined model
model_comv %>% compile(
  optimizer = "rmsprop",
  loss = list(main_output = 'mse'),
  metrics = list(main_output = 'mean_squared_error'),
  loss_weights = list(main_output = 1.0)
)

# And trained it via:
history =  model_comv %>% fit(
  x = list(training$description,
           training %>% select(variety, country, province, price)),
  y = list(main_output = training$points),
  epochs = 5,
  batch_size = 32,
  validation_split = 0.2
)

# prediction
pred_comb <- predict(model_comv, list(testing$description,
                                      testing %>% select(variety, country, province, price)) )

# MSE for model comparison
mse_comb <- sum( (pred_comb - testing$points)^2 ) / nrow(testing)

The combined model joins the two models specified above with the help of the layer_concatenate() function. The concatenation layer merges the output of the auxiliary branch (input layers, feature layer and first hidden layer of the simple NN) with the output of the text branch (input layer, text vectorization layer, embedding layer and LSTM layer). The merged vector is then sent to a hidden layer with 32 units and the final output layer with 1 unit.
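
What the concatenation does to the shapes can be sketched with plain matrices. The values below are random stand-ins for the outputs of the two branches (16 LSTM units and 32 hidden units, as in the code above), and the weights `W1`, `W2` are hypothetical; this is an illustration of the forward pass, not the trained model.

```r
set.seed(1)
n <- 4  # a toy batch of 4 wines

# Stand-ins for the outputs of the two branches
lstm_out <- matrix(rnorm(n * 16), nrow = n)  # text branch: 16 features
aux_out <- matrix(rnorm(n * 32), nrow = n)   # auxiliary branch: 32 features

# Concatenation puts the features side by side: n x 48
merged <- cbind(lstm_out, aux_out)

# Final simple NN: 48 -> 32 (relu) -> 1 (linear), with random weights
relu <- function(x) pmax(x, 0)
W1 <- matrix(rnorm(48 * 32, sd = 0.1), nrow = 48)
b1 <- rep(0, 32)
W2 <- matrix(rnorm(32, sd = 0.1), nrow = 32)
b2 <- 0

hidden <- relu(sweep(merged %*% W1, 2, b1, "+"))
points_hat <- hidden %*% W2 + b2
```

The final layers therefore see 48 features per wine, and training adjusts all the weights in both branches jointly.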

Evaluation of the methods

Now it is the exciting moment! How do these models perform in prediction? Let us see!

There are two metrics that we use for evaluation: MSE and accuracy.
MSE is defined as follows:

$$ MSE = \frac{1}{n} \cdot \sum_{i=1}^{n} (Y_{i} - \hat{Y}_{i})^2 $$

$Y_{i}$ is the real point of a wine, $\hat{Y}_{i}$ is the predicted point and $n$ is the total number of observations.

Since the prediction is numeric, the accuracy of prediction is calculated in the following way: if the prediction is within the true value $\pm$ a threshold, it is regarded as an accurate prediction. We then calculate the proportion of accurate predictions in the test data set. The threshold is varied over a grid from 1 to 10.5 with step size 0.5.
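
Both metrics are easy to write down in base R. The functions `mse()` and `threshold_accuracy()` and the toy vectors below are hypothetical names and data used only to illustrate the definitions above.

```r
# Mean squared error between true and predicted points
mse <- function(y, y_hat) mean((y - y_hat)^2)

# Share of predictions that fall within +/- threshold of the true value
threshold_accuracy <- function(y, y_hat, threshold) {
  mean(abs(y - y_hat) <= threshold)
}

# Toy true and predicted points
y <- c(85, 90, 88, 92)
y_hat <- c(86, 87.5, 88, 95)

toy_mse <- mse(y, y_hat)

# Accuracy over the threshold grid from 1 to 10.5 with step size 0.5
thresholds <- seq(1, 10.5, by = 0.5)
acc <- sapply(thresholds, function(t) threshold_accuracy(y, y_hat, t))
```

Accuracy can only stay equal or increase as the threshold grows, so the curve over `thresholds` is non-decreasing and reaches 1 once the threshold exceeds the largest absolute error.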

| Model | Validation MSE | Prediction MSE |
|---|---|---|
| LM with price | 8.40 | 8.09 |
| LM with price/variety/province | 7.56 | 7.36 |
| Simple NN | 6.49 | 6.35 |
| LSTM | 10.54 | 10.58 |
| Combined NN & LSTM | 6.28 | 6.13 |
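
The code for the linear model (LM) baselines is not shown in this post. As a rough sketch of how such a baseline could be fitted and scored with the same prediction MSE, here is a base R example on synthetic data; the data frame `toy` and the relationship between price and points are made up for illustration.

```r
set.seed(42)

# Synthetic data: points loosely increasing with price (made-up relationship)
toy <- data.frame(price = runif(200, 5, 100))
toy$points <- 84 + 2 * log(toy$price) + rnorm(200)

# Same 80% / 20% split idea as for the neural networks
train_id <- sample.int(nrow(toy), size = nrow(toy) * 0.8)
train <- toy[train_id, ]
test <- toy[-train_id, ]

# Linear model with price as the only predictor
lm_price <- lm(points ~ price, data = train)

# Prediction MSE on the held-out set
pred_lm <- predict(lm_price, newdata = test)
mse_lm <- mean((pred_lm - test$points)^2)
```

Adding variety and province to the formula would give the second baseline in the table.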


Based on MSE, the combined model has the best prediction performance, with a prediction MSE of around 6.1. The simple NN based on categorical and numeric data has similar performance. The LSTM has the worst performance, even worse than the linear models. This could be due to Bayes error. Similarly, for a human it is comparatively easier to judge a wine by price and origin than by reading a long description.

However, from MSE and accuracy we can see that the concatenated model is still the best, rather than landing somewhere between the simple NN and the LSTM. This is encouraging because it indicates that the concatenation is worthwhile.

Potential problems, improvements and ethical issues


Potential problems

  • The variable "points" only ranges from 80 to 100, which could be too narrow for training and evaluation.
  • The original data set has 150,000 observations, but after the data cleaning there are only 110,000 observations left for modeling. The data set is perhaps not big enough.
  • There is a variable called "winery" in the original data set, but it contains non-English letters, so we excluded it although it could be an important factor.
  • Because of limited computing power, parameters such as the number of hidden units in each layer and the number of epochs are all kept small.


Improvements

  • Increasing the number of epochs and using early stopping to make sure that the training reaches its best stage.
  • Increasing the number of units in the hidden layers to increase the complexity of the model, and using dropout to prevent over-fitting.
  • It might be better to handle different types of variables separately, so setting up different input branches for the numerical and categorical variables and tuning them separately may be a good choice.
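
The early stopping rule mentioned in the first point can be sketched in base R. The function `early_stop_epoch()` and the validation losses below are hypothetical; in practice keras provides `callback_early_stopping()`, which can be passed to `fit()` instead of writing the logic by hand.

```r
# Return the epoch at which training would stop: the first epoch where the
# validation loss has not improved for `patience` consecutive epochs
early_stop_epoch <- function(val_loss, patience) {
  best <- Inf
  wait <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best) {
      best <- val_loss[epoch]
      wait <- 0
    } else {
      wait <- wait + 1
      if (wait >= patience) return(epoch)
    }
  }
  length(val_loss)  # never triggered: train for all epochs
}

# Made-up validation losses per epoch
val_loss <- c(8.0, 7.1, 6.6, 6.7, 6.9, 6.8, 6.5)

stop_at <- early_stop_epoch(val_loss, patience = 2)
best_epoch <- which.min(val_loss[seq_len(stop_at)])
```

With a small patience the rule can stop before a later improvement (epoch 7 in this toy series), so patience is itself a parameter worth tuning.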

Ethical issues

Finally, I would like to talk about ethical issues.
Since our R setup only handled English letters reliably, we did not consider observations that contain languages other than English in the variables of interest. Therefore the model might be biased, because it ignores wines from non-English speaking regions and descriptions written in other languages, for example French.

Thank you for reading!
