Bolin Wu

NLP 3: Text Classification in Python

In the previous two posts, I shared basic concepts and useful functions for text mining and NLP. In this third post we finally move on to the advanced part of text mining: building text classification models. I will cover the main tasks of text classification, two useful classification models, their implementation in Python, and ways to improve classification performance.

Classification of Text

Text classification is one of the most widely studied topics in machine learning. Some examples include:

  • Topic identification: Is an article about sports or technology?
  • Spam detection: Is an email a spam or not?
  • Sentiment analysis: Is a movie review positive or negative?
  • Spelling correction: "Weather" or "whether"? "Color" or "colour"?

Supervised learning

There are two phases of supervised learning: training phase and inference phase.

At the training phase, we need to know:

  1. What are the features, and how do we represent them?
  2. What is the appropriate classification model?
  3. What are the model parameters?

At the inference phase, we need to know:

  1. What is the expected performance?
  2. How do we measure the performance?

Identifying features from text

Textual data is unique in that features can be pulled out of the text at different granularities.

The most basic features of a text are its words. The English language has roughly 40,000 unique words in common use, so a bag-of-words representation of common English already gives about 40,000 features. This number grows much larger for social media text, where there are many more unique spellings.

With so many features, one question is how to handle commonly occurring words. These are often called stop words, like "the". If we want to decide whether an article belongs to the sports class, the word "shoot" is far more informative than "the".

The next step is normalization. In some cases we want to lowercase the words so that different capitalizations of the same word are not counted as separate features. In other cases we may want to leave the text as it is. For example, "US" in capitals usually means the United States, whereas lowercasing it makes it indistinguishable from the word "us". We need to make a choice.

There are also choices around stemming and lemmatization. For example, we usually do not want singular and plural forms of the same noun to end up as separate features.
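
To make these choices concrete, here is a minimal preprocessing sketch using NLTK; the sample sentence and the exact set of steps are my own illustration, not part of the original study.

# a minimal preprocessing sketch with NLTK: lowercasing, stop-word removal
# and lemmatization; the sample sentence is invented for illustration
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The players were shooting goals during the US matches"
tokens = nltk.word_tokenize(text.lower())
# note: lowercasing turns "US" into "us", which is then dropped as a stop word
tokens = [t for t in tokens if t not in stopwords.words('english')]
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])
# ['player', 'shooting', 'goal', 'match']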

Naive Bayes Classifiers

Naive Bayes classifiers are among the most commonly used classification models. The strengths of this model are that it works for both small and large datasets and that it is much faster to train than a neural network or a gradient boosting tree. Its shortcoming is that the model is not easily interpretable.

The Naive Bayes classifier is called naive because it assumes that features are independent of each other given the class label. For text classification tasks, it is considered a very strong baseline model.

Example and intuition

To illustrate Naive Bayes classifiers, let us start with an example. Suppose we are interested in classifying search queries into three classes: Entertainment, Computer Science (CS) and Zoology. The most common class of the three is Entertainment and the least common is Zoology (our prior knowledge). If we get the query "Python", should we classify it as Entertainment, Computer Science or Zoology? The word could refer to the snake (Zoology), the programming language (Computer Science), or Monty Python (Entertainment). Given only the word "Python", it is more likely to be Zoology than Entertainment. Given the words "download Python", it is more likely to be Computer Science than Zoology.

The intuition behind the Naive Bayes classifier is that we update the likelihood of each class when new information arrives. We start with prior probabilities: Pr(y = Entertainment), Pr(y = CS), Pr(y = Zoology). Without any other information, we might say Pr(y = Entertainment) is the largest. When new information comes in, we compute the posterior probability, e.g. Pr(y = Entertainment | x = "Python"), and this updated probability may tell us that the Entertainment class is now less likely.

According to the Bayes' Rule:

\text{Posterior probability} = \frac{ \text{Prior probability} \times \text{Likelihood} }{\text{Evidence}}

Pr(y|X) = \frac{Pr(y)Pr(X|y)}{Pr(X)}

In our example it becomes:

Pr(y = CS|X = \text{"Python"}) = \frac{Pr(y = CS)Pr(\text{"Python"}|y = CS)}{Pr(\text{"Python"})}

In the naive bayes classification task, we are interested in finding

y^{*} = \underset{y}{\operatorname{argmax}} Pr(y|X) = \underset{y}{\operatorname{argmax}} Pr(y) \times Pr(X|y)

We can see that the denominator is dropped: for a given X, Pr(X) is a constant, and we are only interested in which y gives the largest probability.

By applying the naive assumption that, given the class label, the features are independent of each other, we have:

y^{*} = \underset{y}{\operatorname{argmax}} Pr(y|X) = \underset{y}{\operatorname{argmax}} Pr(y) \times \prod_{i=1}^{n} Pr(x_{i}|y)

If we have the query "Python download", we would have:

y^{*} = \underset{y}{\operatorname{argmax}} Pr(y) \times Pr(\text{"Python"}|y) \times Pr(\text{"download"}|y)

where y = "CS", "Entertainment" or "Zoology".

What are the parameters?

  • Prior probabilities: Pr(y) for all y in Y.
  • Likelihoods: Pr(x_{i}|y) for all features x_{i} and all labels y in Y.

Both of them can be estimated simply by counting the number of instances.
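
As a hedged illustration of "estimating by counting", here is a toy Naive Bayes built from a handful of invented queries; none of the numbers come from a real query log.

# toy Naive Bayes by hand: priors and likelihoods are estimated by counting
# all queries and labels below are made up purely for illustration
from collections import Counter

data = [
    ("download python code", "CS"),
    ("python install error", "CS"),
    ("monty python sketch", "Entertainment"),
    ("new movie trailer", "Entertainment"),
    ("python snake habitat", "Zoology"),
]

classes = [label for _, label in data]
prior = {c: n / len(data) for c, n in Counter(classes).items()}

# word counts per class, with add-one smoothing to avoid zero probabilities
word_counts = {c: Counter() for c in prior}
for text, label in data:
    word_counts[label].update(text.split())

vocab = {w for text, _ in data for w in text.split()}

def likelihood(word, c):
    return (word_counts[c][word] + 1) / (sum(word_counts[c].values()) + len(vocab))

def predict(query):
    scores = {}
    for c in prior:
        score = prior[c]
        for w in query.split():
            score *= likelihood(w, c)
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("python download"))   # "CS" with these toy counts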

Support vector machines

The Support Vector Machine (SVM) is also one of the first models we should try when solving classification tasks. Its advantages are that it has a strong theoretical foundation and tends to be among the most accurate classifiers, especially on high-dimensional data. Here I will not go through the technical details, but instead share some key points about using SVMs.

Applicable for numeric features

SVMs use numbers to decide where to place the decision boundaries. This means that when we have categorical features, we have to convert them into numeric features.

Normalization

When we use an SVM we usually normalize the features into the 0-1 range, because we do not want one dimension to take very large values while another takes very small ones.
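
A quick sketch of both points, assuming a made-up data frame with one categorical and one numeric column: OneHotEncoder turns the category into numbers and MinMaxScaler squeezes the numeric column into the 0-1 range.

# hedged sketch: encode a categorical feature and scale a numeric one
# before feeding them to an SVM; the toy data frame is invented
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

df = pd.DataFrame({'source': ['email', 'sms', 'sms'],
                   'length': [120, 45, 300]})

encoded = OneHotEncoder().fit_transform(df[['source']]).toarray()
scaled = MinMaxScaler().fit_transform(df[['length']])

print(encoded)  # categorical feature as 0/1 columns
print(scaled)   # numeric feature mapped into [0, 1]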

Parameters

  • C: the regularization parameter. Larger values of C lead to less regularization: the model is encouraged to fit the training data as well as possible, and every data point matters (a tuning sketch follows after this list).
  • Kernels: there are linear kernels, the RBF kernel, polynomial kernels, etc. Linear kernels usually work best for text data.
  • multi_class: indicates how a multi-class problem is handled. For multi-class labels we usually choose ovr (one-vs-rest) rather than one-vs-one, as it trains fewer classifiers.
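
As a sketch of how these parameters can be tuned in practice, the snippet below runs a small cross-validated grid search over C for a linear-kernel SVC; the toy corpus, labels and grid values are assumptions of mine, not from the original post.

# hedged sketch: tune C for a linear-kernel SVM with cross-validation
# the toy corpus, labels and C grid are invented for illustration
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free entry to win cash", "see you at the office",
         "claim your free reward", "lunch plans for tomorrow"]
labels = [1, 0, 1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# note: SVC's decision_function_shape defaults to 'ovr' for multi-class problems
grid = GridSearchCV(SVC(kernel='linear'), param_grid={'C': [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X, labels)
print(grid.best_params_)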

Toolkits for supervised learning

In Python there are quite a few available toolkits for supervised text classification.

  • Scikit-learn: an open-source machine learning library for Python.
  • NLTK: a natural language toolkit that interfaces with scikit-learn and other ML toolkits.

Below is a snippet of code for training a classifier and making predictions.

# train a Naive Bayes classifier
# import libraries
from sklearn import naive_bayes
from sklearn import metrics

clfrNB = naive_bayes.MultinomialNB()

# train the NB model
clfrNB.fit(train_data, train_labels)

# predict labels for the new data set
predicted_labels = clfrNB.predict(test_data)

# evaluate the model
metrics.f1_score(test_labels, predicted_labels, average = 'micro')

# train an SVM classifier
from sklearn import svm
clfrSVM = svm.SVC(kernel = 'linear', C = 0.1)

# train the SVM model
clfrSVM.fit(train_data, train_labels)

# make predictions
predicted_labels = clfrSVM.predict(test_data)

Spam detection study

We have covered the theory of text classification. Now let us dive into an application. In this case study we will explore text message data and build classification models to predict whether a document is spam or not.

Import the data and take a preliminary look

Data is available here

# import data from google drive
# use the following code if you want to connect Colab to Google Drive
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive
import pandas as pd
import numpy as np

spam_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Applied_Text_Mining_in_Python/TextClassification/spam.csv')

spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)
text target
0 Go until jurong point, crazy.. Available only ... 0
1 Ok lar... Joking wif u oni... 0
2 Free entry in 2 a wkly comp to win FA Cup fina... 1
3 U dun say so early hor... U c already then say... 0
4 Nah I don't think he goes to usf, he lives aro... 0
5 FreeMsg Hey there darling it's been 3 week's n... 1
6 Even my brother is not like to speak with me. ... 0
7 As per your request 'Melle Melle (Oru Minnamin... 0
8 WINNER!! As a valued network customer you have... 1
9 Had your mobile 11 months or more? U R entitle... 1
spam_data.shape
(5572, 2)

The total number of records is 5572.

# split data into training set and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)
X_train.shape
(4179,)
X_train.head(10)
872                       I'll text you when I drop x off
831     Hi mate its RV did u hav a nice hol just a mes...
1273    network operator. The service is free. For T &...
3314    FREE MESSAGE Activate your 500 FREE Text Messa...
4929    Hi, the SEXYCHAT girls are waiting for you to ...
4249                              How much for an eighth?
3640    You can stop further club tones by replying \S...
1132                  Good morning princess! How are you?
3318                     Kay... Since we are out already 
5241                            Its a part of checking IQ
Name: text, dtype: object

What percentage of the documents in spam_data are spam?

print("The percentage of spam documents are {}".format(spam_data['target'].mean()))
The percentage of spam documents are 0.13406317300789664

What is the average length of documents (number of characters) for not spam and spam documents?

spam_text = spam_data[spam_data['target'] ==1].loc[:,'text']
ham_text = spam_data[spam_data['target'] ==0].loc[:,'text']
avg_len_spam = sum([len(w) for w in spam_text]) / len(spam_text)
avg_len_ham = sum([len(w) for w in ham_text]) / len(ham_text)
{'avg for spam':avg_len_spam,'avg for not spam':avg_len_ham}
{'avg for not spam': 71.02362694300518, 'avg for spam': 138.8661311914324}

On average, spam messages are longer than non-spam messages.

What is the average number of digits per document for not spam and spam documents?

spam_text_DigitLen = spam_text.str.findall('(\d)').str.len()
ham_text_DigitLen = ham_text.str.findall('(\d)').str.len()

{'not spam':sum(ham_text_DigitLen)/len(ham_text), 'spam':sum(spam_text_DigitLen)/len(spam_text)}
{'not spam': 0.2992746113989637, 'spam': 15.759036144578314}

Non-spam documents contain far fewer digits per document than spam documents.

What is the average number of non-word characters per document for not spam and spam documents?

spam_text_NonWordLen = spam_text.str.findall('\W').str.len()
ham_text_NonWordLen = ham_text.str.findall('\W').str.len()
{'not spam':sum(ham_text_NonWordLen)/len(ham_text), 'spam':sum(spam_text_NonWordLen)/len(spam_text)}
{'not spam': 17.29181347150259, 'spam': 29.041499330655956}

The average number of non-word characters per document is also smaller for the non-spam messages.

Fit the training data X_train using a Count Vectorizer with default parameters.

One thing worth noticing is that the computer cannot deal with text directly. We have to convert the text into a numeric representation that scikit-learn can use. The bag-of-words approach is a commonly used way to represent text in machine learning: it ignores structure and only counts how often each word occurs. The Count Vectorizer lets us use the bag-of-words approach by converting a collection of texts into a matrix of token counts.

Fitting the Count Vectorizer consists of tokenizing the training data and building the vocabulary. It tokenizes each document by finding all sequences of at least two letters or numbers separated by word boundaries, converts everything to lowercase, and builds a vocabulary from these tokens.

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(X_train)


We can get the vocabulary by using the get_feature_names() method:

vect.get_feature_names()[::1000]
['00', 'arnt', 'csh11', 'goggles', 'loverboy', 'point', 'soup', 'wasted']

Looking at every 1000th feature, we get a small sense of what the vocabulary looks like. It is pretty messy, including misspellings and numbers.

Check the total number of features (words)

# check the length of total features
len(vect.get_feature_names())
7354

By checking the length of get_feature_names(), we can see that we are working with over 7000 features.

Find the longest word

max_len = max([len(w) for w in vect.get_feature_names()])
longest_word = [w for w in vect.get_feature_names() if len(w) == max_len]
# convert from list to string
''.join(longest_word)
'customer service representative'

Hmmm, interesting: the "longest word" is not even a single real word.

Build classification model

Here I will show the procedure for implementing the models and adding features. The final goal is to compare the classification performance of different models by their AUC scores.

Transform input data

First, we transform the training data X_train using the transform method of the fitted vectorizer. This gives us the bag-of-words representation of X_train. The representation is stored in a SciPy sparse matrix, where each row corresponds to a document and each column to a word from our training vocabulary. The entries of this matrix are the number of times each word appears in each document. It is called a sparse matrix because the vocabulary is much larger than the set of words that appear in any single document, so most entries of the matrix are zero.

X_train_vectorized = vect.transform(X_train)
X_train_vectorized.shape
(4179, 7354)

We can see that the number of rows equals the size of the training set and the number of columns equals the number of features.

Fit model and evaluate prediction

Next, fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1, and find the area under the curve (AUC) score using the transformed test data.

from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

clfrNB = naive_bayes.MultinomialNB(alpha=0.1)
# train the NBC model
clfrNB.fit(X_train_vectorized, y_train)
predictions = clfrNB.predict(vect.transform(X_test))
ROC_score = roc_auc_score(y_test,predictions)
ROC_score
0.9720812182741116

Transform data by tf-idf instead

Term frequency-inverse document frequency (tf-idf) allows us to weight terms by how important they are to a document. High weight is given to terms that appear often in a particular document but do not appear often in the rest of the corpus. Features with low tf-idf are either commonly used across all documents or rarely used and only occur in long documents. Features with high tf-idf are frequently used within specific documents but rarely used across all documents.
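
For reference, and as far as I understand scikit-learn's defaults (smooth_idf=True, no sublinear tf), the Tfidf Vectorizer computes roughly the following weight for term t in document d, where n is the number of documents and df(t) is the number of documents containing t, followed by L2-normalization of each document vector:

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1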

Let us fit and transform the training data X_train using a Tfidf Vectorizer with default parameters.

What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?

from sklearn.feature_extraction.text import TfidfVectorizer
# transform with tfidf
vect = TfidfVectorizer().fit(X_train)
# get feature name
feature_names = np.array(vect.get_feature_names())
# transform the vectorized data to sparse matrix representation
X_train_vectorized = vect.transform(X_train)
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()


# the smallest tfidf
tfidf_smallest = pd.Series(data = sorted(X_train_vectorized.max(0).toarray()[0])[:20], index = feature_names[sorted_tfidf_index[:20]])

# the largest tfidf
tfidf_largest = pd.Series(data = sorted(X_train_vectorized.max(0).toarray()[0])[:-21:-1], index = feature_names[sorted_tfidf_index[:-21:-1]])
tfidf_smallest
sympathetic     0.074475
healer          0.074475
aaniye          0.074475
dependable      0.074475
companion       0.074475
listener        0.074475
athletic        0.074475
exterminator    0.074475
psychiatrist    0.074475
pest            0.074475
determined      0.074475
chef            0.074475
courageous      0.074475
stylist         0.074475
psychologist    0.074475
organizer       0.074475
pudunga         0.074475
venaam          0.074475
diwali          0.091250
mornings        0.091250
dtype: float64
tfidf_largest
146tf150p    1.000000
havent       1.000000
home         1.000000
okie         1.000000
thanx        1.000000
er           1.000000
anything     1.000000
lei          1.000000
nite         1.000000
yup          1.000000
thank        1.000000
ok           1.000000
where        1.000000
beerage      1.000000
anytime      1.000000
too          1.000000
done         1.000000
645          1.000000
tick         0.980166
blank        0.932702
dtype: float64

We could improve the transformation by ignoring terms that have a document frequency strictly lower than 3.

To see if it helps, we fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1 and compute the area under the curve (AUC) score using the transformed test data.

vect = TfidfVectorizer(min_df = 3).fit(X_train)
X_train_vectorized = vect.transform(X_train)
clfrNB = naive_bayes.MultinomialNB(alpha=0.1)
# train the NBC model
clfrNB.fit(X_train_vectorized, y_train)
predictions = clfrNB.predict(vect.transform(X_test))
ROC_score = roc_auc_score(y_test,predictions)
ROC_score
0.9416243654822335

This AUC is lower than that of the model trained without the tf-idf transformation.

Update tf-idf and add feature

Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5.

Using this document-term matrix and an additional feature, the length of document (number of characters), fit a Support Vector Classification model with regularization C=10000. Then compute the area under the curve (AUC) score using the transformed test data.

# the following function is to combine new features into the training data
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
# find the character length for each document
text_train_len = [len(w) for w in X_train]
text_test_len = [len(w) for w in X_test]

# transform training data
vect = TfidfVectorizer(min_df = 5).fit(X_train)
X_train_vectorized = vect.transform(X_train)

# add feature
train_data = add_feature(X_train_vectorized, text_train_len )

# prepare the test data
X_test_vectorized = vect.transform(X_test)
test_data = add_feature(X_test_vectorized,text_test_len)
# train SVM
from sklearn.svm import SVC
clfrSVM = SVC(C = 10000, gamma = 'auto')
clfrSVM.fit(train_data, y_train)
predictions = clfrSVM.predict(test_data)
ROC_score = roc_auc_score(y_test,predictions)    
ROC_score
0.9581366823421557

After raising the minimum document frequency to 5, adding a length feature and switching to an SVM, the AUC increases slightly from 0.942 to 0.958.

Add context features to the model

Next we want to see how to add context features. Without them, the machine may treat the two phrases "not an issue, it is working" and "an issue, it is not working" as the same.

We can use n-grams to add context features. For example, if we add bigrams, the machine will treat "is working" as a single feature.
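
A small sketch of this effect, using the two phrases above (the rest of the snippet is my own illustration): with unigrams only, the two messages get identical bag-of-words vectors, while adding bigrams introduces features such as "is working" and "not working" that tell them apart.

# hedged sketch: unigrams alone cannot separate these two phrases,
# but adding bigrams introduces features like "is working" / "not working"
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not an issue, it is working", "an issue, it is not working"]

unigram = CountVectorizer(ngram_range=(1, 1)).fit(docs)
print(unigram.transform(docs).toarray())   # identical rows: same bag of words

bigram = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(bigram.get_feature_names())          # now includes "is working", "not working"
print(bigram.transform(docs).toarray())    # rows differ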

Fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams).

Using this document-term matrix and the following additional features:

  • the length of document (number of characters)
  • number of digits per document

fit a Logistic Regression model with regularization C=100. Then compute the area under the curve (AUC) score using the transformed test data.

# find the desired features
text_train_len = [len(w) for w in X_train]
digit_train_len= X_train.str.findall('(\d)').str.len()
text_test_len = [len(w) for w in X_test]
digit_test_len= X_test.str.findall('(\d)').str.len()

# transform training data
vect = TfidfVectorizer(min_df = 5, ngram_range=(1,3)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
# add feature when training
train_data = add_feature(X_train_vectorized, [text_train_len, digit_train_len] )

# prepare test data
X_test_vectorized = vect.transform(X_test)
# add features
test_data = add_feature(X_test_vectorized,[text_test_len,digit_test_len])

from sklearn.linear_model import LogisticRegression
logistic_model = LogisticRegression(C = 100)
logistic_model.fit(train_data,y_train)
predictions = logistic_model.predict(test_data)
roc_auc_score(y_test, predictions)
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)

0.9759031798040846

We can see that the AUC score increases further, from 0.958 to 0.976.

Finally, fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5.

To tell the vectorizer to use character n-grams, pass in analyzer='char_wb', which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.
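
A quick sketch of what analyzer='char_wb' produces, using an invented misspelling: the correct word and the misspelled one still share many character n-grams, which is the intuition behind the robustness claim.

# hedged sketch: char_wb builds character n-grams only inside word boundaries,
# so a misspelling still shares many features with the original word
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 5)).fit(['free', 'freee'])
print(vect.get_feature_names())
# n-grams such as ' f', 'fr', 're', 'ee', 'fre', 'free' appear in both spellings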

Using this document-term matrix and the following additional features:

  • the length of document (number of characters)
  • number of digits per document
  • number of non-word characters (anything other than a letter, digit or underscore.)

fit a Logistic Regression model with regularization C=100. Then compute the area under the curve (AUC) score using the transformed test data.

# get the added features
text_train_len = [len(w) for w in X_train]
# add digit length per document in training set
digit_train_len= X_train.str.findall('(\d)').str.len()
# add number of non-word character
NWC_train_len = X_train.str.findall('\W').str.len()

text_test_len = [len(w) for w in X_test]
# add digit length per document in test set
digit_test_len= X_test.str.findall('(\d)').str.len()
# add number of non-word character
NWC_test_len = X_test.str.findall('\W').str.len()

# update the vectorizer and transform method
vect = TfidfVectorizer(min_df = 5, ngram_range=(2,5), analyzer = 'char_wb').fit(X_train)
X_train_vectorized = vect.transform(X_train)
train_data = add_feature(X_train_vectorized, [text_train_len,digit_train_len,NWC_train_len] )

# Prepare the test data
X_test_vectorized = vect.transform(X_test)
test_data = add_feature(X_test_vectorized,[text_test_len,digit_test_len,NWC_test_len])


logistic_model = LogisticRegression(C = 100)
logistic_model.fit(train_data, y_train)
predictions = logistic_model.predict(test_data)

auc_score = roc_auc_score(y_test, predictions)
auc_score
0.972947048537426

Even though we added three features and switched to character n-grams, the AUC does not improve; it is roughly the same as before (0.973 vs 0.976).

Let us also find the 10 smallest and 10 largest coefficients from the model.

sorted_coef_index = logistic_model.coef_[0].argsort()
feature_names = np.array(vect.get_feature_names())
print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest coefs:\n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))


Smallest coefs:
['..' 'i ' 'ca' 'if' ' i' '. ' 'if ' 't;' ' 6' ' if ']

Largest coefs:
['**' 'ww' '***' 'xt' 'co' '****' 'ex' 'uk' 'tone' 'ne']

By sorting the coefficients and looking at the ten smallest and ten largest, we can see that the model associates character sequences like '..', 'i ' and 'ca' with non-spam documents, and sequences like '**' and 'ww' with spam documents.

Ending

In this post I have shared the principles of two classical text classification models, the Naive Bayes classifier and the SVM. I also showed a basic procedure for analyzing text data:

  1. Read the text file.
  2. Get an overall understanding of the data: the data size, the average document length in spam vs. non-spam documents, and so on. These summaries can become additional features for modelling.
  3. Vectorize and transform the data for modelling.
  4. Build the model and calculate the evaluation metric.
  5. Improve the model's prediction performance by adding features and trying different ways of vectorizing and transforming the data.

Please note that I did not apply lemmatization to the features in this study. However, I believe it is quite possible that the prediction performance would improve after lemmatization.
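
For completeness, here is a hedged sketch of how lemmatization could be plugged in via a custom tokenizer; the LemmaTokenizer helper below is hypothetical (my own naming), and whether it actually lifts the AUC would need to be tested.

# hedged sketch: a custom tokenizer that lemmatizes tokens before they reach
# the vectorizer; the LemmaTokenizer helper is hypothetical, not from this post
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('wordnet')

class LemmaTokenizer:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(t) for t in nltk.word_tokenize(doc.lower())]

vect = TfidfVectorizer(tokenizer=LemmaTokenizer(), min_df=5)
# vect.fit(X_train) could then replace the earlier vectorizer, while the rest of
# the pipeline (add_feature, LogisticRegression, roc_auc_score) stays the same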

The biggest challenge for me in text classification was not building the models but understanding how the vectorizers and the transform step work. It is also important to review basic regular expressions, because they matter a lot in text classification tasks.

Prudence is a fountain of life to the prudent.