Bolin Wu

NLP 4: Semantic Text Similarity and Topic Modeling

2021-07-27 · 19 min read
Natural Language Process Python

Topic modeling is a useful tool for getting a general picture of a long text document. Unlike LSTMs or other RNNs, a topic model is used mainly for exploration rather than prediction. In this post I will share measures of similarity between words, the concept of topic modeling, and its application in Python.

Semantic text similarity

Suppose we have a text passage and a sentence. Based on the information in the passage, we need to decide whether the sentence is correct, that is, whether its meaning can be derived from the passage. This is a typical semantic similarity task. One useful resource for semantic similarity analysis is WordNet.


WordNet is a semantic dictionary of words, interlinked by semantic relations. It was developed mostly for English, but it is now also available for quite a few other languages. It includes rich linguistic information such as part of speech, word senses, derivationally related forms, etc. WordNet organizes this information in a hierarchy. One measure that uses this hierarchy is path similarity. It is based on the shortest path between two concepts, found by counting how many steps it takes to get from one word to the other in the hierarchy. The similarity measure is inversely related to the path distance.

In Python, WordNet can be imported through NLTK.

import nltk
from nltk.corpus import wordnet as wn

# find the appropriate meaning (synset) of each word
bus = wn.synset('bus.n.01')
train = wn.synset('train.n.01')
# '.n' selects the noun sense
# '.01' selects the first definition

Once we find the proper meaning of the word, we could find the path similarity as follows.
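
To make the idea concrete, here is a minimal sketch of path similarity on a small hand-made hierarchy. The `PARENT` table and the helper names are hypothetical and are not WordNet data. The score follows the same definition NLTK uses, 1 / (1 + shortest path length). With the synsets defined above, the corresponding NLTK call would be `bus.path_similarity(train)`.

```python
# Toy is-a hierarchy: child -> parent (hypothetical, not taken from WordNet)
PARENT = {
    "bus": "public_transport",
    "train": "public_transport",
    "public_transport": "conveyance",
    "car": "motor_vehicle",
    "motor_vehicle": "conveyance",
    "conveyance": None,
}

def ancestors(node):
    """Map each node on the way to the root to its distance from `node`."""
    dist, steps = {}, 0
    while node is not None:
        dist[node] = steps
        node = PARENT[node]
        steps += 1
    return dist

def path_similarity(a, b):
    """1 / (1 + length of the shortest path between a and b in the tree)."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = set(up_a) & set(up_b)
    shortest = min(up_a[n] + up_b[n] for n in common)
    return 1 / (1 + shortest)

print(path_similarity("bus", "train"))  # 2 steps apart
print(path_similarity("bus", "car"))    # 4 steps apart
```

Words that sit closer together in the hierarchy get a higher score, and a word compared with itself gets 1.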


Distributional similarity

Another measure of similarity relies on distributional similarity and collocations. The intuition is that two words that frequently appear in similar contexts are more likely to be semantically related. Take "Play at zoo." and "Play at amusement park.": both "zoo" and "amusement park" appear together with "play", so these two could be closely related. We can define the context as the words within a small window around the target word, or as words in a specific syntactic relation to the target word, etc.

Once we have defined the context, we can measure the strength of association. Raw co-occurrence frequency alone is a poor criterion. For example, "is" is a very frequent word, so it has a high chance of co-occurring with many other words and should therefore be given less weight. One measure that accounts for this is Pointwise Mutual Information (PMI):

$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$$

where w is the word of interest and c is the context word.
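
As a rough illustration of the formula, the sketch below computes PMI by hand on a tiny made-up corpus. The sentences, the choice of "context = the next word", and the use of log base 2 are all assumptions for this example. The probabilities are simple count ratios.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each sentence is already tokenized
sentences = [
    ["play", "at", "zoo"],
    ["play", "at", "park"],
    ["work", "at", "office"],
    ["play", "in", "park"],
]

word_counts = Counter()
pair_counts = Counter()
for tokens in sentences:
    word_counts.update(tokens)
    # context = the immediately following word (adjacent bigrams)
    pair_counts.update(zip(tokens, tokens[1:]))

n_words = sum(word_counts.values())
n_pairs = sum(pair_counts.values())

def pmi(w, c):
    """PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ), estimated from raw counts."""
    p_wc = pair_counts[(w, c)] / n_pairs
    p_w = word_counts[w] / n_words
    p_c = word_counts[c] / n_words
    return math.log2(p_wc / (p_w * p_c))

print(pmi("play", "at"))
print(pmi("in", "park"))
print(pmi("at", "park"))
```

In this toy data the pair ("in", "park") scores higher than ("at", "park"), because "at" is frequent overall and its co-occurrences are discounted. A pair that never occurs has probability zero, so this sketch would raise an error for it.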

In Python, NLTK also provides this measure:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# build the finder from a corpus; `text` is assumed to be a list of tokens
finder = BigramCollocationFinder.from_words(text)
# get the top 10 pairs using the PMI measure from bigram_measures
finder.nbest(bigram_measures.pmi, 10)

# finder also provides other functions,
# such as a frequency filter: finder.apply_freq_filter(min_freq)

From the discussion above, we see that measuring similarity between words and texts is important, that WordNet is a useful resource for identifying semantic relationships, and that NLTK provides convenient tools for both.

Topic modeling

Topic modeling is a coarse-level analysis of what is in a text collection. It is an exploratory tool for text mining that helps us get a general understanding of the text type. For example, it can tell us whether the text is about sports, business, or politics.

What is known are the text corpus and number of topics. What is not known are the actual topics and the topic distribution for each document.

There are two common approaches:

  • Probabilistic Latent Semantic Analysis (PLSA).
  • Latent Dirichlet Allocation (LDA). This is what we will discuss further in this post.


LDA is a generative model used extensively for modeling large text corpora. We can use it as a first step to understand what the text is about.

The general steps for working with LDA in Python are as follows:

  1. Pre-process the text:
     • Tokenize and normalize (lowercase) the sentences.
     • Remove stop words.
     • Stem the words.
  2. Convert the tokenized documents to a document-term matrix.
  3. Build LDA models on the document-term matrix.
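
The pre-processing and document-term matrix steps can be sketched in plain Python as below. This is only an illustration: the stop-word list, the `crude_stem` helper, and the sample sentences are made up here, and in practice one would use a real stop-word list and stemmer (for example from NLTK).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "is", "are", "and", "of", "on", "in"}

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [crude_stem(t) for t in tokens]               # stem

raw_docs = [
    "The players are winning the games",
    "Markets and traders in the city",
]
doc_set = [preprocess(d) for d in raw_docs]

# Document-term matrix: one row per document, one column per vocabulary term
vocab = sorted({t for doc in doc_set for t in doc})
counts = [Counter(doc) for doc in doc_set]
doc_term = [[c[term] for term in vocab] for c in counts]

print(doc_set)
print(vocab)
print(doc_term)
```

Each row of `doc_term` counts how often every vocabulary term occurs in one document, which is the kind of input an LDA model is trained on.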

Suppose we have a set of pre-processed text documents doc_set. A general process is as follows:

import gensim
from gensim import corpora, models

# create a dictionary, which is a mapping between IDs and words
dictionary = corpora.Dictionary(doc_set)

# convert each document to a bag-of-words vector: a list of (word ID, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in doc_set]

# the bag-of-words corpus acts as the document-term matrix passed to LdaModel
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 4, id2word = dictionary, passes = 50)

print(ldamodel.print_topics(num_topics = 4, num_words = 5))

Document similarity and LDA modeling in Python

Build similarity score function from scratch

For the first part, we will write the functions doc_to_synsets and similarity_score, which will be used by document_path_similarity to find the path similarity between two documents.

  • document_path_similarity: computes the symmetrical path similarity between two documents by finding the synsets in each document using doc_to_synsets, then computing similarities using similarity_score.
import nltk
import numpy as np
from nltk.corpus import wordnet as wn
import pandas as pd


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

convert_tag: converts the tag given by nltk.pos_tag to a tag used by wordnet.synsets. We will need to use this function in doc_to_synsets.

We first want to write the following functions:

  • doc_to_synsets: returns a list of synsets in a document. This function first tokenizes and part-of-speech tags the document using nltk.word_tokenize and nltk.pos_tag. Then it finds each token's corresponding synset using wn.synsets(token, wordnet_tag). The first synset match is used. If there is no match, that token is skipped.
  • similarity_score: returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Missing values are ignored.
def doc_to_synsets(doc):
    """Return the first matching synset for each token in doc."""
    # tokenize by nltk, which is covered in the previous NLTK post
    token_text = nltk.word_tokenize(doc)
    ps_tag = nltk.pos_tag(token_text)
    # pair each word with its converted (WordNet-style) tag
    con_tg = [(word, convert_tag(tg)) for word, tg in ps_tag]
    # find all the synsets, given word w[0] and POS tag w[1]
    synset_list = [wn.synsets(w[0], w[1]) for w in con_tg]
    # get rid of empty lists (tokens without a match)
    synset_list = [x for x in synset_list if x]

    # store the first synset match of each token
    valid_synset_01 = [synsets[0] for synsets in synset_list]
    return valid_synset_01

def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2, where s1 and s2 are two lists of synsets

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2
    """
    # store the largest similarity score for each synset in s1
    score_largest = []
    for syn1 in s1:
        # path_similarity returns None when no path exists, so drop those values
        score_within = [syn1.path_similarity(syn2) for syn2 in s2]
        score_within = [x for x in score_within if x is not None]
        # max() does not work on an empty list, so a synset with no valid score counts as 0
        score_largest.append(max(score_within) if score_within else 0)
    return sum(score_largest) / len(score_largest)

def document_path_similarity(doc1, doc2):
    """
    Finds the symmetrical similarity between doc1 and doc2

    Args:
      doc1, doc2: sentences, text, or document

    Returns:
      a similarity measure, data type = float
    """
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2

Let us test whether the functions work.

# test
d1 = 'I like cats'
d2 = 'I like dogs and dolphins'
document_path_similarity(d1, d2)

It seems to be working well.
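
To see how the symmetric score is put together without needing WordNet, here is a small sketch that works on a made-up table of pairwise similarities. The numbers and the helper name `directed_score` are hypothetical. I assume the pairwise measure is symmetric, so the reverse direction is just the transposed table.

```python
def directed_score(sim_rows):
    """Average, over rows, of the best non-missing similarity in each row.

    sim_rows[i][k] is the similarity of item i in the first list to item k
    in the second list, or None when no similarity is defined.
    A row with no valid value contributes 0.
    """
    best = []
    for row in sim_rows:
        valid = [s for s in row if s is not None]
        best.append(max(valid) if valid else 0)
    return sum(best) / len(best)

# Hypothetical pairwise similarities: 2 synsets in doc1 vs 3 synsets in doc2
s1_to_s2 = [
    [1.0, 0.2, None],
    [0.25, 0.5, 0.1],
]
# The reverse direction uses the transposed table
s2_to_s1 = [list(col) for col in zip(*s1_to_s2)]

forward = directed_score(s1_to_s2)    # (1.0 + 0.5) / 2
backward = directed_score(s2_to_s1)   # (1.0 + 0.5 + 0.1) / 3
symmetric = (forward + backward) / 2
print(forward, backward, symmetric)
```

The two directions differ because each one averages over a different number of synsets, which is why the final score takes the mean of both.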

Apply pre-defined functions to text data

The data can be found here.

The file consists of three columns: Quality, D1, and D2. Quality is an indicator of whether the two documents D1 and D2 are paraphrases of each other.

# import data from google drive
# use the following code to connect colab to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

# Use this dataframe for most_similar_docs and label_accuracy
paraphrases = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Applied_Text_Mining_in_Python/TopicModeling/paraphrases.csv')
paraphrases.head()

   Quality  D1                                                 D2
0  1        Ms Stewart, the chief executive, was not expec...  Ms Stewart, 61, its chief executive officer an...
1  1        After more than two years' detention under the...  After more than two years in detention by the ...
2  1        "It still remains to be seen whether the reven...  "It remains to be seen whether the revenue rec...
3  0        And it's going to be a wild ride," said Allan ...  Now the rest is just mechanical," said Allan H...
4  1        The cards are issued by Mexico's consulates to...  The card is issued by Mexico's consulates to i...

Next, we are interested in getting two pieces of information:

  • most_similar_docs: the pair of documents in paraphrases which has the maximum similarity score.
  • label_accuracy: find labels for the twenty pairs of documents by computing the similarity for each pair using document_path_similarity. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0).
import operator
def most_similar_docs():
  similarity_score_i = []
  for i in range(len(paraphrases)):
    d1 = paraphrases.iloc[i]['D1']
    d2 = paraphrases.iloc[i]['D2']
    # store the pair together with its similarity score
    similarity_score_i.append((d1, d2, document_path_similarity(d1, d2)))
  # sort by the 3rd element, which is the similarity score
  similarity_score_i = sorted(similarity_score_i, key=operator.itemgetter(2), reverse = True)
  return similarity_score_i[0]
print('The most similar paraphrases and corresponding similarity score are: \n {}'.format(most_similar_docs()))
The most similar paraphrases and corresponding similarity score are: 
 ('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.', '"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n', 0.9502923976608186)
def label_accuracy():
    from sklearn.metrics import accuracy_score
    paraphrases_score = paraphrases.copy()
    for i in range(len(paraphrases)):
      d1 = paraphrases.iloc[i]['D1']
      d2 = paraphrases.iloc[i]['D2']
      paraphrases_score.loc[i, 'similarity_score'] = document_path_similarity(d1, d2)
    # label a pair as a paraphrase (1) when its score is greater than 0.75, else 0
    paraphrases_score['LabelByScore'] = np.where(paraphrases_score['similarity_score'] > 0.75, 1, 0)
    return accuracy_score(paraphrases_score['Quality'], paraphrases_score['LabelByScore'])
print('The accuracy is: {}'.format(label_accuracy()))
The accuracy is: 0.7
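
The classifier rule itself is simple enough to check by hand. The sketch below applies the same 0.75 threshold to a few made-up scores and labels (they are not taken from paraphrases.csv) and computes accuracy as the share of matching labels.

```python
# Hypothetical similarity scores and gold labels (not from paraphrases.csv)
scores = [0.95, 0.80, 0.60, 0.78, 0.40]
gold = [1, 1, 1, 0, 0]

THRESHOLD = 0.75
predicted = [1 if s > THRESHOLD else 0 for s in scores]

# Accuracy = share of pairs whose predicted label equals the gold label
accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)
print(predicted, accuracy)
```

Moving the threshold trades one kind of error for the other: a lower value labels more pairs as paraphrases, a higher value labels fewer.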

Topic Modeling

Here we will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in newsgroup_data.

import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('/content/drive/MyDrive/Colab Notebooks/Applied_Text_Mining_in_Python/TopicModeling/newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find three or more letter tokens, remove stop_words, 
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern=r'(?u)\b\w\w\w+\b')
# Fit and transform the input data
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

# Use the gensim.models.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`
ldamodel = gensim.models.LdaModel(corpus, num_topics = 10, id2word = id_map, passes = 25, random_state = 34)
print(ldamodel.print_topics(num_topics = 10, num_words = 10))
[(0, '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'), (1, '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'), (2, '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'), (3, '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'), (4, '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'), (5, '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'), (6, '0.017*"information" + 0.014*"help" + 0.014*"medical" + 0.012*"new" + 0.012*"use" + 0.012*"000" + 0.012*"research" + 0.011*"university" + 0.010*"number" + 0.010*"program"'), (7, '0.022*"don" + 0.021*"people" + 0.018*"think" + 0.017*"just" + 0.012*"say" + 0.011*"know" + 0.011*"does" + 0.011*"good" + 0.010*"god" + 0.009*"way"'), (8, '0.034*"use" + 0.023*"apple" + 0.020*"power" + 0.016*"time" + 0.015*"data" + 0.015*"software" + 0.012*"pin" + 0.012*"memory" + 0.012*"simms" + 0.012*"port"'), (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')]

From the output above we can see the words that form each of the ten topics.
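
If we want to work with these topics programmatically, one option is to parse the printed strings. The sketch below does this for a shortened copy of topic 2 from the output above. The regular expression is my own assumption about the printed format (weight*"word" terms joined by " + ").

```python
import re

# One topic string in the format printed above (truncated to three terms)
topic_str = '0.061*"drive" + 0.042*"disk" + 0.033*"scsi"'

# Each term looks like weight*"word"
terms = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)]
top_words = [word for word, _ in terms]

print(terms)
print(top_words)
```

Looking at the top words this way makes it easier to give each topic a human-readable label, for example "storage hardware" for this one.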

The next question could be: what if we add a new document?

new_doc = ["\n\n Today is a long day. \
I have not eaten dinner since 8 pm. \
There are too many tasks to do. \n\n\
Bolin\n-- "]
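X_new = vect.transform(new_doc)
corpus_new = gensim.matutils.Sparse2Corpus(X_new, documents_columns=False)
# Topic distribution of the new document
doc_topics_new = list(ldamodel.get_document_topics(corpus_new))
# Take a look at what part of the distribution over the original corpus looks like
list(ldamodel.get_document_topics(corpus))[:5]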
[[(6, 0.9470456)],
 [(3, 0.48242074), (5, 0.40150425), (6, 0.096070684)],
 [(4, 0.9139199), (8, 0.057501744)],
 [(0, 0.030418986), (5, 0.10291002), (6, 0.8162316), (8, 0.04535334)],
 [(0, 0.020000072),
  (1, 0.8199883),
  (2, 0.020000583),
  (3, 0.020001208),
  (4, 0.02000103),
  (5, 0.020001303),
  (6, 0.020001443),
  (7, 0.020002373),
  (8, 0.02000331),
  (9, 0.020000374)]]
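
Each inner list holds (topic ID, probability) pairs for one document. A simple way to summarize such a distribution is to pick the most probable topic. The sketch below does this for the first two distributions copied from the output above. The helper name `dominant_topic` is my own.

```python
# Topic distributions for two documents, copied from the output above
doc_topics = [
    [(6, 0.9470456)],
    [(3, 0.48242074), (5, 0.40150425), (6, 0.096070684)],
]

def dominant_topic(distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])

for dist in doc_topics:
    print(dominant_topic(dist))
```

For the second document the top two topics are close in probability, so looking only at the dominant topic hides the fact that the document mixes several topics.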


So far I have shared some basic concepts of similarity measures and application of Topic Modeling.

A few points I noticed while learning:

  • When the number of topics is large, it takes a long time to train the LDA model.
  • LDA seems to treat every token individually, without taking context into account. Therefore there is still room to improve the model's explanatory power.
  • RegEx is fundamental in NLP tasks. When passing parameters to CountVectorizer, we can specify the token_pattern through RegEx. Though it is a small detail, it can greatly facilitate further analysis.
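
To illustrate the last point, here is a small sketch of what a token pattern does, using only Python's `re` module. The pattern below is an assumption on my side: it is one way to express "three or more character tokens", whereas the default pattern in CountVectorizer, `(?u)\b\w\w+\b`, keeps tokens of two or more characters. The sample sentence is made up.

```python
import re

# Keep only tokens of three or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w\w+\b"

text = "An LDA model is fit on 20 newsgroups of text"
# CountVectorizer lowercases by default, so do the same here
tokens = re.findall(TOKEN_PATTERN, text.lower())
print(tokens)
```

Short tokens such as "an", "is", and "20" are dropped before any counting happens, which shrinks the vocabulary the topic model has to deal with.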

I hope this post is helpful to you. If you have any questions, please let me know.

