Topic modeling is a useful tool for grasping a general picture of a long text document. Compared with LSTM or RNN models, a topic model is more of an exploratory tool than a predictive one. In this post I will share measures of similarity among words, the concept of topic modeling, and its application in Python.
Suppose we have a text passage and a sentence. Based on the information in the passage, we need to decide whether the sentence is correct, i.e. whether it derives its meaning from the passage. This is a typical task of semantic similarity. One useful resource for semantic similarity analysis is WordNet.
WordNet is a semantic dictionary of words interlinked by semantic relations. It was mostly developed for English, but it is also available for quite a few other languages. It includes rich linguistic information such as part of speech, word senses, derivationally related forms, etc. WordNet organizes this information in a hierarchy. One measure built on this hierarchy is path similarity. It is calculated by finding the shortest path between two concepts: count how many steps it takes to get from one word to the other in the hierarchy. The similarity measure is inversely related to the path distance; in NLTK it is computed as 1 / (shortest path distance + 1), so concepts one step apart score 1/2, two steps apart score 1/3, and so on.
In Python, WordNet can be imported through NLTK.
import nltk
from nltk.corpus import wordnet as wn
# find appropriate meaning of words
bus = wn.synset('bus.n.01')
train = wn.synset('train.n.01')
# '.n' means finding the noun form
# '.01' means finding the first definition
Once we find the proper meaning of the word, we could find the path similarity as follows.
bus.path_similarity(train)
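As a quick illustration (word pairs chosen purely as an example), concepts that sit close together in the hierarchy should score higher than distant ones:

cat = wn.synset('cat.n.01')
# two kinds of vehicles are close in the hierarchy, so the score is relatively high
print(bus.path_similarity(train))
# a vehicle and an animal are far apart in the hierarchy, so the score is lower
print(bus.path_similarity(cat))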
A different measure of similarity uses distributional similarity and collocations. The intuition is that two words that frequently appear together are more likely to be semantically related. Take "Play at the zoo." and "Play at the amusement park.": both "amusement park" and "zoo" appear together with "play", so these two words could be closely related. We can define the range of context by words within a small window, by a specific syntactic relation to the target word, etc.
Once we have defined the context, we can measure the strength of association. Raw co-occurrence frequency alone is not a good criterion: "is", for example, is a very frequent word, so it has a high chance of co-occurring with almost any other word and should be given less weight. One measure that corrects for this is Pointwise Mutual Information (PMI):

PMI(w, c) = log2( P(w, c) / (P(w) P(c)) )

where w is the word of interest and c is the context word.
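To make the formula concrete, here is a minimal sketch that computes PMI from raw co-occurrence counts (the counts are made up purely for illustration):

import math

# hypothetical counts from a small corpus (made up for illustration)
total_pairs = 10000      # total number of (word, context) pairs observed
count_w = 200            # occurrences of the word of interest, e.g. "park"
count_c = 300            # occurrences of the context word, e.g. "play"
count_wc = 50            # co-occurrences of the two within the context window

p_w = count_w / total_pairs
p_c = count_c / total_pairs
p_wc = count_wc / total_pairs

# PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) )
pmi = math.log2(p_wc / (p_w * p_c))
print(pmi)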
In Python, NLTK also provides such a measure, as follows.
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
# learn the collocations from the corpus (text is a flat list of word tokens)
finder = BigramCollocationFinder.from_words(text)
# get the top 10 pairs using the PMI measures from bigram measures
finder.nbest(bigram_measures.pmi, 10)
# finder also provides other functions
# such as frequency filter
finder.apply_freq_filter(10)
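A minimal end-to-end sketch with a toy token list (made up purely for illustration) could look like this:

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# a tiny token list standing in for a real corpus
text = ['children', 'play', 'at', 'the', 'zoo', 'and',
        'children', 'play', 'at', 'the', 'amusement', 'park']

finder = BigramCollocationFinder.from_words(text)
# top 5 bigrams ranked by PMI
print(finder.nbest(bigram_measures.pmi, 5))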
From the discussion above we see that finding similarity between words and texts is important, and that WordNet is a useful tool for identifying semantic relationships. NLTK offers great assistance for such tasks.
Topic modeling is a coarse-level analysis of what is in a text collection. It is an exploratory tool for text mining that can give us a general understanding of the type of text, for example whether it is about sports, business, or politics.
What is known are the text corpus and the number of topics. What is not known are the actual topics and the topic distribution for each document.
There are two common approaches; here we focus on Latent Dirichlet Allocation (LDA).
LDA is a generative model used extensively for modeling large text corpora. We can use it as a first step to understand what the text is about.
The general steps of working with LDA in Python are as follows. Suppose we have a set of pre-processed (tokenized) text documents doc_set:
import gensim
from gensim import corpora, models
# create a dictionary, which is a mapping between IDs and words
dictionary = corpora.Dictionary(doc_set)
# Then create corpus which consists of all the words in doc_set
corpus = [dictionary.doc2bow(doc) for doc in doc_set]
# Create the document term matrix, then put in the LdaModel call
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 4, id2word = dictionary, passes = 50)
print(ldamodel.print_topics(num_topics = 4, num_words = 5))
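Once the model is trained, we could also inspect the topic distribution of individual documents. A minimal sketch, assuming the corpus, dictionary, and ldamodel objects from above (the token list is made up for illustration):

# topic distribution of the first document in the training corpus
print(ldamodel.get_document_topics(corpus[0]))

# for an unseen, already tokenized document, convert it with the same dictionary first
new_bow = dictionary.doc2bow(['new', 'unseen', 'tokenized', 'document'])
print(ldamodel.get_document_topics(new_bow))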
For the first part, we will write the functions doc_to_synsets and similarity_score, which are used by document_path_similarity to find the path similarity between two documents.

document_path_similarity: computes the symmetrical path similarity between two documents by finding the synsets in each document using doc_to_synsets, then computing similarities using similarity_score.

import nltk
import numpy as np
from nltk.corpus import wordnet as wn
import pandas as pd
# nltk.download('wordnet')
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None
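As a quick check of the mapping (the tags here are standard Penn Treebank tags produced by nltk.pos_tag):

# 'NN' (noun) maps to 'n', 'VBZ' (verb) maps to 'v', unknown tags map to None
print(convert_tag('NN'), convert_tag('VBZ'), convert_tag('XYZ'))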
convert_tag: converts the tag given by nltk.pos_tag to a tag used by wordnet.synsets. We will need this function in doc_to_synsets.

We now formulate the following functions:
doc_to_synsets: returns a list of synsets in a document. This function first tokenizes and part-of-speech tags the document using nltk.word_tokenize and nltk.pos_tag. Then it finds each token's corresponding synset using wn.synsets(token, wordnet_tag). The first synset match is used; if there is no match, that token is skipped.

similarity_score: returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Missing values are ignored.

def doc_to_synsets(doc):
    token_text = nltk.word_tokenize(doc)
    # tokenize with nltk, which is covered in the previous NLTK post
    ps_tag = nltk.pos_tag(token_text)
    # store the converted tags
    con_tg = []
    # iterate over the (word, tag) tuples in the list
    for index, tp in enumerate(ps_tag):
        word = tp[0]
        tg = tp[1]
        con_tg.insert(index, (word, convert_tag(tg)))
    # find all the synsets, given word w[0] and POS tag w[1]
    synset_list = [wn.synsets(w[0], w[1]) for w in con_tg]
    # get rid of None or empty lists
    synset_list = [x for x in synset_list if x]
    # store the first synset match
    valid_synset_01 = []
    for i in range(len(synset_list)):
        valid_synset_01.append(synset_list[i][0])
    return valid_synset_01
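A quick sanity check of doc_to_synsets (the exact synsets returned depend on the installed WordNet data, so any output here is illustrative only):

# returns one Synset per token that has a WordNet match; unmatched tokens are skipped
print(doc_to_synsets('I like cats'))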
def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2, where s1 and s2 are two lists of synsets.
    For each synset in s1, find the synset in s2 with the largest similarity value.
    Sum all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.
    Args:
        s1, s2: lists of synsets from doc_to_synsets
    Returns:
        normalized similarity score of s1 onto s2
    """
    # store the largest similarity score for each synset
    score_largest = []
    for i in range(len(s2)):
        score_within = []
        for k in range(len(s1)):
            score_within.append(s2[i].path_similarity(s1[k]))
        # if a list consists of None only, the max() function does not work
        if all(x is None for x in score_within):
            score_largest.insert(i, 0)
        else:
            score_largest.insert(i, max(list(filter(None, score_within))))
    return sum(score_largest) / len(score_largest)
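Note that this score is directional: scoring s1 against s2 generally differs from scoring s2 against s1, which is why document_path_similarity below averages the two directions. A small sketch to see this (example sentences made up for illustration):

s1 = doc_to_synsets('I like cats')
s2 = doc_to_synsets('I like dogs and horses')
# the two directions generally give different values
print(similarity_score(s1, s2), similarity_score(s2, s1))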
def document_path_similarity(doc1, doc2):
    """
    Finds the symmetrical similarity between doc1 and doc2
    Args:
        doc1, doc2: sentences, text, or documents
    Returns:
        a similarity measure, data type = float
    """
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2
Let us test if the functions work.
# test
d1 = 'I like cats'
d2 = 'I like dogs and dolphines'
document_path_similarity(d1, d2)
0.7333333333333334
It seems to be working well.
The data can be found here.
The file consists of three columns: Quality, D1, and D2. Quality is an indicator of whether the two documents D1 and D2 are paraphrases of each other.
# import data from google drive
# use the following code if want to connect colab to google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Applied_Text_Mining_in_Python/TopicModeling/paraphrases.csv')
paraphrases.head()
| | Quality | D1 | D2 |
|---|---|---|---|
| 0 | 1 | Ms Stewart, the chief executive, was not expec... | Ms Stewart, 61, its chief executive officer an... |
| 1 | 1 | After more than two years' detention under the... | After more than two years in detention by the ... |
| 2 | 1 | "It still remains to be seen whether the reven... | "It remains to be seen whether the revenue rec... |
| 3 | 0 | And it's going to be a wild ride," said Allan ... | Now the rest is just mechanical," said Allan H... |
| 4 | 1 | The cards are issued by Mexico's consulates to... | The card is issued by Mexico's consulates to i... |
Next, we are interested in getting two pieces of information:

most_similar_docs: the pair of documents in paraphrases which has the maximum similarity score.

label_accuracy: labels for the twenty pairs of documents, computed by scoring each pair with document_path_similarity. The classifier rule is: if the score is greater than 0.75, the label is paraphrase (1); otherwise the label is not paraphrase (0).

import operator
def most_similar_docs():
    similarity_score_i = []
    for i in range(len(paraphrases)):
        d1 = paraphrases.iloc[i]['D1']
        d2 = paraphrases.iloc[i]['D2']
        # store (D1, D2, similarity score) for each pair
        similarity_score_i.insert(i, (d1, d2, document_path_similarity(d1, d2)))
    # sort by the 3rd element, which is the similarity score
    similarity_score_i = sorted(similarity_score_i, key=operator.itemgetter(2), reverse=True)
    return similarity_score_i[0]
print('The most similar paraphrases and corresponding similarity score are: \n {}'.format(most_similar_docs()))
The most similar paraphrases and corresponding similarity score are:
('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.', '"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n', 0.9502923976608186)
def label_accuracy():
    from sklearn.metrics import accuracy_score
    paraphrases_score = paraphrases.copy()
    for i in range(len(paraphrases)):
        d1 = paraphrases.iloc[i]['D1']
        d2 = paraphrases.iloc[i]['D2']
        paraphrases_score.loc[i, ["similarity_score"]] = document_path_similarity(d1, d2)
    # label as paraphrase (1) when the similarity score exceeds 0.75
    paraphrases_score['LabelByScore'] = np.where(paraphrases_score['similarity_score'] > 0.75, 1, 0)
    return accuracy_score(paraphrases_score['Quality'], paraphrases_score['LabelByScore'])
print('The accuracy is: {}'.format(label_accuracy()))
The accuracy is: 0.7
Here we will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in newsgroup_data.
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer
# Load the list of documents
with open('/content/drive/MyDrive/Colab Notebooks/Applied_Text_Mining_in_Python/TopicModeling/newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)
# Use CountVectorizer to find three or more letter tokens, remove stop_words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english',
token_pattern=(r'\b\w\w\w+\b'))
# Fit and transform the input data
X = vect.fit_transform(newsgroup_data)
# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
len(vect.vocabulary_.items())
901
# Use the gensim.models.LdaModel constructor to estimate
# LDA model parameters on the corpus, and save to the variable `ldamodel`
from gensim import corpora, models
# from gensim.models import LdaModel
ldamodel = gensim.models.LdaModel(corpus, num_topics = 10, id2word = id_map, passes = 25, random_state = 34)
print(ldamodel.print_topics(num_topics = 10, num_words = 10))
[(0, '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know" + 0.020*"does" + 0.014*"info" + 0.012*"monitor" + 0.010*"looking" + 0.010*"don"'), (1, '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use" + 0.011*"using" + 0.011*"used" + 0.010*"power" + 0.010*"speed" + 0.010*"output"'), (2, '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard" + 0.028*"controller" + 0.027*"card" + 0.020*"rom" + 0.018*"floppy" + 0.017*"bus"'), (3, '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt" + 0.012*"faq" + 0.012*"probably" + 0.011*"know" + 0.011*"send" + 0.010*"months"'), (4, '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good" + 0.011*"new" + 0.011*"think" + 0.010*"year" + 0.010*"cars" + 0.010*"time"'), (5, '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play" + 0.012*"season" + 0.012*"players" + 0.012*"win" + 0.011*"hockey" + 0.011*"good"'), (6, '0.017*"information" + 0.014*"help" + 0.014*"medical" + 0.012*"new" + 0.012*"use" + 0.012*"000" + 0.012*"research" + 0.011*"university" + 0.010*"number" + 0.010*"program"'), (7, '0.022*"don" + 0.021*"people" + 0.018*"think" + 0.017*"just" + 0.012*"say" + 0.011*"know" + 0.011*"does" + 0.011*"good" + 0.010*"god" + 0.009*"way"'), (8, '0.034*"use" + 0.023*"apple" + 0.020*"power" + 0.016*"time" + 0.015*"data" + 0.015*"software" + 0.012*"pin" + 0.012*"memory" + 0.012*"simms" + 0.012*"port"'), (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')]
From the output above we can see the words forming the ten topics.
A next question could be: what if we add a new document?
new_doc = ["\n\n Today is a long day. \
I have not eaten dinner since 8 pm. \
There are too many tasks to do. \n\n\
Bolin\n-- "]
X_new = vect.transform(new_doc)
corpus_new = gensim.matutils.Sparse2Corpus(X_new, documents_columns=False)
# take a look at what the topic distribution looks like
list(ldamodel.get_document_topics(corpus_new))[:5]
[[(6, 0.9470456)],
[(3, 0.48242074), (5, 0.40150425), (6, 0.096070684)],
[(4, 0.9139199), (8, 0.057501744)],
[(0, 0.030418986), (5, 0.10291002), (6, 0.8162316), (8, 0.04535334)],
[(0, 0.020000072),
(1, 0.8199883),
(2, 0.020000583),
(3, 0.020001208),
(4, 0.02000103),
(5, 0.020001303),
(6, 0.020001443),
(7, 0.020002373),
(8, 0.02000331),
(9, 0.020000374)]]
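To interpret these numbers, we could look at the top words of the dominant topic directly. A minimal sketch (topic index 1 chosen purely as an example):

# top 10 words of a single topic, e.g. topic 1
print(ldamodel.show_topic(1, topn=10))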
So far I have shared some basic concepts of similarity measures and the application of topic modeling, along with several points I noticed while learning.
I hope this post could be helpful for you. If there is any question please let me know.
Cheers!