In this post I will share the basic NLP tasks and how to handle them using the powerful NLTK library in Python.
Natural language processing covers any computation or manipulation of natural language that helps us gain insight into what words mean and how sentences are constructed.
NLP tasks may include:

- counting words and word frequencies
- normalization and stemming
- lemmatization
- tokenizing text into words and sentences
- part-of-speech (POS) tagging
The Natural Language Toolkit (NLTK) is an open-source library in Python that provides support for most NLP tasks, along with access to numerous text corpora. In this post I will share some basic uses of NLTK.
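If NLTK is not yet installed in your environment, it can usually be added with pip first; in a notebook such as Colab the line below should work (a general note, not part of the original walkthrough).
# install NLTK from a notebook cell
!pip install nltk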
import nltk
# if any module is not available we can use nltk.download() to install it.
# nltk.download()
# we can also use the following command to directly download a specific module
nltk.download("book")
from nltk.book import *
[nltk_data] Downloading collection 'book'
[nltk_data] |
...
[nltk_data] Done downloading collection book
text1
<Text: Moby Dick by Herman Melville 1851>
# if we want to look at one sentence from each of the nine texts
sents()
sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .
sent1
['Call', 'me', 'Ishmael', '.']
# count the words of sentence 1
len(sent1)
4
# if we want to count the words of the entire text 1
len(text1)
260819
# set(text1)
# find first 10 unique words of text 1
list(set(text1))[:10]
# in Python 2 output these tokens carry a u prefix, marking Unicode strings; in Python 3 all strings are Unicode
['vinegar', 'narrated', 'uninterpenetratingly', 'August', 'affirmative', 'subordinates', 'overbalance', 'Snarles', 'fulfiller', 'select']
dist = FreqDist(text1)
# number of unique words in Moby Dick. Same as using len(set(text1))
len(dist)
19317
print(dist)
<FreqDist with 19317 samples and 260819 outcomes>
# dist stores the individual frequencies of each word
dist.most_common(10)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982)]
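Notice that punctuation and stop words dominate the counts. If we only wanted content words, NLTK also ships an English stop-word list we could filter on; here is a rough sketch (the stopwords corpus may require a separate nltk.download('stopwords')):
from nltk.corpus import stopwords
# keep alphabetic tokens that are not English stop words, then recount
stop = set(stopwords.words('english'))
content_words = [w for w in text1 if w.isalpha() and w.lower() not in stop]
FreqDist(content_words).most_common(10)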
# this gives the actual words
vocab1 = dist.keys()
# find how many times a particular word occurs
dist[u'four']
74
dist[u'68']
1
# find words that are longer than 5 characters and occur more than 100 times
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
freqwords
['called', 'through', 'almost', 'whales', 'thought', 'before', 'against', 'towards', 'things', 'nothing', 'without', 'should', 'little', 'seemed', 'though', 'captain', 'himself', 'moment', 'CHAPTER', 'something', 'Captain', 'between', 'whaling', 'another', 'Queequeg', 'Pequod', 'Starbuck']
Normalization transforms different surface forms of a word so that they are recognized and counted as the same word, even though they look different. Stemming, shown below, is one common way to do this.
# Different forms of the same "word"
input1 = "List liasted Lists listing listings"
input1.rstrip()
'List listed Lists listing listings'
words1 = input1.lower().split(' ')
words1
['list', 'listed', 'lists', 'listing', 'listings']
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]
['list', 'list', 'list', 'list', 'list']
Lemmatization is similar to stemming. However, it reduces words to base forms (lemmas) that are actually valid words.
# NLTK includes the Universal Declaration of Human Rights as one of its corpora
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]
['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'rights', 'of']
[porter.stem(t) for t in udhr[:20]]
['univers', 'declar', 'of', 'human', 'right', 'preambl', 'wherea', 'recognit', 'of', 'the', 'inher', 'digniti', 'and', 'of', 'the', 'equal', 'and', 'inalien', 'right', 'of']
We can see from the results above that 'univers' and 'declar' are not real words. Therefore, we may prefer lemmatization, which is essentially stemming where the resulting forms are all valid words.
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]
['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'right', 'of']
At first glance it may look like nothing has changed, but it has. For example, 'rights' is lemmatized to 'right'.
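By default WordNetLemmatizer treats every token as a noun. If you know the part of speech you can pass it as the second argument; a minimal sketch (the expected outputs in the comments assume WordNet's standard behaviour):
# noun reading (the default) leaves 'running' unchanged
WNlemma.lemmatize('running')
# verb reading reduces it to the base form 'run'
WNlemma.lemmatize('running', 'v')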
Recall splitting a sentence into words/tokens with Python's basic split() function.
text11 = "Today's weather isn't nice."
text11.split(' ')
["Today's", 'weather', "isn't", 'nice.']
We can see that this is not a good approach. For example, it keeps the full stop attached to the last word as a single token.
NLTK can give us a better tokenization.
# NLTK has an in-built tokenizer
text11 = "Today's weather isn't nice."
nltk.word_tokenize(text11)
['Today', "'s", 'weather', 'is', "n't", 'nice', '.']
We get a nicely split sentence. It not only separates the full stop, but also splits "isn't" into 'is' and "n't". This can be useful if we want to detect negation, as in the sketch below.
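As a toy illustration of that idea (my own sketch, not an NLTK feature), we could flag a sentence as negated whenever tokenization produces an "n't" or 'not' token:
# True for "Today's weather isn't nice." because of the "n't" token
tokens = nltk.word_tokenize(text11)
any(t.lower() in ("n't", "not") for t in tokens)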
How would you split sentences from a long text string?
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
# NLTK has an in-built sentence splitter too
sentences = nltk.sent_tokenize(text12)
len(sentences)
4
# the dots in 'U.S.' and '2.99' are not treated as sentence-ending full stops, great!
sentences
['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!']
NLTK has an in-built POS (part-of-speech) tagger. Some common tags:
Tag | Word class | Tag | Word class | Tag | Word class |
---|---|---|---|---|---|
CC | Conjunction | JJ | Adjective | PRP | Pronoun |
CD | Cardinal | MD | Modal | RB | Adverb |
DT | Determiner | NN | Noun | SYM | Symbol |
IN | Preposition | POS | Possessive | VB | Verb |
# we can look up a tag's meaning with nltk.help.upenn_tagset
nltk.help.upenn_tagset('MD')
MD: modal auxiliary
can cannot could couldn't dare may might must need ought shall should
shouldn't will would
text11
"Today's weather isn't nice."
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)
[('Today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('nice', 'JJ'), ('.', '.')]
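Once the tokens are tagged they are easy to filter. For example, a small sketch that keeps only the noun tokens (tags starting with 'NN') from the sentence above:
# expected to return ['Today', 'weather'] given the tags shown above
[word for word, tag in nltk.pos_tag(text13) if tag.startswith('NN')]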
Now let us practice using NLTK on Moby Dick by Herman Melville (1851).
# import data from Google Drive
# use the following code if you want to connect Colab to Google Drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
with open('/content/drive/MyDrive/Bolin_DSPost/moby.txt', 'r') as f:
    moby_raw = f.read()
# import nltk
# nltk.download("book")
# from nltk.book import *
import pandas as pd
import numpy as np
# If you would like to work with the raw text you can use 'moby_raw'
# with open('moby.txt', 'r') as f:
# moby_raw = f.read()
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)
moby_tokens[:20]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar']
moby_raw[:20]
'[Moby Dick by Herman'
text1[:20]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar']
len(text1)
254989
# or use built-in tokenization
len(nltk.word_tokenize(moby_raw))
254989
len(set(text1))
20755
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_verb = [lemmatizer.lemmatize(w, 'v') for w in text1]
lemmatized_verb[:15]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late']
len(set(lemmatized_verb))
16900
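# lexical diversity: the fraction of tokens that are unique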
len(set(moby_tokens))/ len(moby_tokens)
0.08139566804842562
dist = FreqDist(moby_tokens)
dist[u'late'] + dist[u'Late']
30
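A more general way to count case-insensitively (my own sketch, not from the original analysis) is to lowercase every token before building the frequency distribution:
# after lowercasing, a single lookup covers both 'late' and 'Late'
dist_lower = FreqDist(w.lower() for w in moby_tokens)
dist_lower['late']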
import operator
# dist.items() converts the dictionary to a list of (word, frequency) tuples
# key=operator.itemgetter(1) sorts by the second element of each tuple, i.e. the frequency
sorted_dist = sorted(dist.items(), key=operator.itemgetter(1), reverse = True)[:20]
sorted_dist
[(',', 19204), ('the', 13715), ('.', 7308), ('of', 6513), ('and', 6010), ('a', 4545), ('to', 4515), (';', 4173), ('in', 3908), ('that', 2978), ('his', 2459), ('it', 2196), ('I', 2097), ('!', 1767), ('is', 1722), ('--', 1713), ('with', 1659), ('he', 1658), ('was', 1639), ('as', 1620)]
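For comparison, FreqDist has a built-in shortcut that should produce the same top-20 list without operator:
dist.most_common(20)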
vocab1 = dist.keys()
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 200]
sorted(freqwords)
['Captain', 'Queequeg', 'before', 'himself', 'little', 'seemed', 'though', 'through', 'whales']
length = max([len(w) for w in moby_tokens])
longest_word =[w for w in moby_tokens if len(w) == length]
# use ''.join() to make the result a string
f_tuple = ( ''.join(longest_word),length)
f_tuple
("twelve-o'clock-at-night", 23)
# make an empty dictionary to store the words and frequency
fq_dic = {}
for w in vocab1:
    # use isalpha() to check if a token is a word and not punctuation
    if w.isalpha() and dist[w] > 2000:
        fq_dic[w] = dist[w]
result = sorted(fq_dic.items(), key = operator.itemgetter(1), reverse=True)
# swap the tuple order so that frequency comes first
result = [(f,w) for (w,f) in result]
result
[(13715, 'the'), (6513, 'of'), (6010, 'and'), (4545, 'a'), (4515, 'to'), (3908, 'in'), (2978, 'that'), (2459, 'his'), (2196, 'it'), (2097, 'I')]
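The same result can be written more compactly as a list comprehension over dist.items(), a sketch under the same assumptions as above:
# filter alphabetic tokens with frequency above 2000, sort by frequency descending
result = sorted([(freq, word) for word, freq in dist.items() if word.isalpha() and freq > 2000], reverse=True)
result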
sentences = nltk.sent_tokenize(moby_raw)
# average number of tokens per sentence
len(moby_tokens)/len(sentences)
from collections import Counter
pos_token = nltk.pos_tag(moby_tokens)
pos_token[:20]
[('[', 'JJ'), ('Moby', 'NNP'), ('Dick', 'NNP'), ('by', 'IN'), ('Herman', 'NNP'), ('Melville', 'NNP'), ('1851', 'CD'), (']', 'NNP'), ('ETYMOLOGY', 'NNP'), ('.', '.'), ('(', '('), ('Supplied', 'VBN'), ('by', 'IN'), ('a', 'DT'), ('Late', 'JJ'), ('Consumptive', 'NNP'), ('Usher', 'NNP'), ('to', 'TO'), ('a', 'DT'), ('Grammar', 'NNP')]
# row[1] selects the tag (the second element) of each (word, tag) pair
Counter((row[1] for row in pos_token)).most_common(5)
[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]
While practicing with NLTK, I found it important to identify the type of each variable: is it a dictionary, a series, a list, etc.? As a beginner moving from R to Python, I find it challenging to pick the appropriate command for each type of variable. For example, a command that retrieves the top n elements of a list may not be applicable to a dictionary. I need to spend a lot of time reading the documentation, but I do enjoy the process.
Python is powerful for handling NLP tasks. To do them well, we need a good understanding of both NLTK and other libraries such as pandas.
I hope this post is helpful to you. If you have any questions, please let me know. Cheers!