Bolin Wu

NLP 2: NLTK Basics in Python

2021-07-10 · 13 min read
Natural Language Processing · Python

In this post I will go over basic NLP tasks and show how to handle them with the powerful NLTK library in Python.

Basic natural language processing

Basic NLP covers any computation or manipulation of natural language that gives insight into what words mean and how sentences are constructed.

NLP broad spectrum

NLP tasks may include:

  • Counting words, counting frequency of words.
  • Finding sentence boundaries.
  • Part of speech tagging.
  • Parsing the sentence structure.
  • Identifying semantic roles.
  • Identifying entities in a sentence.
  • Finding which pronoun refers to which entity.

Natural Language Toolkit

The Natural Language Toolkit (NLTK) is an open-source Python library that provides support for most NLP tasks, as well as access to numerous text corpora. In this post I will share some basic uses of NLTK.

import nltk

# if any module is not available, nltk.download() opens an installer to get it
# nltk.download()

# we can also download a specific collection directly
nltk.download('book')

from nltk.book import *

[nltk_data] Downloading collection 'book'
[nltk_data] Done downloading collection book
text1
<Text: Moby Dick by Herman Melville 1851>

# if we want to look at one sentence from each of the nine texts
sents()
sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .
sent1
['Call', 'me', 'Ishmael', '.']
# count the words of sentence 1
len(sent1)
# count the words of the entire text 1
# len(text1)
# unique words of the entire text 1
# set(text1)
# find the first 10 unique words of text 1
# (a u prefix in the output would stand for a UTF-8 encoded string)
list(set(text1))[:10]
['vinegar', 'narrated', 'uninterpenetratingly', 'August', 'affirmative', 'subordinates', 'overbalance', 'Snarles', 'fulfiller', 'select']
dist = FreqDist(text1)
# the frequency distribution over text1; len(dist) is the same as len(set(text1))
dist
<FreqDist with 19317 samples and 260819 outcomes>
# dist stores the individual frequencies of each word
dist.most_common(10)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982)]
# this gives the actual words
vocab1 = dist.keys()
# find how many times a particular word occurs, e.g. 'whale'
dist['whale']
# find words longer than 5 characters that occur more than 100 times
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
freqwords
['called', 'through', 'almost', 'whales', 'thought', 'before', 'against', 'towards', 'things', 'nothing', 'without', 'should', 'little', 'seemed', 'though', 'captain', 'himself', 'moment', 'CHAPTER', 'something', 'Captain', 'between', 'whaling', 'another', 'Queequeg', 'Pequod', 'Starbuck']

Normalization and stemming

Normalization transforms different forms of a word so that they appear, and are counted, the same way, even though they look very different on the surface.

# Different forms of the same "word"
input1 = "List liasted Lists listing listings"
'List liasted Lists listing listings'
words1 = input1.lower().split(' ')
['list', 'liasted', 'lists', 'listing', 'listings']
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]
['list', 'liast', 'list', 'list', 'list']
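Under the hood, Porter stemming is rule-based suffix stripping. As a rough sketch of the idea only (the real Porter algorithm applies many more context-sensitive rules), a toy stemmer might look like this:

```python
# Toy suffix-stripping stemmer -- an illustration of the idea behind
# Porter stemming, NOT the actual algorithm.
def toy_stem(word):
    # try longer suffixes first so 'listings' loses 'ings', not just 's'
    for suffix in ('ings', 'ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ['list', 'lists', 'listing', 'listings']])
# ['list', 'list', 'list', 'list']
```

All four forms collapse to the same stem, which is exactly what makes stemming useful for counting.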

Lemmatization is similar to stemming; however, it transforms words into lemmas, base forms that are themselves meaningful words.

# NLTK includes the Universal Declaration of Human Rights among its corpora
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]
['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'rights', 'of']
[porter.stem(t) for t in udhr[:20]]
['univers', 'declar', 'of', 'human', 'right', 'preambl', 'wherea', 'recognit', 'of', 'the', 'inher', 'digniti', 'and', 'of', 'the', 'equal', 'and', 'inalien', 'right', 'of']

We can see from the results above that 'univers' and 'declar' are not real words. Therefore we may want lemmatization instead, that is, stemming where the resulting stems are all valid words.

WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]
['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'right', 'of']

At first glance you may feel nothing has changed, but it has: for example, 'rights' is lemmatized to 'right'.
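Conceptually, a lemmatizer looks words up in a vocabulary (WordNet, in NLTK's case) instead of stripping suffixes blindly. A minimal sketch with a made-up mini-lexicon (the entries below are illustrative stand-ins for WordNet):

```python
# Hypothetical mini-lexicon standing in for WordNet
LEMMA_LOOKUP = {'rights': 'right', 'children': 'child', 'geese': 'goose'}

def toy_lemmatize(word):
    # unknown words fall back to their surface form, which is also how
    # WordNetLemmatizer behaves for out-of-vocabulary tokens
    return LEMMA_LOOKUP.get(word.lower(), word)

print(toy_lemmatize('rights'))     # 'right'
print(toy_lemmatize('Universal'))  # unchanged: 'Universal'
```

Because the output always comes from a lookup (or is the word itself), every result is a valid word, unlike the invented stems above.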


Tokenization

Recall splitting a sentence into words/tokens with the basic split() function.

text11 = "Today's weather isn't nice."
text11.split(' ')
["Today's", 'weather', "isn't", 'nice.']

We can see that this is not a good approach: for example, it keeps the full stop attached to the last word as one token.
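A plain regular expression from the standard library already does better than str.split by pulling punctuation off the words. This is a minimal sketch (note it keeps "isn't" whole, which nltk.word_tokenize below handles more cleverly):

```python
import re

text11 = "Today's weather isn't nice."
# words may contain an internal apostrophe; anything else that is
# neither a word character nor whitespace becomes its own token
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text11)
print(tokens)  # ["Today's", 'weather', "isn't", 'nice', '.']
```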

NLTK can help with giving a better splitting.

# NLTK has an in-built tokenizer
text11 = "Today's weather isn't nice."
nltk.word_tokenize(text11)
['Today', "'s", 'weather', 'is', "n't", 'nice', '.']

We get a nicely split sentence. It not only splits off the full stop separately, but also splits "isn't" into 'is' and "n't". This can be useful if we want to detect negation.

Sentence Splitting

How would you split sentences from a long text string?

text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
# NLTK has an in-built sentence splitter too
sentences = nltk.sent_tokenize(text12)
# the dots in 'U.S.' and '2.99' are not treated as full stops, great!
sentences
['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.', 'Is this the third sentence?', 'Yes, it is!']
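To appreciate what sent_tokenize does, compare a naive regex that cuts after every '.', '?' or '!' followed by whitespace. It wrongly breaks the sentence at 'U.S.' (a sketch for comparison only):

```python
import re

text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
# split wherever a sentence-final punctuation mark precedes whitespace
naive = re.split(r'(?<=[.!?])\s+', text12)
print(len(naive))  # 5 pieces instead of 4
print(naive[1])    # 'A gallon of milk in the U.S.' -- cut off too early
```

The naive rule cannot tell an abbreviation's period from a sentence boundary, which is exactly the case NLTK's splitter handles.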

Part-of-speech (POS) Tagging

NLTK has an in-built POS tagging.

Tag  Word class     Tag  Word class    Tag  Word class
CC   Conjunction    JJ   Adjective     PRP  Pronoun
CD   Cardinal       MD   Modal         RB   Adverb
DT   Determiner     NN   Noun          SYM  Symbol
IN   Preposition    POS  Possessive    VB   Verb
# we can get help on a tag from nltk.help.upenn_tagset
nltk.help.upenn_tagset('MD')
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
text11 = "Today's weather isn't nice."
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)
[('Today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'), ("n't", 'RB'), ('nice', 'JJ'), ('.', '.')]

NLTK practice with a real text file

Now let us practice the uses of NLTK on Moby Dick by Herman Melville (1851).

Analyzing Moby Dick

# import data from google drive
# use the following code to connect colab to google drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

with open('/content/drive/MyDrive/Bolin_DSPost/moby.txt', 'r') as f:
    moby_raw =
# import nltk
# from nltk.book import *
import pandas as pd
import numpy as np

# If you would like to work with the raw text you can use 'moby_raw'
# with open('moby.txt', 'r') as f:
#     moby_raw =
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)
moby_tokens[:20]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar']
moby_raw[:20]
'[Moby Dick by Herman'
text1[:20]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late', 'Consumptive', 'Usher', 'to', 'a', 'Grammar']

How many tokens, including words and punctuation symbols, are in text1?

len(moby_tokens)
# or tokenize in place: len(nltk.word_tokenize(moby_raw))

How many unique tokens does text1 have?

len(set(moby_tokens))
If we use lemmatization for the verbs, how many unique tokens does text1 have?

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_verb = [lemmatizer.lemmatize(w, 'v') for w in text1]
len(set(lemmatized_verb))

lemmatized_verb[:15]
['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(', 'Supplied', 'by', 'a', 'Late']

What is the lexical diversity of the text?

len(set(moby_tokens))/ len(moby_tokens)
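Lexical diversity is just the ratio of unique tokens to total tokens, so it needs nothing beyond built-ins. A quick sanity check on a toy token list:

```python
def lexical_diversity(tokens):
    # distinct tokens divided by total tokens; values nearer 1.0 mean
    # the text repeats itself less
    return len(set(tokens)) / len(tokens)

toy = ['the', 'whale', 'and', 'the', 'sea']
print(lexical_diversity(toy))  # 4 unique of 5 -> 0.8
```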

How many tokens are 'late' or 'Late'?

dist = FreqDist(moby_tokens)
dist[u'late'] + dist[u'Late']

What are the top 20 most frequently occurring unique tokens in the text?

import operator
# dist.items() converts the dict to a list of tuples
# key=operator.itemgetter(n) indicates ordering by nth element

sorted_dist = sorted(dist.items(), key=operator.itemgetter(1), reverse=True)[:20]
sorted_dist

[(',', 19204), ('the', 13715), ('.', 7308), ('of', 6513), ('and', 6010), ('a', 4545), ('to', 4515), (';', 4173), ('in', 3908), ('that', 2978), ('his', 2459), ('it', 2196), ('I', 2097), ('!', 1767), ('is', 1722), ('--', 1713), ('with', 1659), ('he', 1658), ('was', 1639), ('as', 1620)]
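As a side note, the manual sorted(..., key=operator.itemgetter(1)) step can be skipped: NLTK's FreqDist provides a most_common() method, inherited from collections.Counter, so dist.most_common(20) would return the same pairs. A small sketch with a toy token list:

```python
from collections import Counter

toy_tokens = ['the', 'whale', ',', 'the', 'sea', ',', 'the']
# most_common(n) returns (token, count) pairs sorted by descending count,
# replacing the sorted()/operator.itemgetter idiom above
print(Counter(toy_tokens).most_common(2))  # [('the', 3), (',', 2)]
```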

Which tokens have a length greater than 5 and a frequency of more than 200?

vocab1 = dist.keys()
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 200]
sorted(freqwords)
['Captain', 'Queequeg', 'before', 'himself', 'little', 'seemed', 'though', 'through', 'whales']

Find the longest word in the text and its length

length = max([len(w) for w in moby_tokens]) 
longest_word =[w for w in moby_tokens if len(w) == length]
# use ''.join() to make the result a string
f_tuple = (''.join(longest_word), length)
f_tuple
("twelve-o'clock-at-night", 23)
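The same lookup can be done in one pass with max and a key function. Note this returns a single longest token, whereas the comprehension above keeps ties (toy token list below for illustration):

```python
# max with key=len avoids computing the length list separately
toy_tokens = ['whale', 'Queequeg', "twelve-o'clock-at-night", 'sea']
longest = max(toy_tokens, key=len)
print(longest, len(longest))
```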

What unique words have a frequency of more than 2000? What is their frequency?

# make an empty dictionary to store the words and frequency
fq_dic = {}
for w in vocab1:
    # use isalpha() to check that a token is a word, not punctuation
    if w.isalpha() and dist[w] > 2000:
        fq_dic[w] = dist[w]
result = sorted(fq_dic.items(), key=operator.itemgetter(1), reverse=True)

# switch each tuple to (frequency, word) order
result = [(f, w) for (w, f) in result]
result
[(13715, 'the'), (6513, 'of'), (6010, 'and'), (4545, 'a'), (4515, 'to'), (3908, 'in'), (2978, 'that'), (2459, 'his'), (2196, 'it'), (2097, 'I')]

Find the average number of tokens per sentence

sentences = nltk.sent_tokenize(moby_raw)
len(moby_tokens) / len(sentences)

What are the 5 most frequent parts of speech in the text? What is their frequency?

from collections import Counter
pos_token = nltk.pos_tag(moby_tokens)
[('[', 'JJ'), ('Moby', 'NNP'), ('Dick', 'NNP'), ('by', 'IN'), ('Herman', 'NNP'), ('Melville', 'NNP'), ('1851', 'CD'), (']', 'NNP'), ('ETYMOLOGY', 'NNP'), ('.', '.'), ('(', '('), ('Supplied', 'VBN'), ('by', 'IN'), ('a', 'DT'), ('Late', 'JJ'), ('Consumptive', 'NNP'), ('Usher', 'NNP'), ('to', 'TO'), ('a', 'DT'), ('Grammar', 'NNP')]
# row[1] picks the tag (the second element) from each (token, tag) pair
Counter((row[1] for row in pos_token)).most_common(5)
[('NN', 32730), ('IN', 28657), ('DT', 25867), (',', 19204), ('JJ', 17620)]


While practicing NLTK, I found it important to identify the type of each variable. Is it a dictionary, a series, a list, etc.? As a beginner transitioning from R to Python, I find that choosing the appropriate command for each variable type is challenging. For example, the command that retrieves the top n elements of a list may not be applicable to a dictionary. I need to spend a lot of time reading the documentation, but I do enjoy the process.

Python is powerful for handling NLP tasks. To handle them well, we need a good understanding of both NLTK and other libraries like pandas.

I hope this post is helpful to you; if there is any question, please let me know. Cheers!

Prudence is a fountain of life to the prudent.