Bolin Wu

NLP 1:Text Mining Application in Python (RegEx)

NLP 1:Text Mining Application in Python (RegEx)
2021-06-15 · 21 min read
Natural Language Process Python

In this exercise, we'll be working with messy medical data and using RegEx in Python to extract dates of different formats. The goal of this exercise is to correctly identify all of the different date variants encoded in this dataset and to properly standardize and sort the dates.

Introduction

Each line of the dates.txt file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

Here is a list of some of the possible variants in this dataset:

  1. 04/20/2009; 04/20/09; 4/20/09; 4/3/09
  2. Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
  3. 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
  4. Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
  5. Feb 2009; Sep 2009; Oct 2010
  6. 6/2008; 12/2009

Once we extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:

  • Assume all dates in xx/xx/xx format are mm/dd/yy
  • Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
  • If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
  • If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
  • There could be potential typos as this is a raw, real-life derived dataset.

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example if the original series was this:

0    1999
1    2010
2    1978
3    2015
4    1985

We should return this:

0    2
1    4
2    0
3    1
4    3

As a result, we could know the chronological order of the medical records in the text file.

Useful cheetsheet

Meta-characters: Character matches

  • .: wildcard, matches a single character
  • ^: start of a string
  • $: end of a string
  • []: matches one of the set of characters within []
  • [a-z]: matches one of the range of characters a, b, ..., z
  • a|b: matches either a or b, where a and b are strings
  • (): Scoping for operators.
  • backslash: escape character for special characters (\t,\n,\b)
  • \b: matches word boundary
  • \d: any digit, equivalent to [^0-9]
  • \D: any non-digit, equivalent to [^0-9]
  • \s: any whitespace, equivalent to [\t\n\r\f\v]
  • \S: any non-whitespace, equivalent to [^ \t\n\r\f\v]
  • \w: Alphanumeric character, euivalent to [a-zA-Z0-9]
  • \W: Non-alphanumeric, equivalent to [^a-zA-Z0-9]

Meta-characters: Repetitions

  • *: matches zero or more occurences
  • +: matches one or more occurrences
  • ?: matches zero or one occurences
  • {n}: exactly n repetitions, 0 \leq n
  • {,n}: at most n repetitions
  • {m,n}: at least m and at most n repetitions

Import data

I am doing this exercise on Google Colab. We can easily import the data from Google Drive to Colab by using the following code. More detailed instruction can be found here.

Data is available here.

When reading the text, I removed the paranthesis in order to facilitate text cleaning.

# use the following code if you want to connect colab to google drive
from google.colab import drive

drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import pandas as pd
import re as re

doc = []
with open('/content/drive/MyDrive/Bolin_DSPost/dates.txt') as file:
    for line in file:
      # remove the paranthesis in each record
        line = re.sub('[()]', '', line)
        doc.append(line)

df = pd.Series(doc)
df.head(10)
0           03/25/93 Total time of visit in minutes:\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level Nu...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9               4/10/71Score-1Audit C Score Current:\n
dtype: object

Extract date and text cleaning

Now we will proceed to cleaning the data. Instead of cleaning all the 6 types mentioned above at once, we will clean one type at a step. It will make debug more easily.

Some useful resources of dealing with free text are listed as follows:

Regular Expressions
Regular expressions documentation in Python 3

Tips and tricks of the trade for cleaning text in Python

https://chrisalbon.com/python/basics/cleaning_text/

https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

http://ieva.rocks/2016/08/07/cleaning-text-for-nlp/

https://chrisalbon.com/python/cleaning_text.html

1. mm/dd/yy format

# 1 
df1 = df.str.extract(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})').dropna(how = 'any')
df1
#done
month day year
0 03 25 93
1 6 18 85
2 7 8 71
3 9 27 75
4 2 6 96
... ... ... ...
120 12 22 98
121 10 02 96
122 11 05 90
123 5 04 77
124 2 27 96

125 rows × 3 columns

2. Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009 format

# 2
df2 = df.str.extract(r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.-]+(?P<day>\d{2,4})[ ,.-]+(?P<year>\d{2,4})').dropna(how = 'any')#.iloc[125]
df2
# done
month day year
194 April 11 1990
195 May 30 2001
196 Feb 18 1994
197 February 18 1981
198 October 11 2013
199 Jan 24 1986
200 July 26 1978
201 December 23 1999
202 May 15 1989
203 September 06 1995
204 Mar 10 1976
205 Jan 27 1983
206 October 23 1990
207 August 12 2004
208 September 01 2012
209 July 25 1983
210 August 11 1989
211 April 17 1992
212 July 24 1999
213 July 11 1997
214 Sep 10 1974
215 August 14 1981
216 Nov 11 1988
217 June 13 2011
218 May 14 1989
219 Dec 14 1975
220 June 25 2012
221 Oct 18 1980
222 May 15 1998
223 October 14 1974
224 July 25 1998
225 June 15 1972
226 January 07 1991
227 September 15 2011

Change month name to a number

# # Change month name to a number
def month_string_to_number(string):
    m = {
        'jan': 1,
        'feb': 2,
        'mar': 3,
        'apr':4,
         'may':5,
         'jun':6,
         'jul':7,
         'aug':8,
         'sep':9,
         'oct':10,
         'nov':11,
         'dec':12
        }
    s = string.strip()[:3].lower()

    try:
        out = m[s]
        return out
    except:
        raise ValueError('Not a month')

# test if the function works
month_string_to_number('October')

10

Cool, it works!

for index, row in df2.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df2
month day year
194 4 11 1990
195 5 30 2001
196 2 18 1994
197 2 18 1981
198 10 11 2013
199 1 24 1986
200 7 26 1978
201 12 23 1999
202 5 15 1989
203 9 06 1995
204 3 10 1976
205 1 27 1983
206 10 23 1990
207 8 12 2004
208 9 01 2012
209 7 25 1983
210 8 11 1989
211 4 17 1992
212 7 24 1999
213 7 11 1997
214 9 10 1974
215 8 14 1981
216 11 11 1988
217 6 13 2011
218 5 14 1989
219 12 14 1975
220 6 25 2012
221 10 18 1980
222 5 15 1998
223 10 14 1974
224 7 25 1998
225 6 15 1972
226 1 07 1991
227 9 15 2011

3. 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009 format

# 3
df3 = df.str.extract(r'(?P<day>\d{1,2})[ ,.-](?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.-](?P<year>\d{2,4})').dropna(how = 'any')#.iloc[125]
#done
df3
day month year
125 24 Jan 2001
126 10 Sep 2004
127 26 May 1982
128 28 June 2002
129 06 May 1972
... ... ... ...
189 21 Oct 1977
190 10 Aug 2000
191 30 Nov 1972
192 06 May 1993
193 18 Jan 1995

69 rows × 3 columns

# change month name to number
for index, row in df3.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df3

day month year
125 24 1 2001
126 10 9 2004
127 26 5 1982
128 28 6 2002
129 06 5 1972
... ... ... ...
189 21 10 1977
190 10 8 2000
191 30 11 1972
192 06 5 1993
193 18 1 1995

69 rows × 3 columns

4. Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009 format

# 4
df.str.extract(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.](\d{2,4})(?:th|st|rd), (\d{2,4})').dropna(how = 'any')
0 1 2

No record is written in this format.

5. Feb 2009; Sep 2009; Oct 2010 format

Many records with date in this format have different noises. Therefore for this format I will first make a general regex expreesion to extract the majority date records in this formate. Then I will search for the records that are not extracted, find the reason and modify the regex expression.

# 5
df5_1 = df.str.extract(r'\D(?P<day> )(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),?[ ](?P<year>\d{4})').dropna(how = 'any')#.iloc[125]
df5_1
day month year
228 September 1985
229 June 2011
230 May 1986
231 May 2016
232 July 1977
... ... ... ...
337 Dec 2007
338 Apr 1998
339 March 2005
340 May 1980
342 March 1976

98 rows × 3 columns

for index, row in df5_1.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df5_1
day month year
228 9 1985
229 6 2011
230 5 1986
231 5 2016
232 7 1977
... ... ... ...
337 12 2007
338 4 1998
339 3 2005
340 5 1980
342 3 1976

98 rows × 3 columns

Find missing indices:

# find missing indices
missing_indices = [index 
                   for index in range(df5_1.index[0], df5_1.index[len(df5_1.index)-1])
                   if index not in df5_1.index]

missing_indices
[235,
 245,
 247,
 251,
 256,
 266,
 270,
 276,
 281,
 283,
 285,
 294,
 299,
 306,
 321,
 333,
 341]

We can see that some records are missing, e.g. 341. Let us see how this record looks like.

df.iloc[341]
'.Nov 2007- Evaluation and treatment at CH ED/APS for acute alcohol and medication OD, with history of depression relapse and alcohol abuse PRIOR TWO WEEKS due to financial stress when she lost some Meteorological Technician work.  Discharged to detox and follow up with Fresenius Medical Care Center.\n'
# 5
# fix one character right before Month, like yAug, 2004
df5_2 = df.str.extract(r'^[\D\. ]?(?P<day>)(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,. ]*(?P<year>\d{4})\D').dropna(how = 'any')#.iloc[]

df5_2
day month year
235 Oct 2015
245 November 1990
247 May 1983
251 December 1998
256 Aug 1988
266 September 1999
270 May 2006
276 April 1986
281 Aug 2004
283 Feb 1977
285 Sep 1983
294 February 1983
299 March 1974
306 May 2004
333 November 1997
341 Nov 2007
for index, row in df5_2.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df5_2
day month year
235 10 2015
245 11 1990
247 5 1983
251 12 1998
256 8 1988
266 9 1999
270 5 2006
276 4 1986
281 8 2004
283 2 1977
285 9 1983
294 2 1983
299 3 1974
306 5 2004
333 11 1997
341 11 2007
# find missing indices after the first search
missing_indices = [index 
                   for index in range(df5_1.index[0], df5_1.index[len(df5_1.index)-1])
                   if index not in (df5_1.index | df5_2.index)]
missing_indices
[321]

Still, record 321 is missing.

df.iloc[321]
'2June, 1999 Audit C Score Current:\n'
# 5
# fix one number right before Month, like 2June, 1999
df5_3 = df.str.extract(r'[^]*[\D\. ](?P<day>)(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,. ]*(?P<year>\d{4})\D').dropna(how = 'any')#.iloc[]
df5_3
day month year
321 June 1999
for index, row in df5_3.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df5_3
day month year
321 6 1999

6. 6/2008; 12/2009 format

# 6 
df6 = df.str.extract(r'(?:^| |~|[a-z])+-?(?P<day>)(?P<month>\d{1,2})[/-](?P<year>\d{4})').dropna(how = 'any')#.iloc[343:454]
df6
day month year
343 6 1998
344 6 2005
345 10 1973
346 9 2005
347 03 1980
... ... ... ...
440 6 1989
441 9 1980
442 9 1992
443 9 2000
444 7 1981

102 rows × 3 columns

Concatenate the extracted dataframes

import numpy as np
import pandas as pd

frames = [df1, df2, df3,df5_1,df5_2 ,df5_3,df6]

result = pd.concat(frames)
# replace empy string with 1
result = result.replace(r'^\s*$', 1, regex=True)
result


month day year
0 03 25 93
1 6 18 85
2 7 8 71
3 9 27 75
4 2 6 96
... ... ... ...
440 6 1 1989
441 9 1 1980
442 9 1 1992
443 9 1 2000
444 7 1 1981

445 rows × 3 columns

df.shape
(445,)

The numbers of rows of the concatenated result and the original text file are the same. Great!

Final text cleaning of concatenated dataframe

  • Add leading zero.
  • Fix the year number with only two digits.
  • Merge the columns and convert the merged column to date type.
# add the leading 0 and add 19 for those two-digit year
for index, row in result.iterrows():
    # print(len(row['year']))
    if len(row['year'])<3 :
      row['year'] = '19' + row['year']
    if len(str(row['day'])) == 1 :
      row['day'] = '0' + str(row['day'])
    if len(str(row['month'])) == 1 :
      row['month'] = '0' + str(row['month'])
result


month day year
0 03 25 1993
1 06 18 1985
2 07 08 1971
3 09 27 1975
4 02 06 1996
... ... ... ...
440 06 01 1989
441 09 01 1980
442 09 01 1992
443 09 01 2000
444 07 01 1981

445 rows × 3 columns

result = result.applymap(str)
# add merged column "combine"
result["combine"] =  result["year"] + result["month"] + result["day"]

result['combine'] = result['combine'].astype(int)
# check rows with NULL
# is_NaN = df_findNA.isnull()
# row_has_NaN = is_NaN.any(axis=1)
# rows_with_NaN = df_findNA[row_has_NaN]
# rows_with_NaN

# check whch rows have unconverted month name
# row_has_NaN = result['combine'].str.contains('April111990')
# rows_with_NaN = result[row_has_NaN ]
# rows_with_NaN
print (result.dtypes)

0      19930325
1      19850618
2      19710708
3      19750927
4      19960206
         ...   
440    19890601
441    19800901
442    19920901
443    20000901
444    19810701
Name: combine, Length: 445, dtype: int64

The dtype and form are perfectly cleaned.

result['combine']
month      object
day        object
year       object
combine     int64
dtype: object

Sort the date


result['combine'] = pd.to_datetime(result['combine'], format='%Y%m%d')

date_for_sort = pd.to_datetime(result['combine'], format='%Y%m%d')

date_for_sort = pd.DataFrame(date_for_sort)
date_for_sort
combine
0 1993-03-25
1 1985-06-18
2 1971-07-08
3 1975-09-27
4 1996-02-06
... ...
440 1989-06-01
441 1980-09-01
442 1992-09-01
443 2000-09-01
444 1981-07-01

445 rows × 1 columns

date_for_sort = date_for_sort.sort_values(ascending=True, by = 'combine')
date_for_sort.reset_index(inplace=True)
date_for_sort
index combine
0 9 1971-04-10
1 84 1971-05-18
2 2 1971-07-08
3 53 1971-07-11
4 28 1971-09-12
... ... ...
440 427 2016-05-01
441 141 2016-05-30
442 186 2016-10-13
443 161 2016-10-19
444 413 2016-11-01

445 rows × 2 columns

date_for_sort['index']
0        9
1       84
2        2
3       53
4       28
      ... 
440    427
441    141
442    186
443    161
444    413
Name: index, Length: 445, dtype: int64

Some takeaway

  1. Be patient. In order to make this post looks neat, I deleted many testing codes and false extractions. Do not lose heart if you tried multiple times but yet could not extract the desired date format from the text.
  2. Do preliminary cleaning when importing the file. In this exercise I only remove the paranthesis when reading the file. I believe that if I could do more cleaning already when reading the file, the following text cleaning will be easier. For example, pre-clean dirty text like 'c1995', 'k1976'.

Thank you for reading, if there is any question please do not hesitate to let me know!

Prudence is a fountain of life to the prudent.