NLP 1:Text Mining Application in Python (RegEx)

2021-06-15 · 21 min read

In this exercise, we'll be working with messy medical data and using RegEx in Python to extract dates of different formats. The goal of this exercise is to correctly identify all of the different date variants encoded in this dataset and to properly standardize and sort the dates.

Introduction

Each line of the dates.txt file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.

Here is a list of some of the possible variants in this dataset:

04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009

Once we extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:

Assume all dates in xx/xx/xx format are mm/dd/yy
Assume all dates where year is encoded in only two digits are years from the 1900's (e.g. 1/5/89 is January 5th, 1989)
If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
There could be potential typos as this is a raw, real-life derived dataset.

With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.

For example if the original series was this:

We should return this:

As a result, we could know the chronological order of the medical records in the text file.

Useful cheetsheet

Meta-characters: Character matches

.: wildcard, matches a single character
^: start of a string
$: end of a string
[]: matches one of the set of characters within []
[a-z]: matches one of the range of characters a, b, ..., z
a|b: matches either a or b, where a and b are strings
(): Scoping for operators.
backslash: escape character for special characters (\t,\n,\b)
\b: matches word boundary
\d: any digit, equivalent to [^0-9]
\D: any non-digit, equivalent to [^0-9]
\s: any whitespace, equivalent to [\t\n\r\f\v]
\S: any non-whitespace, equivalent to [^ \t\n\r\f\v]
\w: Alphanumeric character, euivalent to [a-zA-Z0-9]
\W: Non-alphanumeric, equivalent to [^a-zA-Z0-9]

Meta-characters: Repetitions

*: matches zero or more occurences
+: matches one or more occurrences
?: matches zero or one occurences
{n}: exactly n repetitions, 0 $\leq$ n
{,n}: at most n repetitions
{m,n}: at least m and at most n repetitions

Import data

I am doing this exercise on Google Colab. We can easily import the data from Google Drive to Colab by using the following code. More detailed instruction can be found here.

Data is available here.

When reading the text, I removed the paranthesis in order to facilitate text cleaning.

# use the following code if you want to connect colab to google drive
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

import pandas as pd
import re as re

doc = []
with open('/content/drive/MyDrive/Bolin_DSPost/dates.txt') as file:
    for line in file:
      # remove the paranthesis in each record
        line = re.sub('[()]', '', line)
        doc.append(line)

df = pd.Series(doc)
df.head(10)

0           03/25/93 Total time of visit in minutes:\n
1                       6/18/85 Primary Care Doctor:\n
2    sshe plans to move as of 7/8/71 In-Home Servic...
3                7 on 9/27/75 Audit C Score Current:\n
4    2/6/96 sleep studyPain Treatment Pain Level Nu...
5                    .Per 7/06/79 Movement D/O note:\n
6    4, 5/18/78 Patient's thoughts about current su...
7    10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                         3/7/86 SOS-10 Total Score:\n
9               4/10/71Score-1Audit C Score Current:\n
dtype: object

Extract date and text cleaning

Now we will proceed to cleaning the data. Instead of cleaning all the 6 types mentioned above at once, we will clean one type at a step. It will make debug more easily.

Some useful resources of dealing with free text are listed as follows:

Regular Expressions
Regular expressions documentation in Python 3

Tips and tricks of the trade for cleaning text in Python

https://chrisalbon.com/python/basics/cleaning_text/

https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/

http://ieva.rocks/2016/08/07/cleaning-text-for-nlp/

https://chrisalbon.com/python/cleaning_text.html

1. mm/dd/yy format

# 1 
df1 = df.str.extract(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})').dropna(how = 'any')
df1
#done

	month	day	year
0	03	25	93
1	6	18	85
2	7	8	71
3	9	27	75
4	2	6	96
...	...	...	...
120	12	22	98
121	10	02	96
122	11	05	90
123	5	04	77
124	2	27	96

125 rows × 3 columns

2. Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009 format

# 2
df2 = df.str.extract(r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.-]+(?P<day>\d{2,4})[ ,.-]+(?P<year>\d{2,4})').dropna(how = 'any')#.iloc[125]
df2
# done

	month	day	year
194	April	11	1990
195	May	30	2001
196	Feb	18	1994
197	February	18	1981
198	October	11	2013
199	Jan	24	1986
200	July	26	1978
201	December	23	1999
202	May	15	1989
203	September	06	1995
204	Mar	10	1976
205	Jan	27	1983
206	October	23	1990
207	August	12	2004
208	September	01	2012
209	July	25	1983
210	August	11	1989
211	April	17	1992
212	July	24	1999
213	July	11	1997
214	Sep	10	1974
215	August	14	1981
216	Nov	11	1988
217	June	13	2011
218	May	14	1989
219	Dec	14	1975
220	June	25	2012
221	Oct	18	1980
222	May	15	1998
223	October	14	1974
224	July	25	1998
225	June	15	1972
226	January	07	1991
227	September	15	2011

Change month name to a number

# # Change month name to a number
def month_string_to_number(string):
    m = {
        'jan': 1,
        'feb': 2,
        'mar': 3,
        'apr':4,
         'may':5,
         'jun':6,
         'jul':7,
         'aug':8,
         'sep':9,
         'oct':10,
         'nov':11,
         'dec':12
        }
    s = string.strip()[:3].lower()

    try:
        out = m[s]
        return out
    except:
        raise ValueError('Not a month')

# test if the function works
month_string_to_number('October')

Cool, it works!

for index, row in df2.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df2

	month	day	year
194	4	11	1990
195	5	30	2001
196	2	18	1994
197	2	18	1981
198	10	11	2013
199	1	24	1986
200	7	26	1978
201	12	23	1999
202	5	15	1989
203	9	06	1995
204	3	10	1976
205	1	27	1983
206	10	23	1990
207	8	12	2004
208	9	01	2012
209	7	25	1983
210	8	11	1989
211	4	17	1992
212	7	24	1999
213	7	11	1997
214	9	10	1974
215	8	14	1981
216	11	11	1988
217	6	13	2011
218	5	14	1989
219	12	14	1975
220	6	25	2012
221	10	18	1980
222	5	15	1998
223	10	14	1974
224	7	25	1998
225	6	15	1972
226	1	07	1991
227	9	15	2011

3. 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009 format

# 3
df3 = df.str.extract(r'(?P<day>\d{1,2})[ ,.-](?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.-](?P<year>\d{2,4})').dropna(how = 'any')#.iloc[125]
#done
df3

	day	month	year
125	24	Jan	2001
126	10	Sep	2004
127	26	May	1982
128	28	June	2002
129	06	May	1972
...	...	...	...
189	21	Oct	1977
190	10	Aug	2000
191	30	Nov	1972
192	06	May	1993
193	18	Jan	1995

69 rows × 3 columns

# change month name to number
for index, row in df3.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df3

	day	month	year
125	24	1	2001
126	10	9	2004
127	26	5	1982
128	28	6	2002
129	06	5	1972
...	...	...	...
189	21	10	1977
190	10	8	2000
191	30	11	1972
192	06	5	1993
193	18	1	1995

69 rows × 3 columns

4. Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009 format

# 4
df.str.extract(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.](\d{2,4})(?:th|st|rd), (\d{2,4})').dropna(how = 'any')

	0	1	2

No record is written in this format.

5. Feb 2009; Sep 2009; Oct 2010 format

Many records with date in this format have different noises. Therefore for this format I will first make a general regex expreesion to extract the majority date records in this formate. Then I will search for the records that are not extracted, find the reason and modify the regex expression.

# 5
df5_1 = df.str.extract(r'\D(?P<day> )(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),?[ ](?P<year>\d{4})').dropna(how = 'any')#.iloc[125]
df5_1

	day	month	year
228		September	1985
229		June	2011
230		May	1986
231		May	2016
232		July	1977
...	...	...	...
337		Dec	2007
338		Apr	1998
339		March	2005
340		May	1980
342		March	1976

98 rows × 3 columns

for index, row in df5_1.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df5_1

	day	month	year
228		9	1985
229		6	2011
230		5	1986
231		5	2016
232		7	1977
...	...	...	...
337		12	2007
338		4	1998
339		3	2005
340		5	1980
342		3	1976

98 rows × 3 columns

Find missing indices:

# find missing indices
missing_indices = [index 
                   for index in range(df5_1.index[0], df5_1.index[len(df5_1.index)-1])
                   if index not in df5_1.index]

missing_indices

We can see that some records are missing, e.g. 341. Let us see how this record looks like.

df.iloc[341]

'.Nov 2007- Evaluation and treatment at CH ED/APS for acute alcohol and medication OD, with history of depression relapse and alcohol abuse PRIOR TWO WEEKS due to financial stress when she lost some Meteorological Technician work.  Discharged to detox and follow up with Fresenius Medical Care Center.\n'

# 5
# fix one character right before Month, like yAug, 2004
df5_2 = df.str.extract(r'^[\D\. ]?(?P<day>)(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,. ]*(?P<year>\d{4})\D').dropna(how = 'any')#.iloc[]

df5_2

	month	year
235	Oct	2015
245	November	1990
247	May	1983
251	December	1998
256	Aug	1988
266	September	1999
270	May	2006
276	April	1986
281	Aug	2004
283	Feb	1977
285	Sep	1983
294	February	1983
299	March	1974
306	May	2004
333	November	1997
341	Nov	2007

for index, row in df5_2.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df5_2

	month	year
235	10	2015
245	11	1990
247	5	1983
251	12	1998
256	8	1988
266	9	1999
270	5	2006
276	4	1986
281	8	2004
283	2	1977
285	9	1983
294	2	1983
299	3	1974
306	5	2004
333	11	1997
341	11	2007

# find missing indices after the first search
missing_indices = [index 
                   for index in range(df5_1.index[0], df5_1.index[len(df5_1.index)-1])
                   if index not in (df5_1.index | df5_2.index)]
missing_indices

[321]

Still, record 321 is missing.

df.iloc[321]

'2June, 1999 Audit C Score Current:\n'

# 5
# fix one number right before Month, like 2June, 1999
df5_3 = df.str.extract(r'[^]*[\D\. ](?P<day>)(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,. ]*(?P<year>\d{4})\D').dropna(how = 'any')#.iloc[]
df5_3

	day	month	year
321		June	1999

for index, row in df5_3.iterrows():
    # print(row['month'])
    row['month'] = month_string_to_number(row['month'])
df5_3

	day	month	year
321		6	1999

6. 6/2008; 12/2009 format

# 6 
df6 = df.str.extract(r'(?:^| |~|[a-z])+-?(?P<day>)(?P<month>\d{1,2})[/-](?P<year>\d{4})').dropna(how = 'any')#.iloc[343:454]
df6

	day	month	year
343		6	1998
344		6	2005
345		10	1973
346		9	2005
347		03	1980
...	...	...	...
440		6	1989
441		9	1980
442		9	1992
443		9	2000
444		7	1981

102 rows × 3 columns

Concatenate the extracted dataframes

import numpy as np
import pandas as pd

frames = [df1, df2, df3,df5_1,df5_2 ,df5_3,df6]

result = pd.concat(frames)
# replace empy string with 1
result = result.replace(r'^\s*$', 1, regex=True)
result

	month	day	year
0	03	25	93
1	6	18	85
2	7	8	71
3	9	27	75
4	2	6	96
...	...	...	...
440	6	1	1989
441	9	1	1980
442	9	1	1992
443	9	1	2000
444	7	1	1981

445 rows × 3 columns

df.shape

(445,)

The numbers of rows of the concatenated result and the original text file are the same. Great!

Final text cleaning of concatenated dataframe

Add leading zero.
Fix the year number with only two digits.
Merge the columns and convert the merged column to date type.

# add the leading 0 and add 19 for those two-digit year
for index, row in result.iterrows():
    # print(len(row['year']))
    if len(row['year'])<3 :
      row['year'] = '19' + row['year']
    if len(str(row['day'])) == 1 :
      row['day'] = '0' + str(row['day'])
    if len(str(row['month'])) == 1 :
      row['month'] = '0' + str(row['month'])
result

	month	day	year
0	03	25	1993
1	06	18	1985
2	07	08	1971
3	09	27	1975
4	02	06	1996
...	...	...	...
440	06	01	1989
441	09	01	1980
442	09	01	1992
443	09	01	2000
444	07	01	1981

445 rows × 3 columns

result = result.applymap(str)

# add merged column "combine"
result["combine"] =  result["year"] + result["month"] + result["day"]

result['combine'] = result['combine'].astype(int)

# check rows with NULL
# is_NaN = df_findNA.isnull()
# row_has_NaN = is_NaN.any(axis=1)
# rows_with_NaN = df_findNA[row_has_NaN]
# rows_with_NaN

# check whch rows have unconverted month name
# row_has_NaN = result['combine'].str.contains('April111990')
# rows_with_NaN = result[row_has_NaN ]
# rows_with_NaN

print (result.dtypes)

0      19930325
1      19850618
2      19710708
3      19750927
4      19960206
         ...   
440    19890601
441    19800901
442    19920901
443    20000901
444    19810701
Name: combine, Length: 445, dtype: int64

The dtype and form are perfectly cleaned.

result['combine']

month      object
day        object
year       object
combine     int64
dtype: object

Sort the date


result['combine'] = pd.to_datetime(result['combine'], format='%Y%m%d')

date_for_sort = pd.to_datetime(result['combine'], format='%Y%m%d')

date_for_sort = pd.DataFrame(date_for_sort)
date_for_sort

	combine
0	1993-03-25
1	1985-06-18
2	1971-07-08
3	1975-09-27
4	1996-02-06
...	...
440	1989-06-01
441	1980-09-01
442	1992-09-01
443	2000-09-01
444	1981-07-01

445 rows × 1 columns

date_for_sort = date_for_sort.sort_values(ascending=True, by = 'combine')
date_for_sort.reset_index(inplace=True)
date_for_sort

	index	combine
0	9	1971-04-10
1	84	1971-05-18
2	2	1971-07-08
3	53	1971-07-11
4	28	1971-09-12
...	...	...
440	427	2016-05-01
441	141	2016-05-30
442	186	2016-10-13
443	161	2016-10-19
444	413	2016-11-01

445 rows × 2 columns

date_for_sort['index']

0        9
1       84
2        2
3       53
4       28
      ... 
440    427
441    141
442    186
443    161
444    413
Name: index, Length: 445, dtype: int64

Some takeaway

Be patient. In order to make this post looks neat, I deleted many testing codes and false extractions. Do not lose heart if you tried multiple times but yet could not extract the desired date format from the text.
Do preliminary cleaning when importing the file. In this exercise I only remove the paranthesis when reading the file. I believe that if I could do more cleaning already when reading the file, the following text cleaning will be easier. For example, pre-clean dirty text like 'c1995', 'k1976'.

Thank you for reading, if there is any question please do not hesitate to let me know!

Introduction
- Useful cheetsheet
  - Meta-characters: Character matches
  - Meta-characters: Repetitions
Import data
Extract date and text cleaning
Concatenate the extracted dataframes
Final text cleaning of concatenated dataframe
Sort the date
Some takeaway

	month	day	year
0	03	25	93
1	6	18	85
2	7	8	71
3	9	27	75
4	2	6	96
...	...	...	...
120	12	22	98
121	10	02	96
122	11	05	90
123	5	04	77
124	2	27	96

	month	day	year
0	03	25	93
1	6	18	85
2	7	8	71
3	9	27	75
4	2	6	96
...	...	...	...
120	12	22	98
121	10	02	96
122	11	05	90
123	5	04	77
124	2	27	96

	month	day	year
0	03	25	93
1	6	18	85
2	7	8	71
3	9	27	75
4	2	6	96
...	...	...	...
120	12	22	98
121	10	02	96
122	11	05	90
123	5	04	77
124	2	27	96