In this exercise, we'll be working with messy medical data and using RegEx in Python to extract dates of different formats. The goal of this exercise is to correctly identify all of the different date variants encoded in this dataset and to properly standardize and sort the dates.
Each line of the dates.txt file corresponds to a medical note. Each note has a date that needs to be extracted, but each date is encoded in one of many formats.
Here is a list of some of the possible variants in this dataset:
Once we extracted these date patterns from the text, the next step is to sort them in ascending chronological order accoring to the following rules:
With these rules in mind, find the correct date in each note and return a pandas Series in chronological order of the original Series' indices.
For example if the original series was this:
0 1999
1 2010
2 1978
3 2015
4 1985
We should return this:
0 2
1 4
2 0
3 1
4 3
As a result, we could know the chronological order of the medical records in the text file.
I am doing this exercise on Google Colab. We can easily import the data from Google Drive to Colab by using the following code. More detailed instruction can be found here.
Data is available here.
When reading the text, I removed the paranthesis in order to facilitate text cleaning.
# use the following code if you want to connect colab to google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import pandas as pd
import re as re
doc = []
with open('/content/drive/MyDrive/Bolin_DSPost/dates.txt') as file:
for line in file:
# remove the paranthesis in each record
line = re.sub('[()]', '', line)
doc.append(line)
df = pd.Series(doc)
df.head(10)
0 03/25/93 Total time of visit in minutes:\n
1 6/18/85 Primary Care Doctor:\n
2 sshe plans to move as of 7/8/71 In-Home Servic...
3 7 on 9/27/75 Audit C Score Current:\n
4 2/6/96 sleep studyPain Treatment Pain Level Nu...
5 .Per 7/06/79 Movement D/O note:\n
6 4, 5/18/78 Patient's thoughts about current su...
7 10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8 3/7/86 SOS-10 Total Score:\n
9 4/10/71Score-1Audit C Score Current:\n
dtype: object
Now we will proceed to cleaning the data. Instead of cleaning all the 6 types mentioned above at once, we will clean one type at a step. It will make debug more easily.
Some useful resources of dealing with free text are listed as follows:
Regular Expressions
Regular expressions documentation in Python 3
Tips and tricks of the trade for cleaning text in Python
https://chrisalbon.com/python/basics/cleaning_text/
https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/
http://ieva.rocks/2016/08/07/cleaning-text-for-nlp/
https://chrisalbon.com/python/cleaning_text.html
# 1
df1 = df.str.extract(r'(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2,4})').dropna(how = 'any')
df1
#done
month | day | year | |
---|---|---|---|
0 | 03 | 25 | 93 |
1 | 6 | 18 | 85 |
2 | 7 | 8 | 71 |
3 | 9 | 27 | 75 |
4 | 2 | 6 | 96 |
... | ... | ... | ... |
120 | 12 | 22 | 98 |
121 | 10 | 02 | 96 |
122 | 11 | 05 | 90 |
123 | 5 | 04 | 77 |
124 | 2 | 27 | 96 |
125 rows × 3 columns
# 2
df2 = df.str.extract(r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.-]+(?P<day>\d{2,4})[ ,.-]+(?P<year>\d{2,4})').dropna(how = 'any')#.iloc[125]
df2
# done
month | day | year | |
---|---|---|---|
194 | April | 11 | 1990 |
195 | May | 30 | 2001 |
196 | Feb | 18 | 1994 |
197 | February | 18 | 1981 |
198 | October | 11 | 2013 |
199 | Jan | 24 | 1986 |
200 | July | 26 | 1978 |
201 | December | 23 | 1999 |
202 | May | 15 | 1989 |
203 | September | 06 | 1995 |
204 | Mar | 10 | 1976 |
205 | Jan | 27 | 1983 |
206 | October | 23 | 1990 |
207 | August | 12 | 2004 |
208 | September | 01 | 2012 |
209 | July | 25 | 1983 |
210 | August | 11 | 1989 |
211 | April | 17 | 1992 |
212 | July | 24 | 1999 |
213 | July | 11 | 1997 |
214 | Sep | 10 | 1974 |
215 | August | 14 | 1981 |
216 | Nov | 11 | 1988 |
217 | June | 13 | 2011 |
218 | May | 14 | 1989 |
219 | Dec | 14 | 1975 |
220 | June | 25 | 2012 |
221 | Oct | 18 | 1980 |
222 | May | 15 | 1998 |
223 | October | 14 | 1974 |
224 | July | 25 | 1998 |
225 | June | 15 | 1972 |
226 | January | 07 | 1991 |
227 | September | 15 | 2011 |
# # Change month name to a number
def month_string_to_number(string):
m = {
'jan': 1,
'feb': 2,
'mar': 3,
'apr':4,
'may':5,
'jun':6,
'jul':7,
'aug':8,
'sep':9,
'oct':10,
'nov':11,
'dec':12
}
s = string.strip()[:3].lower()
try:
out = m[s]
return out
except:
raise ValueError('Not a month')
# test if the function works
month_string_to_number('October')
10
Cool, it works!
for index, row in df2.iterrows():
# print(row['month'])
row['month'] = month_string_to_number(row['month'])
df2
month | day | year | |
---|---|---|---|
194 | 4 | 11 | 1990 |
195 | 5 | 30 | 2001 |
196 | 2 | 18 | 1994 |
197 | 2 | 18 | 1981 |
198 | 10 | 11 | 2013 |
199 | 1 | 24 | 1986 |
200 | 7 | 26 | 1978 |
201 | 12 | 23 | 1999 |
202 | 5 | 15 | 1989 |
203 | 9 | 06 | 1995 |
204 | 3 | 10 | 1976 |
205 | 1 | 27 | 1983 |
206 | 10 | 23 | 1990 |
207 | 8 | 12 | 2004 |
208 | 9 | 01 | 2012 |
209 | 7 | 25 | 1983 |
210 | 8 | 11 | 1989 |
211 | 4 | 17 | 1992 |
212 | 7 | 24 | 1999 |
213 | 7 | 11 | 1997 |
214 | 9 | 10 | 1974 |
215 | 8 | 14 | 1981 |
216 | 11 | 11 | 1988 |
217 | 6 | 13 | 2011 |
218 | 5 | 14 | 1989 |
219 | 12 | 14 | 1975 |
220 | 6 | 25 | 2012 |
221 | 10 | 18 | 1980 |
222 | 5 | 15 | 1998 |
223 | 10 | 14 | 1974 |
224 | 7 | 25 | 1998 |
225 | 6 | 15 | 1972 |
226 | 1 | 07 | 1991 |
227 | 9 | 15 | 2011 |
# 3
df3 = df.str.extract(r'(?P<day>\d{1,2})[ ,.-](?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.-](?P<year>\d{2,4})').dropna(how = 'any')#.iloc[125]
#done
df3
day | month | year | |
---|---|---|---|
125 | 24 | Jan | 2001 |
126 | 10 | Sep | 2004 |
127 | 26 | May | 1982 |
128 | 28 | June | 2002 |
129 | 06 | May | 1972 |
... | ... | ... | ... |
189 | 21 | Oct | 1977 |
190 | 10 | Aug | 2000 |
191 | 30 | Nov | 1972 |
192 | 06 | May | 1993 |
193 | 18 | Jan | 1995 |
69 rows × 3 columns
# change month name to number
for index, row in df3.iterrows():
# print(row['month'])
row['month'] = month_string_to_number(row['month'])
df3
day | month | year | |
---|---|---|---|
125 | 24 | 1 | 2001 |
126 | 10 | 9 | 2004 |
127 | 26 | 5 | 1982 |
128 | 28 | 6 | 2002 |
129 | 06 | 5 | 1972 |
... | ... | ... | ... |
189 | 21 | 10 | 1977 |
190 | 10 | 8 | 2000 |
191 | 30 | 11 | 1972 |
192 | 06 | 5 | 1993 |
193 | 18 | 1 | 1995 |
69 rows × 3 columns
# 4
df.str.extract(r'((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[ ,.](\d{2,4})(?:th|st|rd), (\d{2,4})').dropna(how = 'any')
0 | 1 | 2 |
---|
No record is written in this format.
Many records with date in this format have different noises. Therefore for this format I will first make a general regex expreesion to extract the majority date records in this formate. Then I will search for the records that are not extracted, find the reason and modify the regex expression.
# 5
df5_1 = df.str.extract(r'\D(?P<day> )(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*),?[ ](?P<year>\d{4})').dropna(how = 'any')#.iloc[125]
df5_1
day | month | year | |
---|---|---|---|
228 | September | 1985 | |
229 | June | 2011 | |
230 | May | 1986 | |
231 | May | 2016 | |
232 | July | 1977 | |
... | ... | ... | ... |
337 | Dec | 2007 | |
338 | Apr | 1998 | |
339 | March | 2005 | |
340 | May | 1980 | |
342 | March | 1976 |
98 rows × 3 columns
for index, row in df5_1.iterrows():
# print(row['month'])
row['month'] = month_string_to_number(row['month'])
df5_1
day | month | year | |
---|---|---|---|
228 | 9 | 1985 | |
229 | 6 | 2011 | |
230 | 5 | 1986 | |
231 | 5 | 2016 | |
232 | 7 | 1977 | |
... | ... | ... | ... |
337 | 12 | 2007 | |
338 | 4 | 1998 | |
339 | 3 | 2005 | |
340 | 5 | 1980 | |
342 | 3 | 1976 |
98 rows × 3 columns
Find missing indices:
# find missing indices
missing_indices = [index
for index in range(df5_1.index[0], df5_1.index[len(df5_1.index)-1])
if index not in df5_1.index]
missing_indices
[235,
245,
247,
251,
256,
266,
270,
276,
281,
283,
285,
294,
299,
306,
321,
333,
341]
We can see that some records are missing, e.g. 341. Let us see how this record looks like.
df.iloc[341]
'.Nov 2007- Evaluation and treatment at CH ED/APS for acute alcohol and medication OD, with history of depression relapse and alcohol abuse PRIOR TWO WEEKS due to financial stress when she lost some Meteorological Technician work. Discharged to detox and follow up with Fresenius Medical Care Center.\n'
# 5
# fix one character right before Month, like yAug, 2004
df5_2 = df.str.extract(r'^[\D\. ]?(?P<day>)(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,. ]*(?P<year>\d{4})\D').dropna(how = 'any')#.iloc[]
df5_2
day | month | year | |
---|---|---|---|
235 | Oct | 2015 | |
245 | November | 1990 | |
247 | May | 1983 | |
251 | December | 1998 | |
256 | Aug | 1988 | |
266 | September | 1999 | |
270 | May | 2006 | |
276 | April | 1986 | |
281 | Aug | 2004 | |
283 | Feb | 1977 | |
285 | Sep | 1983 | |
294 | February | 1983 | |
299 | March | 1974 | |
306 | May | 2004 | |
333 | November | 1997 | |
341 | Nov | 2007 |
for index, row in df5_2.iterrows():
# print(row['month'])
row['month'] = month_string_to_number(row['month'])
df5_2
day | month | year | |
---|---|---|---|
235 | 10 | 2015 | |
245 | 11 | 1990 | |
247 | 5 | 1983 | |
251 | 12 | 1998 | |
256 | 8 | 1988 | |
266 | 9 | 1999 | |
270 | 5 | 2006 | |
276 | 4 | 1986 | |
281 | 8 | 2004 | |
283 | 2 | 1977 | |
285 | 9 | 1983 | |
294 | 2 | 1983 | |
299 | 3 | 1974 | |
306 | 5 | 2004 | |
333 | 11 | 1997 | |
341 | 11 | 2007 |
# find missing indices after the first search
missing_indices = [index
for index in range(df5_1.index[0], df5_1.index[len(df5_1.index)-1])
if index not in (df5_1.index | df5_2.index)]
missing_indices
[321]
Still, record 321 is missing.
df.iloc[321]
'2June, 1999 Audit C Score Current:\n'
# 5
# fix one number right before Month, like 2June, 1999
df5_3 = df.str.extract(r'[^]*[\D\. ](?P<day>)(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)[,. ]*(?P<year>\d{4})\D').dropna(how = 'any')#.iloc[]
df5_3
day | month | year | |
---|---|---|---|
321 | June | 1999 |
for index, row in df5_3.iterrows():
# print(row['month'])
row['month'] = month_string_to_number(row['month'])
df5_3
day | month | year | |
---|---|---|---|
321 | 6 | 1999 |
# 6
df6 = df.str.extract(r'(?:^| |~|[a-z])+-?(?P<day>)(?P<month>\d{1,2})[/-](?P<year>\d{4})').dropna(how = 'any')#.iloc[343:454]
df6
day | month | year | |
---|---|---|---|
343 | 6 | 1998 | |
344 | 6 | 2005 | |
345 | 10 | 1973 | |
346 | 9 | 2005 | |
347 | 03 | 1980 | |
... | ... | ... | ... |
440 | 6 | 1989 | |
441 | 9 | 1980 | |
442 | 9 | 1992 | |
443 | 9 | 2000 | |
444 | 7 | 1981 |
102 rows × 3 columns
import numpy as np
import pandas as pd
frames = [df1, df2, df3,df5_1,df5_2 ,df5_3,df6]
result = pd.concat(frames)
# replace empy string with 1
result = result.replace(r'^\s*$', 1, regex=True)
result
month | day | year | |
---|---|---|---|
0 | 03 | 25 | 93 |
1 | 6 | 18 | 85 |
2 | 7 | 8 | 71 |
3 | 9 | 27 | 75 |
4 | 2 | 6 | 96 |
... | ... | ... | ... |
440 | 6 | 1 | 1989 |
441 | 9 | 1 | 1980 |
442 | 9 | 1 | 1992 |
443 | 9 | 1 | 2000 |
444 | 7 | 1 | 1981 |
445 rows × 3 columns
df.shape
(445,)
The numbers of rows of the concatenated result and the original text file are the same. Great!
# add the leading 0 and add 19 for those two-digit year
for index, row in result.iterrows():
# print(len(row['year']))
if len(row['year'])<3 :
row['year'] = '19' + row['year']
if len(str(row['day'])) == 1 :
row['day'] = '0' + str(row['day'])
if len(str(row['month'])) == 1 :
row['month'] = '0' + str(row['month'])
result
month | day | year | |
---|---|---|---|
0 | 03 | 25 | 1993 |
1 | 06 | 18 | 1985 |
2 | 07 | 08 | 1971 |
3 | 09 | 27 | 1975 |
4 | 02 | 06 | 1996 |
... | ... | ... | ... |
440 | 06 | 01 | 1989 |
441 | 09 | 01 | 1980 |
442 | 09 | 01 | 1992 |
443 | 09 | 01 | 2000 |
444 | 07 | 01 | 1981 |
445 rows × 3 columns
result = result.applymap(str)
# add merged column "combine"
result["combine"] = result["year"] + result["month"] + result["day"]
result['combine'] = result['combine'].astype(int)
# check rows with NULL
# is_NaN = df_findNA.isnull()
# row_has_NaN = is_NaN.any(axis=1)
# rows_with_NaN = df_findNA[row_has_NaN]
# rows_with_NaN
# check whch rows have unconverted month name
# row_has_NaN = result['combine'].str.contains('April111990')
# rows_with_NaN = result[row_has_NaN ]
# rows_with_NaN
print (result.dtypes)
0 19930325
1 19850618
2 19710708
3 19750927
4 19960206
...
440 19890601
441 19800901
442 19920901
443 20000901
444 19810701
Name: combine, Length: 445, dtype: int64
The dtype and form are perfectly cleaned.
result['combine']
month object
day object
year object
combine int64
dtype: object
result['combine'] = pd.to_datetime(result['combine'], format='%Y%m%d')
date_for_sort = pd.to_datetime(result['combine'], format='%Y%m%d')
date_for_sort = pd.DataFrame(date_for_sort)
date_for_sort
combine | |
---|---|
0 | 1993-03-25 |
1 | 1985-06-18 |
2 | 1971-07-08 |
3 | 1975-09-27 |
4 | 1996-02-06 |
... | ... |
440 | 1989-06-01 |
441 | 1980-09-01 |
442 | 1992-09-01 |
443 | 2000-09-01 |
444 | 1981-07-01 |
445 rows × 1 columns
date_for_sort = date_for_sort.sort_values(ascending=True, by = 'combine')
date_for_sort.reset_index(inplace=True)
date_for_sort
index | combine | |
---|---|---|
0 | 9 | 1971-04-10 |
1 | 84 | 1971-05-18 |
2 | 2 | 1971-07-08 |
3 | 53 | 1971-07-11 |
4 | 28 | 1971-09-12 |
... | ... | ... |
440 | 427 | 2016-05-01 |
441 | 141 | 2016-05-30 |
442 | 186 | 2016-10-13 |
443 | 161 | 2016-10-19 |
444 | 413 | 2016-11-01 |
445 rows × 2 columns
date_for_sort['index']
0 9
1 84
2 2
3 53
4 28
...
440 427
441 141
442 186
443 161
444 413
Name: index, Length: 445, dtype: int64
Thank you for reading, if there is any question please do not hesitate to let me know!