I have a somewhat complex function that I am having difficulty writing. Essentially, I have a df that stores medical records, and I need to identify the first site that a person goes to after their discharge date (I wish it were as simple as choosing the first location after the initial stay, but it's not). The df is grouped by ID.
There are 3 options:
(1) Within a group, if any row has a begin_date that matches the first row's end_date, return that location as the first site (if two rows meet this condition, either is correct).
(2) If the first option does not apply and the patient has any row with location 'Health', return 'Health'.
(3) Otherwise, if neither condition 1 nor condition 2 is met, return 'Home'.
df
ID color begin_date end_date location
1 red 2017-01-01 2017-01-07 initial
1 green 2017-01-05 2017-01-07 nursing
1 blue 2017-01-07 2017-01-15 rehab
1 red 2017-01-11 2017-01-22 Health
2 red 2017-02-22 2017-02-26 initial
2 green 2017-02-26 2017-02-28 nursing
2 blue 2017-02-26 2017-02-28 rehab
3 red 2017-03-11 2017-03-22 initial
4 red 2017-04-01 2017-04-07 initial
4 green 2017-04-05 2017-04-07 nursing
4 blue 2017-04-10 2017-04-15 Health
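To make the example reproducible, here is a minimal sketch that rebuilds the sample table above as a DataFrame. Parsing the date columns with pd.to_datetime is an assumption; the question does not say what dtype they have.

```python
import pandas as pd

# Sample records copied from the table above.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4],
    "color": ["red", "green", "blue", "red", "red", "green", "blue",
              "red", "red", "green", "blue"],
    "begin_date": ["2017-01-01", "2017-01-05", "2017-01-07", "2017-01-11",
                   "2017-02-22", "2017-02-26", "2017-02-26", "2017-03-11",
                   "2017-04-01", "2017-04-05", "2017-04-10"],
    "end_date": ["2017-01-07", "2017-01-07", "2017-01-15", "2017-01-22",
                 "2017-02-26", "2017-02-28", "2017-02-28", "2017-03-22",
                 "2017-04-07", "2017-04-07", "2017-04-15"],
    "location": ["initial", "nursing", "rehab", "Health",
                 "initial", "nursing", "rehab", "initial",
                 "initial", "nursing", "Health"],
})

# Assumption: the dates should be real datetimes rather than strings.
df["begin_date"] = pd.to_datetime(df["begin_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

print(df)
```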
The final result, which I am appending to a different df:
ID first_site
1 rehab
2 nursing
3 Home
4 Health
My approach is to write a function with these conditions, then use apply() to run it on each group.
def conditions(x):
    if x['begin_date'].isin(x['end_date'].iloc[[0]]).any():
        return x['location']
    elif df[df['Health']] == True:
        return 'Health'
    else:
        return 'Home'

final = pd.DataFrame()
final['first'] = df.groupby('ID').apply(lambda x: conditions(x))
I am getting an error:
TypeError: incompatible index of inserted column with frame index
I think you need:
def conditions(x):
    # compare each begin_date in the group with the first row's end_date
    val = x.loc[x['begin_date'] == x['end_date'].iloc[0], 'location']
    # if there is at least one match (the Series is not empty), return the first value
    if not val.empty:
        return val.iloc[0]
    # otherwise check whether any location is 'Health'
    elif (x['location'] == 'Health').any():
        return 'Health'
    else:
        return 'Home'

final = df.groupby('ID').apply(conditions).reset_index(name='first_site')
print(final)
ID first_site
0 1 rehab
1 2 nursing
2 3 Home
3 4 Health
If you need a new column, remove reset_index and add map, or use the solution from the comments (thank you, @Oriol Mirosa):
final = df.groupby('ID').apply(conditions)
df['first_site'] = df['ID'].map(final)
print (df)
ID color begin_date end_date location first_site
0 1 red 2017-01-01 2017-01-07 initial rehab
1 1 green 2017-01-05 2017-01-07 nursing rehab
2 1 blue 2017-01-07 2017-01-15 rehab rehab
3 1 red 2017-01-11 2017-01-22 Health rehab
4 2 red 2017-02-22 2017-02-26 initial nursing
5 2 green 2017-02-26 2017-02-28 nursing nursing
6 2 blue 2017-02-26 2017-02-28 rehab nursing
7 3 red 2017-03-11 2017-03-22 initial Home
8 4 red 2017-04-01 2017-04-07 initial Health
9 4 green 2017-04-05 2017-04-07 nursing Health
10 4 blue 2017-04-10 2017-04-15 Health Health
apply is obviously slow; if performance is important, use:
# first, keep the rows whose begin_date equals the first end_date of their group
end = df.groupby('ID')['end_date'].transform('first')
df1 = df[(df['begin_date'] == end)]
# keep the Health rows
df2 = df[(df['location'] == 'Health')]
# concatenate the filtered frames and drop duplicate IDs, then reindex by all
# ID values to add the missing IDs with 'Home'
df3 = (pd.concat([df1, df2])
         .drop_duplicates('ID')
         .set_index('ID')['location']
         .reindex(df['ID'].unique(), fill_value='Home')
         .reset_index(name='first_site'))
print(df3)
ID first_site
0 1 rehab
1 2 nursing
2 3 Home
3 4 Health
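The vectorized version relies on ordering: the date matches are concatenated before the Health rows, and drop_duplicates('ID') keeps the first row per ID, so a date match takes priority over 'Health' (ID 1 has both). Here is a rough, self-contained sketch that checks this against the conditions function on the sample data. The names rows, expected, date_matches, health_rows and vectorized are only illustrative, and a plain loop over the groups stands in for apply as the reference.

```python
import pandas as pd

# Sample data from the question (the color column is not needed here).
rows = [
    (1, "2017-01-01", "2017-01-07", "initial"),
    (1, "2017-01-05", "2017-01-07", "nursing"),
    (1, "2017-01-07", "2017-01-15", "rehab"),
    (1, "2017-01-11", "2017-01-22", "Health"),
    (2, "2017-02-22", "2017-02-26", "initial"),
    (2, "2017-02-26", "2017-02-28", "nursing"),
    (2, "2017-02-26", "2017-02-28", "rehab"),
    (3, "2017-03-11", "2017-03-22", "initial"),
    (4, "2017-04-01", "2017-04-07", "initial"),
    (4, "2017-04-05", "2017-04-07", "nursing"),
    (4, "2017-04-10", "2017-04-15", "Health"),
]
df = pd.DataFrame(rows, columns=["ID", "begin_date", "end_date", "location"])
df["begin_date"] = pd.to_datetime(df["begin_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

def conditions(x):
    # Rule 1: a row whose begin_date equals the group's first end_date.
    val = x.loc[x["begin_date"] == x["end_date"].iloc[0], "location"]
    if not val.empty:
        return val.iloc[0]
    # Rule 2: any 'Health' row in the group.
    if (x["location"] == "Health").any():
        return "Health"
    # Rule 3: fallback.
    return "Home"

# Reference result: run the per-group function in a plain loop.
expected = {key: conditions(group) for key, group in df.groupby("ID")}

# Vectorized result: date matches first, Health rows second.
end = df.groupby("ID")["end_date"].transform("first")
date_matches = df[df["begin_date"] == end]
health_rows = df[df["location"] == "Health"]
vectorized = (pd.concat([date_matches, health_rows])
                .drop_duplicates("ID")
                .set_index("ID")["location"]
                .reindex(df["ID"].unique(), fill_value="Home"))

print(vectorized.to_dict() == expected)
```

Swapping the concat order to [health_rows, date_matches] would make 'Health' win for ID 1, which is not what rule (1) asks for.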