How to create a function and apply it to each row in pandas?
I have a somewhat complex function that I am having difficulty writing. Essentially, I have a df that stores medical records, and I need to identify the first site that a person goes to after their discharge date (I wish it were as simple as choosing the first location after the initial stay, but it's not). The df is grouped by ID.
There are 3 options:
(1) Within a group, if any row has a begin_date that matches the first row's end_date, return that row's location as the first site (if two rows meet this condition, either is correct).
(2) If the first option does not apply and the patient has any row with location 'Health', return 'Health'.
(3) Otherwise, if neither condition 1 nor 2 applies, return 'Home'.
df
ID color begin_date end_date location
1 red 2017-01-01 2017-01-07 initial
1 green 2017-01-05 2017-01-07 nursing
1 blue 2017-01-07 2017-01-15 rehab
1 red 2017-01-11 2017-01-22 Health
2 red 2017-02-22 2017-02-26 initial
2 green 2017-02-26 2017-02-28 nursing
2 blue 2017-02-26 2017-02-28 rehab
3 red 2017-03-11 2017-03-22 initial
4 red 2017-04-01 2017-04-07 initial
4 green 2017-04-05 2017-04-07 nursing
4 blue 2017-04-10 2017-04-15 Health
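To reproduce the problem, the sample frame above could be built roughly as follows. This is only a sketch: parsing the dates with pd.to_datetime is an assumption, since the question does not say which dtype the date columns have.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4],
    'color': ['red', 'green', 'blue', 'red', 'red', 'green', 'blue',
              'red', 'red', 'green', 'blue'],
    'begin_date': ['2017-01-01', '2017-01-05', '2017-01-07', '2017-01-11',
                   '2017-02-22', '2017-02-26', '2017-02-26', '2017-03-11',
                   '2017-04-01', '2017-04-05', '2017-04-10'],
    'end_date': ['2017-01-07', '2017-01-07', '2017-01-15', '2017-01-22',
                 '2017-02-26', '2017-02-28', '2017-02-28', '2017-03-22',
                 '2017-04-07', '2017-04-07', '2017-04-15'],
    'location': ['initial', 'nursing', 'rehab', 'Health', 'initial', 'nursing',
                 'rehab', 'initial', 'initial', 'nursing', 'Health'],
})

# Assumption: the date columns are real datetimes rather than strings.
df['begin_date'] = pd.to_datetime(df['begin_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

print(df)
```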
Final result, which I am appending to a different df:
ID first_site
1 rehab
2 nursing
3 Home
4 Health
My approach is to write a function with these conditions, then use apply() to iterate over each row.
def conditions(x):
    if x['begin_date'].isin(x['end_date'].iloc[[0]]).any():
        return x['location']
    elif df[df['Health']] == True:
        return 'Health'
    else:
        return 'Home'

final = pd.DataFrame()
final['first'] = df.groupby('ID').apply(lambda x: conditions(x))
I am getting an error:
TypeError: incompatible index of inserted column with frame index
I think you need:
def conditions(x):
    # rows in the group whose begin_date equals the group's first end_date
    val = x.loc[x['begin_date'] == x['end_date'].iloc[0], 'location']
    # at least one match (non-empty Series): return the first matching location
    if not val.empty:
        return val.iloc[0]
    # otherwise check whether any row in the group has location 'Health'
    elif (x['location'] == 'Health').any():
        return 'Health'
    else:
        return 'Home'

final = df.groupby('ID').apply(conditions).reset_index(name='first_site')
print(final)
ID first_site
0 1 rehab
1 2 nursing
2 3 Home
3 4 Health
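For reference, here is a self-contained sketch of the same approach on a trimmed copy of the sample data. Two things in it are my assumptions rather than part of the question: the date columns are parsed with pd.to_datetime, and the needed columns are selected before apply so the result does not depend on how a given pandas version treats the grouping column inside apply.

```python
import pandas as pd

# Trimmed copy of the sample data (the color column is not needed here).
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4],
    'begin_date': pd.to_datetime([
        '2017-01-01', '2017-01-05', '2017-01-07', '2017-01-11',
        '2017-02-22', '2017-02-26', '2017-02-26', '2017-03-11',
        '2017-04-01', '2017-04-05', '2017-04-10']),
    'end_date': pd.to_datetime([
        '2017-01-07', '2017-01-07', '2017-01-15', '2017-01-22',
        '2017-02-26', '2017-02-28', '2017-02-28', '2017-03-22',
        '2017-04-07', '2017-04-07', '2017-04-15']),
    'location': ['initial', 'nursing', 'rehab', 'Health', 'initial', 'nursing',
                 'rehab', 'initial', 'initial', 'nursing', 'Health'],
})


def conditions(x):
    # Locations of rows whose begin_date equals the group's first end_date.
    val = x.loc[x['begin_date'] == x['end_date'].iloc[0], 'location']
    if not val.empty:
        return val.iloc[0]
    if (x['location'] == 'Health').any():
        return 'Health'
    return 'Home'


# Selecting the columns explicitly is an addition, not part of the original code.
cols = ['begin_date', 'end_date', 'location']
final = df.groupby('ID')[cols].apply(conditions).reset_index(name='first_site')
print(final)
```

Because conditions returns a scalar for every group, apply yields one value per ID, which is what lets reset_index turn it into a two-column frame.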
If you need a new column in the original df instead, remove reset_index and add map, or use the solution from the comments (thank you, @Oriol Mirosa):
final = df.groupby('ID').apply(conditions)
df['first_site'] = df['ID'].map(final)
print (df)
ID color begin_date end_date location first_site
0 1 red 2017-01-01 2017-01-07 initial rehab
1 1 green 2017-01-05 2017-01-07 nursing rehab
2 1 blue 2017-01-07 2017-01-15 rehab rehab
3 1 red 2017-01-11 2017-01-22 Health rehab
4 2 red 2017-02-22 2017-02-26 initial nursing
5 2 green 2017-02-26 2017-02-28 nursing nursing
6 2 blue 2017-02-26 2017-02-28 rehab nursing
7 3 red 2017-03-11 2017-03-22 initial Home
8 4 red 2017-04-01 2017-04-07 initial Health
9 4 green 2017-04-05 2017-04-07 nursing Health
10 4 blue 2017-04-10 2017-04-15 Health Health
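The map step works because the Series returned by the groupby is indexed by ID, so each row's ID is looked up in that index. A tiny sketch with a hand-built lookup Series (the values here are illustrative stand-ins, not computed from the data):

```python
import pandas as pd

# Small frame with repeated IDs, standing in for the full df.
df = pd.DataFrame({'ID': [1, 1, 2, 3]})

# Hypothetical per-ID result, indexed by ID like the groupby output.
final = pd.Series(['rehab', 'nursing', 'Home'],
                  index=pd.Index([1, 2, 3], name='ID'))

# map looks up each row's ID in final's index and broadcasts the value.
df['first_site'] = df['ID'].map(final)
print(df)
```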
apply is obviously slow; if performance is important, use:
# first keep rows whose begin_date equals the first end_date of their group
end = df.groupby('ID')['end_date'].transform('first')
df1 = df[df['begin_date'] == end]
# filter Health rows
df2 = df[df['location'] == 'Health']
# concatenate both (date matches come first, so they take priority),
# keep one row per ID, then reindex by all ID values so that IDs
# with no matching row are filled with 'Home'
df3 = (pd.concat([df1, df2])
         .drop_duplicates('ID')
         .set_index('ID')['location']
         .reindex(df['ID'].unique(), fill_value='Home')
         .reset_index(name='first_site'))
print(df3)
ID first_site
0 1 rehab
1 2 nursing
2 3 Home
3 4 Health
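One way to gain confidence in the vectorized version is to check that it agrees with the apply version on the sample data. The sketch below does that; the names slow and fast, the to_datetime parsing, and the explicit column selection before apply are my assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4],
    'begin_date': pd.to_datetime([
        '2017-01-01', '2017-01-05', '2017-01-07', '2017-01-11',
        '2017-02-22', '2017-02-26', '2017-02-26', '2017-03-11',
        '2017-04-01', '2017-04-05', '2017-04-10']),
    'end_date': pd.to_datetime([
        '2017-01-07', '2017-01-07', '2017-01-15', '2017-01-22',
        '2017-02-26', '2017-02-28', '2017-02-28', '2017-03-22',
        '2017-04-07', '2017-04-07', '2017-04-15']),
    'location': ['initial', 'nursing', 'rehab', 'Health', 'initial', 'nursing',
                 'rehab', 'initial', 'initial', 'nursing', 'Health'],
})


def conditions(x):
    # Per-group logic: date match first, then 'Health', then 'Home'.
    val = x.loc[x['begin_date'] == x['end_date'].iloc[0], 'location']
    if not val.empty:
        return val.iloc[0]
    if (x['location'] == 'Health').any():
        return 'Health'
    return 'Home'


# apply-based result, one value per ID
cols = ['begin_date', 'end_date', 'location']
slow = df.groupby('ID')[cols].apply(conditions)

# vectorized result: date matches are concatenated before Health rows,
# so drop_duplicates keeps the date match when both exist for an ID
end = df.groupby('ID')['end_date'].transform('first')
fast = (pd.concat([df[df['begin_date'] == end],
                   df[df['location'] == 'Health']])
          .drop_duplicates('ID')
          .set_index('ID')['location']
          .reindex(df['ID'].unique(), fill_value='Home'))

# True when both approaches produce the same sites in the same ID order
print(slow.tolist() == fast.tolist())
```

The order of the two frames passed to concat matters: swapping them would make 'Health' win over a date match, which would contradict condition (1).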