How to create a function and apply it to each row in pandas?
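I have a complicated function that I am having trouble writing. Basically, I have a df that stores medical records, and I need to identify the first site a person goes to after their discharge date (I had hoped it would be as simple as picking the first location after the initial stay, but it is not). The df is grouped by ID.
There are 3 options:
(1) Within a group, if any row's begin_date matches the first row's end_date, return that row's location as the first site (if two rows meet this condition, either one is correct).
(2) If the first option does not apply and the group has a row whose location is 'Health', return 'Health'.
(3) Otherwise, if neither condition 1 nor condition 2 holds, return 'Home'.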
df:
ID  color  begin_date  end_date    location
1   red    2017-01-01  2017-01-07  initial
1   green  2017-01-05  2017-01-07  nursing
1   blue   2017-01-07  2017-01-15  rehab
1   red    2017-01-11  2017-01-22  Health
2   red    2017-02-22  2017-02-26  initial
2   green  2017-02-26  2017-02-28  nursing
2   blue   2017-02-26  2017-02-28  rehab
3   red    2017-03-11  2017-03-22  initial
4   red    2017-04-01  2017-04-07  initial
4   green  2017-04-05  2017-04-07  nursing
4   blue   2017-04-10  2017-04-15  Health
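For anyone who wants to reproduce this, the table above could be built roughly as follows. This is only a sketch: the question does not say how the dates are stored, so parsing them with pd.to_datetime is an assumption (plain ISO-formatted strings would also compare correctly for equality).

```python
import pandas as pd

# Rows copied from the sample table in the question.
rows = [
    (1, "red",   "2017-01-01", "2017-01-07", "initial"),
    (1, "green", "2017-01-05", "2017-01-07", "nursing"),
    (1, "blue",  "2017-01-07", "2017-01-15", "rehab"),
    (1, "red",   "2017-01-11", "2017-01-22", "Health"),
    (2, "red",   "2017-02-22", "2017-02-26", "initial"),
    (2, "green", "2017-02-26", "2017-02-28", "nursing"),
    (2, "blue",  "2017-02-26", "2017-02-28", "rehab"),
    (3, "red",   "2017-03-11", "2017-03-22", "initial"),
    (4, "red",   "2017-04-01", "2017-04-07", "initial"),
    (4, "green", "2017-04-05", "2017-04-07", "nursing"),
    (4, "blue",  "2017-04-10", "2017-04-15", "Health"),
]
df = pd.DataFrame(
    rows, columns=["ID", "color", "begin_date", "end_date", "location"]
)

# Assumption: the real data holds dates, so convert the strings to datetime64.
df["begin_date"] = pd.to_datetime(df["begin_date"])
df["end_date"] = pd.to_datetime(df["end_date"])

print(df)
```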
The final result, which I append to another df:
ID  first_site
1   rehab
2   nursing
3   Home
4   Health
My approach was to write a function with these conditions and then use apply() to iterate over each row.
def conditions(x):
    if x['begin_date'].isin(x['end_date'].iloc[[0]]).any():
        return x['location']
    elif df[df['Health']] == True:
        return 'Health'
    else:
        return 'Home'

final = pd.DataFrame()
final['first'] = df.groupby('ID').apply(lambda x: conditions(x))
I get an error:
TypeError: incompatible index of inserted column with frame index
I think you need:
def conditions(x):
    # locations of rows whose begin_date equals the group's first end_date
    val = x.loc[x['begin_date'] == x['end_date'].iloc[0], 'location']
    # if at least one row matched (the Series is not empty), take the first value
    if not val.empty:
        return val.iloc[0]
    # otherwise check whether any location in the group is 'Health'
    elif (x['location'] == 'Health').any():
        return 'Health'
    else:
        return 'Home'

final = df.groupby('ID').apply(conditions).reset_index(name='first_site')
print(final)
   ID  first_site
0   1       rehab
1   2     nursing
2   3        Home
3   4      Health
If you need a new column in df instead, drop reset_index and add map, or use the solution from the comments (thanks, @Oriol Mirosa):
final = df.groupby('ID').apply(conditions)
df['first_site'] = df['ID'].map(final)
print (df)
    ID  color  begin_date    end_date  location  first_site
 0   1    red  2017-01-01  2017-01-07   initial       rehab
 1   1  green  2017-01-05  2017-01-07   nursing       rehab
 2   1   blue  2017-01-07  2017-01-15     rehab       rehab
 3   1    red  2017-01-11  2017-01-22    Health       rehab
 4   2    red  2017-02-22  2017-02-26   initial     nursing
 5   2  green  2017-02-26  2017-02-28   nursing     nursing
 6   2   blue  2017-02-26  2017-02-28     rehab     nursing
 7   3    red  2017-03-11  2017-03-22   initial        Home
 8   4    red  2017-04-01  2017-04-07   initial      Health
 9   4  green  2017-04-05  2017-04-07   nursing      Health
10   4   blue  2017-04-10  2017-04-15    Health      Health
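As a small illustration of why map works for this step: Series.map accepts another Series and looks each value up in that Series' index, so a per-ID result is broadcast back onto every row with that ID. The names per_id and records below are hypothetical stand-ins (per_id is shaped like the result of the groupby call above, with ID as the index); they are not the real data.

```python
import pandas as pd

# Hypothetical per-ID result: one value per ID, with ID as the index.
per_id = pd.Series(
    ["rehab", "nursing", "Home", "Health"],
    index=pd.Index([1, 2, 3, 4], name="ID"),
    name="first_site",
)

# Hypothetical row-level frame with repeated IDs.
records = pd.DataFrame({"ID": [1, 1, 2, 3, 4, 4]})

# Each ID value is looked up in per_id's index; repeated IDs get the same value.
records["first_site"] = records["ID"].map(per_id)
print(records)
```

An ID that is missing from the index of the mapped Series would come back as NaN, so it is worth checking for missing values after the map if the two frames might not cover the same IDs.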
apply is slow because it calls the Python function once per group; if performance matters, use:
# end_date of the first row of each group, broadcast to every row of that group
end = df.groupby('ID')['end_date'].transform('first')
# rows whose begin_date matches that first end_date
df1 = df[df['begin_date'] == end]
# rows whose location is 'Health'
df2 = df[df['location'] == 'Health']
# stack date matches ahead of Health rows, keep the first row per ID, then
# reindex by all ID values so IDs with no match are added with 'Home'
df3 = (pd.concat([df1, df2])
         .drop_duplicates('ID')
         .set_index('ID')['location']
         .reindex(df['ID'].unique(), fill_value='Home')
         .reset_index(name='first_site'))
print(df3)
   ID  first_site
0   1       rehab
1   2     nursing
2   3        Home
3   4      Health
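One way to gain some confidence that the vectorized version follows the same three rules is to compare it against the per-group function on the sample data. The sketch below is self-contained and makes a few assumptions: dates are kept as ISO strings (enough for equality checks), the color column is omitted because neither approach uses it, and the groups are iterated with a plain loop instead of groupby.apply so the comparison does not depend on version-specific apply behavior. The names slow and fast are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4],
    "begin_date": ["2017-01-01", "2017-01-05", "2017-01-07", "2017-01-11",
                   "2017-02-22", "2017-02-26", "2017-02-26", "2017-03-11",
                   "2017-04-01", "2017-04-05", "2017-04-10"],
    "end_date": ["2017-01-07", "2017-01-07", "2017-01-15", "2017-01-22",
                 "2017-02-26", "2017-02-28", "2017-02-28", "2017-03-22",
                 "2017-04-07", "2017-04-07", "2017-04-15"],
    "location": ["initial", "nursing", "rehab", "Health",
                 "initial", "nursing", "rehab", "initial",
                 "initial", "nursing", "Health"],
})

def conditions(group):
    # Rule 1: a row whose begin_date equals the group's first end_date.
    match = group.loc[group["begin_date"] == group["end_date"].iloc[0], "location"]
    if not match.empty:
        return match.iloc[0]
    # Rule 2: any 'Health' row in the group.
    if (group["location"] == "Health").any():
        return "Health"
    # Rule 3: fallback.
    return "Home"

# Per-group evaluation, one Python call per ID.
slow = pd.Series({key: conditions(group) for key, group in df.groupby("ID")})

# Vectorized evaluation. Date matches are concatenated before Health rows,
# so drop_duplicates keeps a date match whenever one exists for an ID.
end = df.groupby("ID")["end_date"].transform("first")
fast = (pd.concat([df[df["begin_date"] == end], df[df["location"] == "Health"]])
          .drop_duplicates("ID")
          .set_index("ID")["location"]
          .reindex(df["ID"].unique(), fill_value="Home"))

print(slow.tolist() == fast.tolist())
```

The order of the two frames passed to pd.concat is what encodes the priority of rule 1 over rule 2; swapping them would make 'Health' win for ID 1.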