How to create a function and apply it to each row in pandas?

Question

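I have a complicated function that I am having trouble writing. Basically, I have a df that stores medical records, and I need to identify the first site a person goes to after their discharge date (I hoped it would be as simple as picking the first location after the initial stay, but it is not). The df is grouped by ID.

There are 3 options: (1) Within a group, if any row has a begin_date that matches the first row's end_date, return that row's location as the first site (if two rows meet this condition, either is correct). (2) If the first option does not apply and the patient has any row with location 'Health', return 'Health'. (3) Otherwise, if neither condition 1 nor 2 applies, return 'Home'.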

DF

ID    color  begin_date    end_date     location
1     red    2017-01-01    2017-01-07   initial
1     green  2017-01-05    2017-01-07   nursing
1     blue   2017-01-07    2017-01-15   rehab
1     red    2017-01-11    2017-01-22   Health
2     red    2017-02-22    2017-02-26   initial
2     green  2017-02-26    2017-02-28   nursing
2     blue   2017-02-26    2017-02-28   rehab
3     red    2017-03-11    2017-03-22   initial
4     red    2017-04-01    2017-04-07   initial
4     green  2017-04-05    2017-04-07   nursing
4     blue   2017-04-10    2017-04-15   Health
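
To reproduce the problem, the table above can be built as a DataFrame. This is only a sketch: it assumes the two date columns are meant to be datetime64 rather than strings, which the question does not state.

```python
import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 4],
    'color': ['red', 'green', 'blue', 'red', 'red', 'green', 'blue',
              'red', 'red', 'green', 'blue'],
    'begin_date': ['2017-01-01', '2017-01-05', '2017-01-07', '2017-01-11',
                   '2017-02-22', '2017-02-26', '2017-02-26', '2017-03-11',
                   '2017-04-01', '2017-04-05', '2017-04-10'],
    'end_date': ['2017-01-07', '2017-01-07', '2017-01-15', '2017-01-22',
                 '2017-02-26', '2017-02-28', '2017-02-28', '2017-03-22',
                 '2017-04-07', '2017-04-07', '2017-04-15'],
    'location': ['initial', 'nursing', 'rehab', 'Health', 'initial',
                 'nursing', 'rehab', 'initial', 'initial', 'nursing', 'Health'],
})

# Assumption: dates are compared as timestamps, so parse them explicitly
df['begin_date'] = pd.to_datetime(df['begin_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
```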

The final result, which I will append to another df:

ID    first_site
1     rehab
2     nursing
3     Home
4     Health

My approach was to write a function with these conditions and then use apply() to iterate over each row.

def conditions(x):
    if x['begin_date'].isin(x['end_date'].iloc[[0]]).any():
        return x['location'] 
    elif df[df['Health']] == True:
        return 'Health'
    else:
        return 'Home'

final = pd.DataFrame()
final['first'] = df.groupby('ID').apply(lambda x: conditions(x))

I get an error:

TypeError: incompatible index of inserted column with frame index

Answer 1

I think you need:

def conditions(x):
    # locations of rows whose begin_date equals the group's first end_date
    val = x.loc[x['begin_date'] == x['end_date'].iloc[0], 'location']
    # at least one match (non-empty Series): return the first matching location
    if not val.empty:
        return val.iloc[0]
    # no match: fall back to 'Health' if the group has such a row
    elif (x['location'] == 'Health').any():
        return 'Health'
    else:
        return 'Home'

final = df.groupby('ID').apply(conditions).reset_index(name='first_site')
print (final)
   ID first_site
0   1      rehab
1   2    nursing
2   3       Home
3   4     Health
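
For reference, here is a self-contained sketch of the same approach that can be pasted and run as is. It rebuilds the sample rows (without the unused color column) and, as an assumption about the reader's environment, selects the non-key columns before apply, because recent pandas releases warn when apply operates on the grouping column. The function body is the one from the answer.

```python
import pandas as pd

# Sample rows from the question: (ID, begin_date, end_date, location)
rows = [
    (1, '2017-01-01', '2017-01-07', 'initial'),
    (1, '2017-01-05', '2017-01-07', 'nursing'),
    (1, '2017-01-07', '2017-01-15', 'rehab'),
    (1, '2017-01-11', '2017-01-22', 'Health'),
    (2, '2017-02-22', '2017-02-26', 'initial'),
    (2, '2017-02-26', '2017-02-28', 'nursing'),
    (2, '2017-02-26', '2017-02-28', 'rehab'),
    (3, '2017-03-11', '2017-03-22', 'initial'),
    (4, '2017-04-01', '2017-04-07', 'initial'),
    (4, '2017-04-05', '2017-04-07', 'nursing'),
    (4, '2017-04-10', '2017-04-15', 'Health'),
]
df = pd.DataFrame(rows, columns=['ID', 'begin_date', 'end_date', 'location'])
df[['begin_date', 'end_date']] = df[['begin_date', 'end_date']].apply(pd.to_datetime)

def conditions(x):
    # rule 1: a row that begins on the group's first end_date
    val = x.loc[x['begin_date'] == x['end_date'].iloc[0], 'location']
    if not val.empty:
        return val.iloc[0]
    # rule 2: any 'Health' row in the group
    elif (x['location'] == 'Health').any():
        return 'Health'
    # rule 3: default
    return 'Home'

# Selecting columns explicitly keeps 'ID' out of the frame passed to conditions
cols = ['begin_date', 'end_date', 'location']
final = df.groupby('ID')[cols].apply(conditions).reset_index(name='first_site')
print(final)
```

The function returns one scalar per group, so apply yields a Series indexed by ID, which is why reset_index(name=...) produces the two-column frame.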

If you need a new column, remove reset_index and add map, or use the solution from the comments, thank you @Oriol Mirosa:

final = df.groupby('ID').apply(conditions)
df['first_site'] = df['ID'].map(final)
print (df)
    ID  color begin_date   end_date location first_site
0    1    red 2017-01-01 2017-01-07  initial      rehab
1    1  green 2017-01-05 2017-01-07  nursing      rehab
2    1   blue 2017-01-07 2017-01-15    rehab      rehab
3    1    red 2017-01-11 2017-01-22   Health      rehab
4    2    red 2017-02-22 2017-02-26  initial    nursing
5    2  green 2017-02-26 2017-02-28  nursing    nursing
6    2   blue 2017-02-26 2017-02-28    rehab    nursing
7    3    red 2017-03-11 2017-03-22  initial       Home
8    4    red 2017-04-01 2017-04-07  initial     Health
9    4  green 2017-04-05 2017-04-07  nursing     Health
10   4   blue 2017-04-10 2017-04-15   Health     Health

Apparently apply is slow, so if performance is important use:

# keep rows whose begin_date equals the first end_date of their group
end = df.groupby('ID')['end_date'].transform('first')
df1 = df[(df['begin_date'] == end)]

#filter Health rows
df2 = df[(df['location'] == 'Health')]
# concatenate the filtered frames (date matches first, so they win over
# Health rows), keep one row per ID, then reindex by all ID values so
# IDs with no row left get 'Home'
df3 = (pd.concat([df1, df2])
        .drop_duplicates('ID')
        .set_index('ID')['location']
        .reindex(df['ID'].unique(), fill_value='Home')
        .reset_index(name='first_site'))
print (df3)
   ID first_site
0   1      rehab
1   2    nursing
2   3       Home
3   4     Health
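
The order inside pd.concat is what encodes the priority of the rules: drop_duplicates keeps the first row per ID, so a date match beats a 'Health' row. A minimal sketch wrapping the steps above in a function makes this easy to check on ID 1, which has both a date match (rehab) and a Health row. The name first_site_vectorized is hypothetical, and the sample rows are rebuilt so the block runs on its own.

```python
import pandas as pd

def first_site_vectorized(df):
    # hypothetical helper name; the steps follow the vectorized answer above
    first_end = df.groupby('ID')['end_date'].transform('first')
    matches = df[df['begin_date'] == first_end]   # rule 1 candidates
    health = df[df['location'] == 'Health']       # rule 2 candidates
    return (pd.concat([matches, health])          # rule 1 rows come first
              .drop_duplicates('ID')              # keeps the first row per ID
              .set_index('ID')['location']
              .reindex(df['ID'].unique(), fill_value='Home'))  # rule 3

# Sample rows from the question: (ID, begin_date, end_date, location)
rows = [
    (1, '2017-01-01', '2017-01-07', 'initial'),
    (1, '2017-01-05', '2017-01-07', 'nursing'),
    (1, '2017-01-07', '2017-01-15', 'rehab'),
    (1, '2017-01-11', '2017-01-22', 'Health'),
    (2, '2017-02-22', '2017-02-26', 'initial'),
    (2, '2017-02-26', '2017-02-28', 'nursing'),
    (2, '2017-02-26', '2017-02-28', 'rehab'),
    (3, '2017-03-11', '2017-03-22', 'initial'),
    (4, '2017-04-01', '2017-04-07', 'initial'),
    (4, '2017-04-05', '2017-04-07', 'nursing'),
    (4, '2017-04-10', '2017-04-15', 'Health'),
]
df = pd.DataFrame(rows, columns=['ID', 'begin_date', 'end_date', 'location'])
df[['begin_date', 'end_date']] = df[['begin_date', 'end_date']].apply(pd.to_datetime)

result = first_site_vectorized(df)
print(result)
```

Swapping the two frames in pd.concat would make 'Health' win for ID 1, which is not what the question asks for.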
