How to use groupby and filter a DataFrame to create a new column

Question

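Suppose I have a dataset containing time series of heart rates for patients staying in the ICU.

I want to add some inclusion criteria: for example, I only want to consider ICU stays in which the patient's heart rate was ≥ 90 for at least one hour. If the heart rate of the first measurement after that hour (counted from the first ≥ 90 value) is unknown, we assume it was above 90 and include the ICU stay.

The entries of that ICU stay should be included starting from the first measurement that belongs to the "at least 1 hour" time span.

Note that once an ICU stay is included, it is never excluded again, even if the heart rate drops back below 90 at some point.

So the data frame below, where "Icustay" is the unique ID of a stay in the ICU and "Hours" is the time spent in the ICU since admission: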

   Heart Rate  Hours  Icustay  Inclusion Criteria
0          79    0.0     1001                   0
1          91    1.5     1001                   0
2         NaN    2.7     1001                   0
3          85    3.4     1001                   0
4          90    0.0     2010                   0
5          94   29.4     2010                   0
6          68    0.0     3005                   0

should become

   Heart Rate  Hours  Icustay  Inclusion Criteria
0          79    0.0     1001                   0
1          91    1.5     1001                   1
2         NaN    2.7     1001                   1
3          85    3.4     1001                   1
4          90    0.0     2010                   1
5          94   29.4     2010                   1
6          68    0.0     3005                   0

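I have written code for this, and it works. However, it is slow: it can take several seconds per patient when processing my whole dataset (my real dataset has more than 6 fields, but I simplified it here for readability). Since there are 40,000 patients, I would like to speed it up.

Here is the code I am currently using, together with the toy dataset shown above.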

import numpy as np
import pandas as pd

d = {'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005], 'Hours': [0, 1.5, 2.7, 3.4, 0, 29.4, 0],
     'Heart Rate': [79, 91, np.nan, 85, 90, 94, 68], 'Inclusion Criteria':[0, 0, 0, 0, 0, 0, 0]}
all_records = pd.DataFrame(data=d)


for curr in np.unique(all_records['Icustay']):
    print(curr)
    # Rows belonging to the current ICU stay
    curr_stay = all_records[all_records['Icustay']==curr]
    indexes = curr_stay['Hours'].index
    heart_rate_flag = False
    heart_rate_begin_time = 0
    heart_rate_begin_index = 0
    for i in indexes:
        if(curr_stay['Heart Rate'][i] >= 90 and not heart_rate_flag):
            # Start of a run with heart rate >= 90
            heart_rate_flag = True
            heart_rate_begin_time = curr_stay['Hours'][i]
            heart_rate_begin_index = i
        elif(curr_stay['Heart Rate'][i] < 90):
            # The run is interrupted
            heart_rate_flag = False
        elif(heart_rate_flag and curr_stay['Hours'][i]-heart_rate_begin_time >= 1.0):
            # The run has lasted at least one hour: include every row of this stay from the start of the run.
            # .loc with row labels avoids chained assignment.
            all_records.loc[indexes[indexes>=heart_rate_begin_index], 'Inclusion Criteria'] = 1
            break

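Note that the dataset is sorted by patient and by hour.

Is there a way to speed this up? I have considered built-in functions such as groupby, but I am not sure they would help in this particular case.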

Answer 1

You can use the groupby and apply functions in pandas. This should also be faster.

## fill missing values
all_records['Heart Rate'] = all_records['Heart Rate'].fillna(90)

## use apply
all_records['Inclusion Criteria'] = all_records.groupby('Icustay').apply(lambda x: (x['Heart Rate'].ge(90)) & (x['Hours'].ge(0))).values.astype(int)

print(all_records)

   Heart Rate  Hours  Icustay  Inclusion Criteria
0        79.0    0.0     1001                   0
1        91.0    1.5     1001                   1
2        97.0    2.7     1001                   1
3        90.0    3.4     1001                   1
4        90.0    0.0     2010                   1
5        94.0   29.4     2010                   1
6        68.0    0.0     3005                   0
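The mask above flags each row independently, whereas the question also asks that a stay, once included, is never excluded again. One possible way to sketch that "sticky" behaviour is a per-stay cumulative maximum. This is only an illustration under two assumptions: the rows are sorted by Icustay and Hours, and the one-hour duration requirement is deliberately ignored. The names `df`, `row_flag` and `sticky_flag` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data from the question (the 'Inclusion Criteria' column is omitted).
df = pd.DataFrame({
    'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005],
    'Hours': [0, 1.5, 2.7, 3.4, 0, 29.4, 0],
    'Heart Rate': [79, 91, np.nan, 85, 90, 94, 68],
})

# Row-wise flag: 1 where the heart rate is at least 90 (NaN compares as False).
row_flag = df['Heart Rate'].ge(90).astype(int)

# Within each stay, cummax keeps the flag at 1 once it has been set.
# Assumption: this ignores the "at least one hour" part of the rule.
sticky_flag = row_flag.groupby(df['Icustay']).cummax()

print(sticky_flag.tolist())
```

Because `cummax` is a built-in grouped operation, this avoids calling a Python function once per stay.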

Answer 2

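This looks a bit ugly, but it avoids loops and apply (which is really just a loop under the hood). I have not tested it on a large dataset, but I suspect it will be much faster than your current code.

First, create some additional columns holding details of the next/previous row, since these may be relevant to some of your conditions: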

all_records['PrevHeartRate'] = all_records['Heart Rate'].shift()
all_records['NextHours'] = all_records['Hours'].shift(-1)
all_records['PrevICU'] = all_records['Icustay'].shift()
all_records['NextICU'] = all_records['Icustay'].shift(-1)
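The PrevICU and NextICU columns are there so that values shifted in from a neighbouring stay can be filtered out later. As a possible alternative, shifting inside each group leaves NaN at the stay boundaries directly. The sketch below uses the question's toy data; `df`, `prev_heart_rate` and `next_hours` are illustrative names, not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005],
    'Hours': [0, 1.5, 2.7, 3.4, 0, 29.4, 0],
    'Heart Rate': [79, 91, np.nan, 85, 90, 94, 68],
})

grouped = df.groupby('Icustay')

# Previous heart rate within the same stay; the first row of each stay gets NaN.
prev_heart_rate = grouped['Heart Rate'].shift()

# Next timestamp within the same stay; the last row of each stay gets NaN.
next_hours = grouped['Hours'].shift(-1)

print(pd.DataFrame({'prev_hr': prev_heart_rate, 'next_hours': next_hours}))
```

Comparisons such as `next_hours >= 1` then evaluate to False at stay boundaries, so the explicit ICU-equality checks would no longer be needed in this variant.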

Next, create a DataFrame containing the first qualifying record for each id (this is rather messy because of the amount of logic involved):

first_per_id = (all_records[((all_records['Heart Rate'] >= 90) |
                            ((all_records['Heart Rate'].isnull()) & 
                            (all_records['PrevHeartRate'] >= 90) &
                            (all_records['Icustay'] == all_records['PrevICU']))) &
                            ((all_records['Hours'] >= 1) |
                            ((all_records['NextHours'] >= 1) &
                            (all_records['NextICU'] == all_records['Icustay'])))]
                .drop_duplicates(subset='Icustay', keep='first')[['Icustay']]
                .reset_index()
                .rename(columns={'index': 'first_index'}))

This gives us:

   first_index  Icustay
0            1     1001
1            4     2010

You can now drop all the new columns from the original DataFrame:

all_records.drop(['PrevHeartRate', 'NextHours', 'PrevICU', 'NextICU'], axis=1, inplace=True)

We can then merge this with the original DataFrame:

new = pd.merge(all_records, first_per_id, how='left', on='Icustay')

Giving:

   Heart Rate  Hours  Icustay  Inclusion Criteria  first_index
0        79.0    0.0     1001                   0          1.0
1        91.0    1.5     1001                   0          1.0
2        97.0    2.7     1001                   0          1.0
3         NaN    3.4     1001                   0          1.0
4        90.0    0.0     2010                   0          4.0
5        94.0   29.4     2010                   0          4.0
6        68.0    0.0     3005                   0          NaN

From here we can compare 'first_index' (the first qualifying index for that id) with the actual index:

new['Inclusion Criteria'] = new.index >= new['first_index']

This gives:

   Heart Rate  Hours  Icustay  Inclusion Criteria  first_index
0        79.0    0.0     1001               False          1.0
1        91.0    1.5     1001                True          1.0
2        97.0    2.7     1001                True          1.0
3         NaN    3.4     1001                True          1.0
4        90.0    0.0     2010                True          4.0
5        94.0   29.4     2010                True          4.0
6        68.0    0.0     3005               False          NaN
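The merge-and-compare step can also be written as a lookup. The sketch below is an assumption-laden illustration, not the answer's method: `first_lookup` is a hypothetical Series that mirrors the first_per_id table (stay id to first qualifying row index), and the DataFrame is assumed to have a default 0..n-1 index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005]})

# Hypothetical lookup: stay id -> index of its first qualifying row.
first_lookup = pd.Series({1001: 1, 2010: 4})

# map() yields NaN for stays that have no qualifying row (3005 here).
first_index = df['Icustay'].map(first_lookup)

# A comparison against NaN is False, so such stays are never flagged.
included = (df.index.to_numpy() >= first_index.to_numpy()).astype(int)

print(included.tolist())
```

This skips the extra first_index column and the later drop, at the cost of relying on positional row indices.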

From here, we just need to tidy up (convert the result column to integers and drop the first_index column):

new.drop('first_index', axis=1, inplace=True)
new['Inclusion Criteria'] = new['Inclusion Criteria'].astype(int)

Giving the final expected result:

   Heart Rate  Hours  Icustay  Inclusion Criteria
0        79.0    0.0     1001                   0
1        91.0    1.5     1001                   1
2        97.0    2.7     1001                   1
3         NaN    3.4     1001                   1
4        90.0    0.0     2010                   1
5        94.0   29.4     2010                   1
6        68.0    0.0     3005                   0
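For reference, the steps of this answer can be gathered into one function. The following is a minimal sketch, assuming the input is sorted by Icustay and Hours; the function name `add_inclusion_flag` and the intermediate variable names are invented for the example. It reuses the answer's conditions as written (including `Hours >= 1` measured from admission) and has only been checked against the question's toy data.

```python
import numpy as np
import pandas as pd


def add_inclusion_flag(records):
    """Return a copy of `records` with 'Inclusion Criteria' recomputed."""
    df = records.reset_index(drop=True)  # new frame with a 0..n-1 index
    hr, hours, stay = df['Heart Rate'], df['Hours'], df['Icustay']

    # True where the previous / next row belongs to the same ICU stay.
    same_as_prev = stay.eq(stay.shift())
    same_as_next = stay.eq(stay.shift(-1))

    # Heart rate >= 90, or unknown right after a >= 90 value in the same stay.
    high_hr = hr.ge(90) | (hr.isna() & hr.shift().ge(90) & same_as_prev)
    # This row, or the next row of the same stay, is at least 1 hour in.
    long_enough = hours.ge(1) | (hours.shift(-1).ge(1) & same_as_next)
    qualifying = high_hr & long_enough

    # Smallest qualifying row index per stay, then broadcast back to all rows.
    first_per_stay = df.index.to_series()[qualifying].groupby(stay[qualifying]).min()
    first_index = stay.map(first_per_stay)  # NaN if a stay never qualifies

    df['Inclusion Criteria'] = (
        df.index.to_numpy() >= first_index.to_numpy()
    ).astype(int)
    return df


all_records = pd.DataFrame({
    'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005],
    'Hours': [0, 1.5, 2.7, 3.4, 0, 29.4, 0],
    'Heart Rate': [79, 91, np.nan, 85, 90, 94, 68],
    'Inclusion Criteria': [0, 0, 0, 0, 0, 0, 0],
})
result = add_inclusion_flag(all_records)
print(result)
```

The function leaves the input frame untouched and creates no temporary columns, so nothing has to be dropped afterwards.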


2 answers

Answer 1
1 2018-05-04 15:27:26

Answer 2
1 accepted 2018-05-04 15:37:21
