How to use groupby and filter a DataFrame to create a new column

Say I have a dataset which contains time series of the heart rates of patients who stay in an ICU.

I would like to add some inclusion criteria. For example, I would only like to consider the ICU stays of patients whose heart rate was >= 90 for at least one hour. If the heart rate at the first measurement taken one hour or more after the first >= 90 value is unknown, we assume it is still above 90 and include that ICU stay.

The entries of that ICU stay should be included starting from the first measurement corresponding to the "at least 1 hour" timespan.

Note that once an ICU stay is included, it never gets excluded again, even if the heart rate drops back below 90 at some point.

So the dataframe below, where "Icustay" is the unique ID of a stay in the ICU and "Hours" is the time spent in the ICU since admission:

   Heart Rate  Hours  Icustay  Inclusion Criteria
0          79    0.0     1001                   0
1          91    1.5     1001                   0
2         NaN    2.7     1001                   0
3          85    3.4     1001                   0
4          90    0.0     2010                   0
5          94   29.4     2010                   0
6          68    0.0     3005                   0

Should become

   Heart Rate  Hours  Icustay  Inclusion Criteria
0          79    0.0     1001                   0
1          91    1.5     1001                   1
2         NaN    2.7     1001                   1
3          85    3.4     1001                   1
4          90    0.0     2010                   1
5          94   29.4     2010                   1
6          68    0.0     3005                   0

I have written code for this, and it works. However, it is pretty slow: it can take up to a few seconds per patient when working with my entire dataset (in reality my dataset contains more data than just these 6 fields, but I have simplified it for better readability). Since there are 40,000 patients, I would like to speed this up.

This is the code I'm currently using, along with the toy dataset I have presented above.

import numpy as np
import pandas as pd

d = {'Icustay': [1001, 1001, 1001, 1001, 2010, 2010, 3005], 'Hours': [0, 1.5, 2.7, 3.4, 0, 29.4, 0],
     'Heart Rate': [79, 91, np.nan, 85, 90, 94, 68], 'Inclusion Criteria': [0, 0, 0, 0, 0, 0, 0]}
all_records = pd.DataFrame(data=d)


for curr in np.unique(all_records['Icustay']):
    curr_stay = all_records[all_records['Icustay'] == curr]
    indexes = curr_stay['Hours'].index
    heart_rate_flag = False        # currently inside a >= 90 episode?
    heart_rate_begin_time = 0      # 'Hours' value at which the episode began
    heart_rate_begin_index = 0     # row index at which the episode began
    for i in indexes:
        if curr_stay['Heart Rate'][i] >= 90 and not heart_rate_flag:
            # a new >= 90 episode starts here
            heart_rate_flag = True
            heart_rate_begin_time = curr_stay['Hours'][i]
            heart_rate_begin_index = i
        elif curr_stay['Heart Rate'][i] < 90:
            # episode broken; note that NaN compares False here too, so an
            # unknown value falls through to the next branch, as intended
            heart_rate_flag = False
        elif heart_rate_flag and curr_stay['Hours'][i] - heart_rate_begin_time >= 1.0:
            # the episode has lasted >= 1 hour: include the stay from the
            # episode's first measurement onwards
            all_records.loc[indexes[indexes >= heart_rate_begin_index], 'Inclusion Criteria'] = 1
            break
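
For reference, this is a minimal way to time the loop (the flag_inclusion wrapper name is hypothetical, just for illustration, assuming the loop above is placed inside such a function):

import time

start = time.perf_counter()
flag_inclusion()  # hypothetical wrapper around the loop above
print(f"Elapsed: {time.perf_counter() - start:.2f} s")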

Note that the dataset is ordered by patient and hours.

Is there a way to speed this up? I have thought about built-in functions like groupby, but I'm not sure they would help in this particular case.

You can use groupby and apply in pandas. This should also be faster.

## fill missing values (treat an unknown reading as 90, i.e. >= 90)
all_records['Heart Rate'] = all_records['Heart Rate'].fillna(90)

## use apply
all_records['Inclusion Criteria'] = (all_records.groupby('Icustay')
                                     .apply(lambda x: x['Heart Rate'].ge(90) & x['Hours'].ge(0))
                                     .values.astype(int))

print(all_records)

   Heart Rate  Hours  Icustay  Inclusion Criteria
0        79.0    0.0     1001                   0
1        91.0    1.5     1001                   1
2        90.0    2.7     1001                   1
3        85.0    3.4     1001                   0
4        90.0    0.0     2010                   1
5        94.0   29.4     2010                   1
6        68.0    0.0     3005                   0
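
Note that the lambda above does not actually use any per-group information, so if you keep this logic, the same column can be computed without groupby or apply at all; a minimal sketch:

## same result, fully vectorised, since both conditions are row-wise
all_records['Inclusion Criteria'] = ((all_records['Heart Rate'].fillna(90) >= 90) &
                                     (all_records['Hours'] >= 0)).astype(int)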

This will look a bit ugly, but it avoids both loops and apply (which is essentially just a loop under the hood). I haven't tested it on a large dataset, but I suspect it will be a lot faster than your current code.

First, create some additional columns which contain details of the next/previous lines, since this can be relevant for some of your conditions:

all_records['PrevHeartRate'] = all_records['Heart Rate'].shift()
all_records['NextHours'] = all_records['Hours'].shift(-1)
all_records['PrevICU'] = all_records['Icustay'].shift()
all_records['NextICU'] = all_records['Icustay'].shift(-1)
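
As an aside, the same prev/next information can also be taken per stay with a grouped shift, which keeps values from leaking across stays and would make the PrevICU/NextICU guard columns unnecessary; a minimal sketch of that variant:

# shift within each Icustay: the first/last row of every stay gets NaN
# instead of a neighbouring stay's value
g = all_records.groupby('Icustay')
prev_hr_grouped = g['Heart Rate'].shift()
next_hours_grouped = g['Hours'].shift(-1)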

Next, create a DataFrame containing the first qualifying record per stay id (this is fairly messy due to the amount of logic involved):

# A row qualifies when both hold:
#   1. its heart rate is >= 90, or is unknown directly after a >= 90 reading
#      within the same stay, and
#   2. the >= 1 hour mark is reached at this row, or at the next row of the
#      same stay.
first_per_id = (all_records[((all_records['Heart Rate'] >= 90) |
                            ((all_records['Heart Rate'].isnull()) &
                            (all_records['PrevHeartRate'] >= 90) &
                            (all_records['Icustay'] == all_records['PrevICU']))) &
                            ((all_records['Hours'] >= 1) |
                            ((all_records['NextHours'] >= 1) &
                            (all_records['NextICU'] == all_records['Icustay'])))]
                .drop_duplicates(subset='Icustay', keep='first')[['Icustay']]
                .reset_index()
                .rename(columns={'index': 'first_index'}))

This gives us:

   first_index  Icustay
0            1     1001
1            4     2010

You can drop all the new columns from the original DataFrame now:

all_records.drop(['PrevHeartRate', 'NextHours', 'PrevICU', 'NextICU'], axis=1, inplace=True)

We can then merge this with the original DataFrame:

new = pd.merge(all_records, first_per_id, how='left', on='Icustay')

Giving:

   Heart Rate  Hours  Icustay  Inclusion Criteria  first_index
0        79.0    0.0     1001                   0          1.0
1        91.0    1.5     1001                   0          1.0
2         NaN    2.7     1001                   0          1.0
3        85.0    3.4     1001                   0          1.0
4        90.0    0.0     2010                   0          4.0
5        94.0   29.4     2010                   0          4.0
6        68.0    0.0     3005                   0          NaN

From here we can compare 'first_index' (the first qualifying index for that stay) to the actual index; note that a comparison against NaN evaluates to False, so stays with no qualifying record remain excluded:

new['Inclusion Criteria'] = new.index >= new['first_index']

This gives:

   Heart Rate  Hours  Icustay  Inclusion Criteria  first_index
0        79.0    0.0     1001               False          1.0
1        91.0    1.5     1001                True          1.0
2         NaN    2.7     1001                True          1.0
3        85.0    3.4     1001                True          1.0
4        90.0    0.0     2010                True          4.0
5        94.0   29.4     2010                True          4.0
6        68.0    0.0     3005               False          NaN

From here, we just need to tidy up (convert the results column to integer and delete the first_index column):

new.drop('first_index', axis=1, inplace=True)
new['Inclusion Criteria'] = new['Inclusion Criteria'].astype(int)

Giving the final desired results:

   Heart Rate  Hours  Icustay  Inclusion Criteria
0        79.0    0.0     1001                   0
1        91.0    1.5     1001                   1
2         NaN    2.7     1001                   1
3        85.0    3.4     1001                   1
4        90.0    0.0     2010                   1
5        94.0   29.4     2010                   1
6        68.0    0.0     3005                   0
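
For convenience, all of the steps above can be combined into one function; a sketch, assuming (as in the question) a default integer index and rows sorted by Icustay and Hours:

def add_inclusion_criteria(df):
    # helper columns: previous/next rows (see above)
    prev_hr = df['Heart Rate'].shift()
    prev_icu = df['Icustay'].shift()
    next_hours = df['Hours'].shift(-1)
    next_icu = df['Icustay'].shift(-1)

    # condition 1: heart rate >= 90, or unknown directly after a >= 90
    # reading within the same stay
    hr_ok = (df['Heart Rate'] >= 90) | (df['Heart Rate'].isnull() &
                                        (prev_hr >= 90) &
                                        (df['Icustay'] == prev_icu))
    # condition 2: the >= 1 hour mark is reached at this row, or at the
    # next row of the same stay
    hours_ok = (df['Hours'] >= 1) | ((next_hours >= 1) & (next_icu == df['Icustay']))

    first_per_id = (df[hr_ok & hours_ok]
                    .drop_duplicates(subset='Icustay', keep='first')[['Icustay']]
                    .reset_index()
                    .rename(columns={'index': 'first_index'}))

    out = df.merge(first_per_id, how='left', on='Icustay')
    out['Inclusion Criteria'] = (out.index >= out['first_index']).astype(int)
    return out.drop('first_index', axis=1)

new = add_inclusion_criteria(all_records)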
