Python: feature engineering patient data

Question

Hi, I'm trying to feature engineer a patient dataset from movement level to patient level.

The original df looks like this:

Conditions:
1) Create a Last Platelets Change col - For a CaseNo that encounters the Category value 'ICU', take the Platelets change before the 'ICU' value (189-180 for CaseNo 1), else take the latest Platelets change (256-266 for CaseNo 2).

2) Create a Platelets_Pattern col - For a CaseNo that encounters the Category value 'ICU', pivot all the Platelets values from the start until just before the 'ICU' value. Else pivot all Platelets values from start to end.

3) Create a Last Platelets Count col - For a CaseNo that encounters the Category value 'ICU', take the last Platelets value before the 'ICU' encounter. Else take the last Platelets value.

Expected outcome:

How do I go about this in Python? The 'ICU' value part is tripping me up.

Code for df:

import pandas as pd

df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
                    'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
                    'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
                                           '2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
                                           '2020-03-03 17:50:00','2020-03-17 08:24:19'],
                    'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
                                         '2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
                                         '2222-12-31 23:00:00','2020-03-18 18:50:00'],
                    'Category':['A','A','ICU','A','B','B','B','B'],
                    'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
                                   '2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
                    'Platelets':['180','189','190','188','328','NA','266','256'],
                    'Age':['65','65','65','65','45','45','45','45']})

Answer 1

You could use groupby to group the dataframe on CaseNo and then apply a custom function to each group to produce the expected values.

For each group, first find the index of the row just before an ICU category, if any, to get the list of Platelets to process (do not forget to remove NA values). Then just do trivial operations to compute the results and return a Series per group:

import numpy as np
import pandas as pd

def process(x):
    age = x.at[x.first_valid_index(), 'Age']  # store age
    # compute index of last row before Category ICU (or get None)
    ix = x[x['Category'].shift(-1) == 'ICU'].first_valid_index()
    # get list of non NA Platelets before ix (get all if ix is None)
    platelets = [i for i in x.loc[:ix,'Platelets'] if i != 'NA']
    # initialize change and count to np.nan (in case there are too few Platelets)
    change = count = np.nan
    try:
        count = platelets[-1]   # raises IndexError if there are no Platelets at all
        change = int(platelets[-1]) - int(platelets[-2])
    except IndexError:   # count stays NaN with no Platelets, change with fewer than 2
        pass
    return pd.Series({'Last Platelets Change': change,
              'Platelets_Pattern': ','.join(platelets),
              'Last Platelets Count': count,
              'Age': age})

result = df.groupby('CaseNo').apply(process).reset_index()
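
To see what the shift(-1) comparison inside process selects, here is a small sketch on the Category values of CaseNo 1 alone. The standalone series and the variable names are mine, for illustration only:

```python
import pandas as pd

# Category values of CaseNo 1 from the sample df (illustrative standalone copy)
category = pd.Series(['A', 'A', 'ICU', 'A'], index=[0, 1, 2, 3])

# shift(-1) moves every value up one row, so each row sees the next row's category
next_category = category.shift(-1)

# True only on the row immediately before the 'ICU' row
before_icu = next_category == 'ICU'

# index label of that row: 1 here
ix = category[before_icu].first_valid_index()

# a group without any 'ICU' row yields None, and .loc[:None] then keeps every row
no_icu = pd.Series(['B', 'B', 'B', 'B'])
ix_none = no_icu[no_icu.shift(-1) == 'ICU'].first_valid_index()
```

One caveat that follows from this: if 'ICU' is the very first row of a group, no row precedes it, so the index is also None and process would use all rows of that group. The sample data does not contain that case.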

With your sample df, result is as expected:

   CaseNo  Last Platelets Change Platelets_Pattern Last Platelets Count Age
0       1                      9           180,189                  189  65
1       2                    -10       328,266,256                  256  45
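
As an alternative to the per-group function, the same three columns can probably be built without apply, by masking out every row from the first 'ICU' onward and then aggregating. This is only a sketch under my own assumptions, not part of the answer above: rows are already ordered by Movement_Sequence_No within each CaseNo, the helper columns Platelets_num and change are hypothetical names, and only the columns needed here are copied from the sample df.

```python
import pandas as pd

# Trimmed copy of the sample df (only the columns this sketch needs)
df = pd.DataFrame({
    'CaseNo': [1, 1, 1, 1, 2, 2, 2, 2],
    'Category': ['A', 'A', 'ICU', 'A', 'B', 'B', 'B', 'B'],
    'Platelets': ['180', '189', '190', '188', '328', 'NA', '266', '256'],
    'Age': ['65', '65', '65', '65', '45', '45', '45', '45'],
})

# Numeric copy of Platelets; the string 'NA' becomes NaN
df['Platelets_num'] = pd.to_numeric(df['Platelets'], errors='coerce')

# True from the first 'ICU' row onward within each CaseNo
is_icu = df['Category'].eq('ICU').astype(int)
at_or_after_icu = is_icu.groupby(df['CaseNo']).cumsum().gt(0)

# Keep rows strictly before 'ICU' (all rows if there is none) that have a count
kept = df[~at_or_after_icu & df['Platelets_num'].notna()]

# Row-to-row change within each CaseNo (NaN on the first kept row of a group)
kept = kept.assign(change=kept.groupby('CaseNo')['Platelets_num'].diff())

summary = kept.groupby('CaseNo').agg(**{
    'Last Platelets Change': ('change', 'last'),
    'Platelets_Pattern': ('Platelets', lambda s: ','.join(s)),
    'Last Platelets Count': ('Platelets_num', 'last'),
    'Age': ('Age', 'first'),
}).reset_index()
```

Unlike process, this version returns the change and count as floats, and a CaseNo with no usable row before 'ICU' drops out of summary entirely, so it would need a reindex if every CaseNo must appear.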

Answer 1: accepted 2020-05-23 10:30:57
