Python Feature Engineering Patients Data
Hi, I'm trying to feature engineer a Patient dataset from movement level to patient level.
The original df looks like this (the code that builds it is at the end of the question):
Conditions:
1) Create a Last Platelets Change col - For a CaseNo that encounters the Category value 'ICU', take the Platelets change before the 'ICU' value (189-180 for CaseNo 1), else take the latest Platelets change (256-266 for CaseNo 2).
2) Create a Platelets_Pattern col - For a CaseNo that encounters the Category value 'ICU', pivot all the Platelets values from the start until just before the 'ICU' value. Else pivot all Platelets values from start to end.
3) Create a Last Platelets Count col - For a CaseNo that encounters the Category value 'ICU', take the last Platelets value before the 'ICU' encounter. Else take the last Platelets value.
Expected Outcome:
How do I go about this in Python? The 'ICU' value part is tripping me up.
Code for df:
import pandas as pd

df = pd.DataFrame({'CaseNo':[1,1,1,1,2,2,2,2],
                   'Movement_Sequence_No':[1,2,3,4,1,2,3,4],
                   'Movement_Start_Date':['2020-02-09 22:17:00','2020-02-10 17:19:41','2020-02-17 08:04:19',
                                          '2020-02-18 11:22:52','2020-02-12 23:00:00','2020-02-24 10:26:35',
                                          '2020-03-03 17:50:00','2020-03-17 08:24:19'],
                   'Movement_End_Date':['2020-02-10 17:19:41','2020-02-17 08:04:19','2020-02-18 11:22:52',
                                        '2020-02-25 13:55:37','2020-02-24 10:26:35','2020-03-03 17:50:00',
                                        '2222-12-31 23:00:00','2020-03-18 18:50:00'],
                   'Category':['A','A','ICU','A','B','B','B','B'],
                   'RequestDate':['2020-02-10 16:00:00','2020-02-16 13:04:20','2020-02-18 07:11:11','2020-02-21 21:30:30',
                                  '2020-02-13 22:00:00','NA','2020-03-15 09:40:00','2020-03-18 15:10:10'],
                   'Platelets':['180','189','190','188','328','NA','266','256'],
                   'Age':['65','65','65','65','45','45','45','45']})
You could use groupby to group the dataframe on CaseNo and then apply a custom function on each group to produce the expected values.
For each group, first find the index of the row just before an ICU category, if there is one. That gives the list of Platelets to process (do not forget to remove the NA values). Then just do trivial operations to compute the results and return a Series per group:
import numpy as np
import pandas as pd

def process(x):
    age = x.at[x.first_valid_index(), 'Age']  # store age
    # compute index of last row before Category ICU (or get None)
    ix = x[x['Category'].shift(-1) == 'ICU'].first_valid_index()
    # get list of non NA Platelets up to and including ix (get all if ix is None)
    platelets = [i for i in x.loc[:ix, 'Platelets'] if i != 'NA']
    # initialize change and count to np.nan (in case less than 2 Platelets)
    change = count = np.nan
    try:
        count = platelets[-1]
        change = int(platelets[-1]) - int(platelets[-2])
    except IndexError:  # if less than 2 platelets, values will stay at NaN
        pass
    return pd.Series({'Last Platelets Change': change,
                      'Platelets_Pattern': ','.join(platelets),
                      'Last Platelets Count': count,
                      'Age': age})

result = df.groupby('CaseNo').apply(process).reset_index()
With your sample df, it gives the expected result:
   CaseNo  Last Platelets Change Platelets_Pattern Last Platelets Count Age
0       1                      9           180,189                  189  65
1       2                    -10       328,266,256                  256  45
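Since the 'ICU' cutoff is the part that causes trouble, here is a minimal sketch of a different way to express it, offered as an alternative and not as the method used above. It assumes that "before ICU" means every row of a case that precedes its first 'ICU' row, and that rows are already in movement order. The names icu_seen, before_icu and summary are made up for this illustration.

```python
import pandas as pd

# Same sample data as in the question, trimmed to the columns that matter here.
df = pd.DataFrame({
    'CaseNo': [1, 1, 1, 1, 2, 2, 2, 2],
    'Category': ['A', 'A', 'ICU', 'A', 'B', 'B', 'B', 'B'],
    'Platelets': ['180', '189', '190', '188', '328', 'NA', '266', '256'],
})

# Per case, keep a running count of 'ICU' rows seen so far (inclusive).
# Rows where the count is still 0 come before the first 'ICU' row;
# a case with no 'ICU' row keeps all of its rows.
icu_seen = df['Category'].eq('ICU').astype(int).groupby(df['CaseNo']).cumsum()
before_icu = df[icu_seen.eq(0)].copy()

# Turn the 'NA' strings into real missing values, then drop them.
before_icu['Platelets'] = pd.to_numeric(before_icu['Platelets'], errors='coerce')
before_icu = before_icu.dropna(subset=['Platelets'])

grouped = before_icu.groupby('CaseNo')['Platelets']
summary = pd.DataFrame({
    # difference between the last two remaining values (NaN if only one value)
    'Last Platelets Change': grouped.apply(lambda s: s.diff().iloc[-1]),
    'Platelets_Pattern': grouped.apply(lambda s: ','.join(str(int(v)) for v in s)),
    'Last Platelets Count': grouped.last(),
})
print(summary)
```

On the sample data this gives the same change, pattern and count per CaseNo as the result above, with the numeric columns as floats instead of strings. One difference to be aware of: a case whose very first row is 'ICU' has no rows left after filtering and therefore drops out of summary, so it would need to be re-added (for example with reindex) if every CaseNo must appear.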