Is there a better way to handle NaN values?
I have an input dataframe:

KPI_ID  KPI_Key1         KPI_Key2   KPI_Key3
A       (C602+C603)      C601       75
B       (C605+C606)      C602       NaN
C       75               L239+C602  NaN
D       (32*(C603+44))   75         NaN
E       L239             NaN        C601

I have an indicator df:

           99  75  C604  C602  C601  C603  C605  C606  44  L239  32
PatientID
1           1   0     1     0     1     0     0     0   1     0   1
2           0   0     0     0     0     0     1     1   0     0   0
3           1   1     1     1     0     1     1     1   1     1   1
4           0   0     0     0     0     1     0     1   0     1   0
5           1   0     1     1     1     1     0     1   1     1   1
Source:

import numpy as np
import pandas as pd

input_df = pd.DataFrame({
    'KPI_ID': ['A', 'B', 'C', 'D', 'E'],
    'KPI_Key1': ['(C602+C603)', '(C605+C606)', '75', '(32*(C603+44))', 'L239'],
    'KPI_Key2': ['C601', 'C602', 'L239+C602', '75', np.nan],
    'KPI_Key3': ['75', np.nan, np.nan, np.nan, 'C601'],
})

indicator_df = pd.DataFrame({
    'PatientID': [1, 2, 3, 4, 5],
    '99': ['1', '0', '1', '0', '1'],
    '75': ['0', '0', '1', '0', '0'],
    'C604': ['1', '0', '1', '0', '1'],
    'C602': ['0', '0', '1', '0', '1'],
    'C601': ['1', '0', '0', '0', '1'],
    'C603': ['0', '0', '1', '1', '1'],
    'C605': ['0', '1', '1', '0', '0'],
    'C606': ['0', '1', '1', '1', '1'],
    '44': ['1', '0', '1', '0', '1'],
    'L239': ['0', '0', '1', '1', '1'],
    '32': ['1', '0', '1', '0', '1'],
}).set_index('PatientID')
My goal is to create an output df like this (by evaluating input_df against indicator_df):

final_out_df:

PatientID  KPI_ID  KPI_Key1  KPI_Key2  KPI_Key3
1          A       0         1         0
2          A       0         0         0
3          A       2         0         1
4          A       1         0         0
5          A       2         1         0
1          B       0         0         0
2          B       2         0         0
3          B       2         1         0
...        ...     ...       ...       ...
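I am very close and my logic works fine, except that I cannot handle the NaN values in input_df. I am able to generate the output for KPI_ID 'A' because none of its three formulas (KPI_Key1, KPI_Key2, KPI_Key3) is null, but I cannot generate it for 'B'.

Is there anything I can do other than substituting a dummy variable for NaN and adding it to indicator_df?

Here is what I have done so far: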
import re

indicator_df = indicator_df.astype('int32')
final_out_df = pd.DataFrame()
out_df = pd.DataFrame(index=indicator_df.index)
out_df.reset_index(level=0, inplace=True)
final_out_df = pd.DataFrame()

# running loop only for 'A' so it won't fail
for i in range(0, len(input_df) - 4):
    for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
        exp = input_df[j].iloc[i]
        # wrap every token in backticks so names like 75 resolve to columns
        temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
        out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
        out_df = out_df.merge(temp_out_df, on='PatientID', how='left')
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    final_out_df = pd.concat([final_out_df, out_df])
    out_df = pd.DataFrame(index=indicator_df.index)
    out_df.reset_index(level=0, inplace=True)
Replace NaN with None and create a dict of local variables so that pd.eval can evaluate the expressions correctly:
def eval_kpi(row):
    kpi = row.filter(like='KPI_Key').fillna('None')
    return pd.Series(pd.eval(kpi, local_dict=row['local_vars']), index=kpi.index)

final_out_df = indicator_df.astype(int).apply(dict, axis=1) \
                           .rename('local_vars').reset_index() \
                           .merge(input_df, how='cross')

final_out_df.update(final_out_df.apply(eval_kpi, axis=1))

final_out_df = final_out_df.drop(columns='local_vars') \
                           .sort_values(['KPI_ID', 'PatientID']) \
                           .reset_index(drop=True)
Output:
>>> final_out_df
    PatientID KPI_ID  KPI_Key1  KPI_Key2  KPI_Key3
0           1      A       0.0       1.0      75.0
1           2      A       0.0       0.0      75.0
2           3      A       2.0       0.0      75.0
3           4      A       1.0       0.0      75.0
4           5      A       2.0       1.0      75.0
5           1      B       0.0       0.0       NaN
6           2      B       2.0       0.0       NaN
7           3      B       2.0       1.0       NaN
8           4      B       1.0       0.0       NaN
9           5      B       1.0       1.0       NaN
10          1      C      75.0       0.0       NaN
11          2      C      75.0       0.0       NaN
12          3      C      75.0       2.0       NaN
13          4      C      75.0       1.0       NaN
14          5      C      75.0       2.0       NaN
15          1      D    1408.0      75.0       NaN
16          2      D    1408.0      75.0       NaN
17          3      D    1440.0      75.0       NaN
18          4      D    1440.0      75.0       NaN
19          5      D    1440.0      75.0       NaN
20          1      E       0.0       NaN       1.0
21          2      E       0.0       NaN       0.0
22          3      E       1.0       NaN       0.0
23          4      E       1.0       NaN       0.0
24          5      E       1.0       NaN       1.0
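
One thing to note about the output above: numeric-looking names such as 75, 32 and 44 appear to have been evaluated as numbers (KPI_Key3 for 'A' is 75.0, KPI_Key1 for 'D' is 1408.0 or 1440.0), whereas the expected output in the question treats them as columns of indicator_df. If column lookup is what is wanted, one possible variant is sketched below. It is not the method from the answer above: it reuses the backtick-quoting regex from the question together with DataFrame.eval, and leaves NaN wherever a formula is missing. The helper name eval_formula and the trimmed three-patient data are assumptions made for illustration.

```python
import re

import numpy as np
import pandas as pd

# First three patients from the question, with only the columns used below.
indicator_df = pd.DataFrame({
    'PatientID': [1, 2, 3],
    '75': [0, 0, 1],
    'C601': [1, 0, 0],
    'C602': [0, 0, 1],
    'C603': [0, 0, 1],
    'C605': [0, 1, 1],
    'C606': [0, 1, 1],
}).set_index('PatientID')

input_df = pd.DataFrame({
    'KPI_ID': ['A', 'B'],
    'KPI_Key1': ['(C602+C603)', '(C605+C606)'],
    'KPI_Key2': ['C601', 'C602'],
    'KPI_Key3': ['75', np.nan],
})

KEY_COLS = ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']


def eval_formula(expr, indicators):
    """Hypothetical helper: evaluate one formula for every patient."""
    if pd.isna(expr):
        # No formula for this KPI key: return NaN for every patient.
        return pd.Series(np.nan, index=indicators.index)
    # Backtick-quote each token so '75' refers to the column, not the number.
    quoted = re.sub(r'(\w+)', r'`\1`', expr)
    return indicators.eval(quoted)


blocks = []
for _, kpi in input_df.iterrows():
    block = pd.DataFrame(
        {col: eval_formula(kpi[col], indicator_df) for col in KEY_COLS},
        index=indicator_df.index,
    )
    block.insert(0, 'KPI_ID', kpi['KPI_ID'])
    blocks.append(block.reset_index())

final_out_df = pd.concat(blocks, ignore_index=True)
print(final_out_df)
```

Collecting one block per KPI and concatenating once at the end avoids growing a dataframe inside the loop.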
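I was able to solve it by adding
if exp == exp:
before parsing exp with the regex. The check works because NaN is the only value that is not equal to itself, so NaN formulas are skipped.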
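
A minimal sketch of why the `exp == exp` check filters out NaN, using the three formulas of KPI 'B' from the question. The variable names here are assumptions for illustration. `pd.isna` is shown alongside as a more explicit alternative; unlike `exp == exp`, it also treats None as missing.

```python
import numpy as np
import pandas as pd

# Formulas for KPI 'B' from the question; KPI_Key3 is missing.
formulas = pd.Series(
    ['(C605+C606)', 'C602', np.nan],
    index=['KPI_Key1', 'KPI_Key2', 'KPI_Key3'],
)

# NaN compares unequal to itself, so `exp == exp` is False only for NaN.
to_evaluate = [key for key, exp in formulas.items() if exp == exp]

# The same selection written with pd.isna, which states the intent directly.
to_evaluate_isna = [key for key, exp in formulas.items() if not pd.isna(exp)]

print(to_evaluate)
print(to_evaluate_isna)
```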