Is there a better way to handle NaN values?

Question

I have an input dataframe:


    KPI_ID       KPI_Key1      KPI_Key2   KPI_Key3
       A        (C602+C603)     C601         75
       B        (C605+C606)     C602         NaN
       C          75          L239+C602      NaN
       D       (32*(C603+44))   75           NaN
       E         L239           NaN          C601

I have an indicator df:

              99    75  C604    C602    C601    C603    C605    C606    44  L239    32
PatientID                                           
1             1     0    1       0       1        0       0      0       1    0     1
2             0     0    0       0       0        0       1      1       0    0     0
3             1     1    1       1       0        1       1      1       1    1     1
4             0     0    0       0       0        1       0      1       0    1     0
5             1     0    1       1       1        1       0      1       1    1     1

Source:


    import numpy as np
    import pandas as pd

    input_df = pd.DataFrame({'KPI_ID': ['A','B','C','D','E'], 'KPI_Key1': ['(C602+C603)','(C605+C606)','75','(32*(C603+44))','L239'] , 'KPI_Key2' : ['C601','C602','L239+C602','75',np.nan] , 'KPI_Key3' : ['75',np.nan,np.nan,np.nan,'C601']})

    indicator_df = pd.DataFrame({'PatientID': [1,2,3,4,5],'99' : ['1','0','1','0','1'],'75' : ['0','0','1','0','0'],'C604' : ['1','0','1','0','1'],'C602' : ['0','0','1','0','1'],'C601' : ['1','0','0','0','1'],'C603' : ['0','0','1','1','1'],'C605' : ['0','1','1','0','0'],'C606' : ['0','1','1','1','1'],'44' : ['1','0','1','0','1'],'L239' : ['0','0','1','1','1'], '32' : ['1','0','1','0','1'],}).set_index('PatientID')

My goal is to create an output df like this (by evaluating the formulas in input_df against indicator_df):

final_out_df:

    PatientID    KPI_ID  KPI_Key1   KPI_Key2    KPI_Key3
    1              A         0         1          0
    2              A         0         0          0
    3              A         2         0          1
    4              A         1         0          0
    5              A         2         1          0
    1              B         0         0          0
    2              B         2         0          0
    3              B         2         1          0
    ...           ...      ...        ...         ...

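I'm very close and my logic works fine, except that I can't handle the NaN values in input_df. I can generate the output for KPI_ID 'A' because none of its three formulas (KPI_Key1, KPI_Key2, KPI_Key3) is null, but I can't generate it for 'B'. Is there anything I can do other than substituting a dummy variable for NaN and creating a matching entry in indicator_df? Here is what I have done so far: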


    import re

    indicator_df = indicator_df.astype('int32')
    final_out_df = pd.DataFrame()
    out_df = pd.DataFrame(index=indicator_df.index)
    out_df.reset_index(level=0, inplace=True)
    # running the loop only for 'A' so it won't fail
    for i in range(0, len(input_df) - 4):
        for j in ['KPI_Key1', 'KPI_Key2', 'KPI_Key3']:
            exp = input_df[j].iloc[i]
            # wrap every token in backticks so eval treats it as a column name
            temp_out_df = indicator_df.eval(re.sub(r'(\w+)', r'`\1`', exp)).reset_index(name=j)
            out_df['KPI_ID'] = input_df['KPI_ID'].iloc[i]
            out_df = out_df.merge(temp_out_df, on='PatientID', how='left')
        final_out_df = pd.concat([final_out_df, out_df])
        out_df = pd.DataFrame(index=indicator_df.index)
        out_df.reset_index(level=0, inplace=True)

Answer 1

Replace NaN with None and build a dict of local variables so that pd.eval can evaluate each formula correctly:

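def eval_kpi(row):
    # Take this row's KPI_Key* formulas; a NaN formula becomes the string 'None'
    kpi = row.filter(like='KPI_Key').fillna('None')
    # Evaluate the formulas against this patient's indicator values
    return pd.Series(pd.eval(kpi, local_dict=row['local_vars']), index=kpi.index)


# One dict of indicator values per patient, cross-joined with every KPI row
final_out_df = indicator_df.astype(int).apply(dict, axis=1) \
                           .rename('local_vars').reset_index() \
                           .merge(input_df, how='cross')

# Replace the formula strings with their evaluated results
final_out_df.update(final_out_df.apply(eval_kpi, axis=1))
final_out_df = final_out_df.drop(columns='local_vars') \
                           .sort_values(['KPI_ID', 'PatientID']) \
                           .reset_index(drop=True)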

Output:

>>> final_out_df
    PatientID KPI_ID KPI_Key1 KPI_Key2 KPI_Key3
0           1      A      0.0      1.0     75.0
1           2      A      0.0      0.0     75.0
2           3      A      2.0      0.0     75.0
3           4      A      1.0      0.0     75.0
4           5      A      2.0      1.0     75.0
5           1      B      0.0      0.0      NaN
6           2      B      2.0      0.0      NaN
7           3      B      2.0      1.0      NaN
8           4      B      1.0      0.0      NaN
9           5      B      1.0      1.0      NaN
10          1      C     75.0      0.0      NaN
11          2      C     75.0      0.0      NaN
12          3      C     75.0      2.0      NaN
13          4      C     75.0      1.0      NaN
14          5      C     75.0      2.0      NaN
15          1      D   1408.0     75.0      NaN
16          2      D   1408.0     75.0      NaN
17          3      D   1440.0     75.0      NaN
18          4      D   1440.0     75.0      NaN
19          5      D   1440.0     75.0      NaN
20          1      E      0.0      NaN      1.0
21          2      E      0.0      NaN      0.0
22          3      E      1.0      NaN      0.0
23          4      E      1.0      NaN      0.0
24          5      E      1.0      NaN      1.0
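
One thing to watch in the output above: a token such as 75 is evaluated as a numeric literal (KPI_Key3 is 75.0 for KPI_ID A), whereas the question's expected output reads it as the indicator column named 75 (0, 0, 1, 0, 0). The sketch below illustrates the difference on a small hypothetical frame (the data and variable names are made up for illustration). Backtick quoting, as done by the question's regex, makes DataFrame.eval resolve the token as a column name.

```python
import pandas as pd

# Hypothetical mini frame: one column is literally named "75".
df = pd.DataFrame({"75": [0, 0, 1], "C601": [1, 0, 0]})

# Unquoted: 75 is the number seventy-five.
as_literal = df.eval("C601 + 75")

# Backtick-quoted: `75` is looked up as a column of df.
as_column = df.eval("C601 + `75`")

print(as_literal.tolist())
print(as_column.tolist())
```

Whether literal or column semantics is wanted depends on the formulas; the question's expected output suggests the column reading.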

Answer 2

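I was able to solve it by adding:

if exp == exp:

before parsing exp with the regex. NaN is the only value that is not equal to itself, so the check is false exactly when the formula is missing.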
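
A minimal sketch of where that check could sit, assuming the loop from the question and a trimmed stand-in for its data (the small frames and the names `key_cols` and `frames` are illustrative, not from the original post). `pd.notna(exp)` would be a more explicit spelling of the same test.

```python
import re

import numpy as np
import pandas as pd

# Trimmed stand-in data (hypothetical subset of the question's frames).
input_df = pd.DataFrame({
    "KPI_ID": ["A", "B"],
    "KPI_Key1": ["(C602+C603)", "(C605+C606)"],
    "KPI_Key2": ["C601", "C602"],
    "KPI_Key3": ["75", np.nan],
})
indicator_df = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "75": [0, 0, 1],
    "C601": [1, 0, 0],
    "C602": [0, 0, 1],
    "C603": [0, 0, 1],
    "C605": [0, 1, 1],
    "C606": [0, 1, 1],
}).set_index("PatientID")

key_cols = ["KPI_Key1", "KPI_Key2", "KPI_Key3"]
frames = []
for _, kpi in input_df.iterrows():
    # One block of rows per KPI: every patient paired with this KPI_ID.
    out = pd.DataFrame({"PatientID": indicator_df.index, "KPI_ID": kpi["KPI_ID"]})
    for col in key_cols:
        exp = kpi[col]
        if exp == exp:  # False only for NaN, i.e. a missing formula
            # Backtick-quote every token so eval treats it as a column name.
            quoted = re.sub(r"(\w+)", r"`\1`", exp)
            out[col] = indicator_df.eval(quoted).to_numpy()
        else:
            out[col] = np.nan  # keep the cell empty instead of evaluating
    frames.append(out)

final_out_df = pd.concat(frames, ignore_index=True)
print(final_out_df)
```

Collecting the per-KPI frames in a list and concatenating once avoids growing a DataFrame inside the loop.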

Solution 1
0 2021-10-07 10:31:51

Solution 2
0 2021-10-07 17:17:48
