Pandas 用来自不同数据帧的数据替换所有 NaN 值

Question

I'm pretty new to Pandas and am kind of stucked with a problem to replace Nan-Values with median values from a different dataframe.我对 Pandas 很陌生，并且遇到了一个问题，即用来自不同数据帧的中值替换 Nan-Values。 The median dataframe has a different form, because I had to group the original df to get the medians.中位数数据框具有不同的形式，因为我必须对原始 df 进行分组以获得中位数。

My main dataframe df1 looks something like this:我的主要数据框 df1 看起来像这样：

      permno    yyyymm  BookLeverage Cash   RoE        ShareIss1Y   ShareIss5Y   SP         date        industry_id     STreversal  Price         Size      ret
541     10006   197101  -1.907577   NaN     0.114616    0.000000    0.051689    1.197606    1971-01-29  37              -4.383562   -3.863358   -12.496377  0.043836
542     10006   197102  -1.907577   NaN     0.114616    0.000000    0.051689    1.220021    1971-02-26  37              0.577428    -3.844814   -12.477833  -0.005774
543     10006   197103  -1.907577   NaN     0.114616    0.000000    0.051689    1.118353    1971-03-31  37              -9.090909   -3.931826   -12.564844  0.090909
544     10006   197104  -1.907577   NaN     0.114616    0.000000    0.051689    NaN         1971-04-30  37              -16.176471  -4.081766   -12.714785  0.161765
545     10006   197105  -1.907577   NaN     0.114616    0.000000    0.051689    1.025366    1971-05-28  37              5.105485    -4.018633   -12.651651  -0.051055

Then I created a new dataframe df2 in which I grouped the former df by the yyyymm and industry_id column, and got the median for each time-industry panel.然后我创建了一个新的数据框 df2，其中我将前一个 df 按yyyymm和industry_id ID 列分组，并获得了每个时间行业面板的中位数。

The median df2 looks something like this:中位数 df2 看起来像这样：

                     permno  BookLeverage  Cash       RoE  ShareIss1Y  \
yyyymm industry_id                                                      
197101 01           40957.5     -2.451327   NaN  0.015212   -0.306936   
       10           19254.0     -1.300565   NaN  0.123353   -0.002747   
       12           33081.5     -2.102402   NaN -0.001043   -0.255756   
       13           26470.0     -2.028418   NaN  0.116907   -0.005262   
       14           17830.0     -1.266574   NaN  0.110059   -0.000193   
...                     ...           ...   ...       ...         ...   
202112 80           78633.0     -3.037694   NaN  0.195342         NaN   
       82           52123.0     -3.093551   NaN  0.017580         NaN   
       83           13739.0     -2.802522   NaN  0.021025         NaN   
       87           78667.5     -3.103168   NaN  0.104524         NaN   
       97           91547.0     -3.054443   NaN  0.162610         NaN   

                    ShareIss5Y        SP  STreversal     Price       Size  \
yyyymm industry_id                                                          
197101 01            -7.591944  5.439985   -9.998244 -2.684046 -11.483201   
       10            -1.432833  0.517484   -4.504504 -3.367296 -11.826440   
       12           -20.622667  2.264890  -22.648810 -2.873900 -11.501783   
       13            -0.257821  0.752112   -5.429864 -3.607534 -12.362360   
       14            -0.223948  0.636665  -16.075773 -2.729726 -11.386150   
...                        ...       ...         ...       ...        ...   
202112 80                  NaN       NaN  -10.960198 -4.539740 -16.024733   
       82                  NaN       NaN   -1.664319 -2.740474 -13.882130   
       83                  NaN       NaN   -2.383083 -4.835329 -15.843560   
       87                  NaN       NaN   -5.109321 -4.585741 -15.844537   
       97                  NaN       NaN   -1.535659 -4.487512 -16.339328   

                         ret  
yyyymm industry_id            
197101 01           0.099982  
       10           0.045045  
       12           0.226488  
       13           0.054299  
       14           0.160758  
...                      ...  
202112 80           0.109602  
       82           0.016643  
       83           0.023831  
       87           0.051093  
       97           0.015357

What I'm now trying to achieve, is to fill the NaN-values in the df1 with the corresponding value from df2.我现在想要实现的是用 df2 中的相应值填充 df1 中的 NaN 值。 So that for example the SP column in row 544 would get the value which is in df2 at yyyymm 197104 with industry_id 37.因此，例如，第 544 行中的 SP 列将获得 df2 中yyyymm 197104 中的值， industry_id ID 为 37。

I tried to map over all rows and inside that over all columns and replace the NaN-values, but this broke my dataframe:我试图映射所有行和所有列内部并替换 NaN 值，但这破坏了我的数据框：

def fill_nan_with_median(row):
    date = int(row['yyyymm'])
    industry = row['industry_id']


    for label, column in row.items():
        if column == np.nan:
            median = df_median.loc[(date, industry), label]
            df_1.loc[index, label] = median
    

for index, row in df_1.iterrows():
    fill_nan_with_median(row)

Answer 1

This is all done without data, therefore you may need to change something (hopefully not),这一切都是在没有数据的情况下完成的，因此您可能需要更改某些内容（希望不会），

df_grouped_median = df1.groupby(['yyyymm', 'industry_id'], as_index=False).SP.median().rename(
    columns={"SP":"median"})
df = df.merge(df_grouped_median, on=['yyyymm', 'industry_id'], how='left')
df['SP'].fillna(df['median'])

Answer 2

This answer takes a table lookup approach.这个答案采用表查找方法。 For NaNs in the SP column it does a lookup into df2 for the median SP value.对于SP列中的NaNs ，它会在df2中查找SP中值。 This answer also assumes that yyyymm and industry_id are strings and not numeric.此答案还假设yyyymm和industry_id ID 是字符串而不是数字。

df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)

541    1.197606
542    1.220021
543    1.118353
544    0.636665
545    1.025366

Note that non-NaNs are detected by the weird looking x['SP']==x['SP'] leveraging the fact that NaN != NaN .请注意，非 NaN 是由看起来很奇怪的x['SP']==x['SP']检测到的，它利用了NaN != NaN的事实。

Your df1 was used along with a df2 that I created:您的df1与我创建的df2一起使用：

                          SP
yyyymm industry_id          
197104 01           5.439985
       10           0.517484
       12           2.264890
       13           0.752112
       37           0.636665

All that you need to do after that is assign that back to the df1 frame:之后您需要做的就是将其分配回df1框架：

df1.assign(SP=df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1))

     permno  yyyymm  BookLeverage  Cash       RoE  ShareIss1Y  ShareIss5Y  \
541   10006  197101     -1.907577   NaN  0.114616         0.0    0.051689   
542   10006  197102     -1.907577   NaN  0.114616         0.0    0.051689   
543   10006  197103     -1.907577   NaN  0.114616         0.0    0.051689   
544   10006  197104     -1.907577   NaN  0.114616         0.0    0.051689   
545   10006  197105     -1.907577   NaN  0.114616         0.0    0.051689   

           SP        date industry_id  STreversal     Price       Size  \
541  1.197606  1971-01-29          37   -4.383562 -3.863358 -12.496377   
542  1.220021  1971-02-26          37    0.577428 -3.844814 -12.477833   
543  1.118353  1971-03-31          37   -9.090909 -3.931826 -12.564844   
544  0.636665  1971-04-30          37  -16.176471 -4.081766 -12.714785   
545  1.025366  1971-05-28          37    5.105485 -4.018633 -12.651651   

          ret  
541  0.043836  
542 -0.005774  
543  0.090909  
544  0.161765  
545 -0.051055

Or by:或通过：

df1['SP'] = df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)

Pandas 用来自不同数据帧的数据替换所有 NaN 值

问题描述

2 个解决方案

解决方案1
1 2022-06-17 16:13:27

解决方案2
0 已采纳 2022-06-17 16:39:25

Pandas 用来自不同数据帧的数据替换所有 NaN 值

问题描述

2 个解决方案

解决方案1 1 2022-06-17 16:13:27

解决方案2 0 已采纳 2022-06-17 16:39:25

解决方案1
1 2022-06-17 16:13:27

解决方案2
0 已采纳 2022-06-17 16:39:25