
Pandas replace nan with mean value for a given grouping

I have a large dataset of the form:

          period_id  gic_subindustry_id  operating_mgn_fym5  operating_mgn_fym4
    317      201509            25101010           13.348150           11.745965
    682      201509            20101010           10.228725           10.473917
    903      201509            20101010                 NaN           17.700966
    1057     201509            50101010           27.858305           28.378040
    1222     201509            25502020           15.598956           11.658813
    2195     201508            25502020           27.688324           22.969760
    2439     201508            45202020                 NaN           27.145216
    2946     201508            45102020           17.956425           18.327724
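For reference, the sample above can be reconstructed as a small DataFrame (values copied from the table; the index labels are the row numbers shown, and only two of the 10+ value columns are included):

```python
import numpy as np
import pandas as pd

# Minimal reproduction of the sample data shown above.
df = pd.DataFrame(
    {
        "period_id": [201509, 201509, 201509, 201509, 201509, 201508, 201508, 201508],
        "gic_subindustry_id": [25101010, 20101010, 20101010, 50101010,
                               25502020, 25502020, 45202020, 45102020],
        "operating_mgn_fym5": [13.348150, 10.228725, np.nan, 27.858305,
                               15.598956, 27.688324, np.nan, 17.956425],
        "operating_mgn_fym4": [11.745965, 10.473917, 17.700966, 28.378040,
                               11.658813, 22.969760, 27.145216, 18.327724],
    },
    index=[317, 682, 903, 1057, 1222, 2195, 2439, 2946],
)
print(df["operating_mgn_fym5"].isna().sum())  # 2 NaNs to fill
```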

In practice, I have thousands of values for each year going back 25 years, and multiple (10+) columns.

I am trying to replace the NaN values with the gic_subindustry_id median/mean value for that time period.

I tried something along the lines of

df.fillna(df.groupby(['period_id', 'gic_subindustry_id']).transform('mean')), but this seemed painfully slow (I stopped it after several minutes).

It occurred to me that it might be slow because the mean was being re-calculated for every NaN encountered. To get around this, I thought that calculating the mean once for each (period_id, gic_subindustry_id) group, and then replacing/mapping each NaN using those values, might be substantially faster.

means = df.groupby(['period_id', 'gic_subindustry_id']).mean()

Output:

                             operating_mgn_fym5  operating_mgn_fym4 operating_mgn_fym3 operating_mgn_fym2   
period_id gic_subindustry_id                                             
201509    45202030            1.622685  0.754661   0.755324  321.295665  
          45203010            1.447686  0.226571   0.334280   12.564398  
          45203015            0.733524  0.257581   0.345450   27.659407  
          45203020            1.322349  0.655481   0.468740   19.823722  
          45203030            1.461916  1.181407   1.487330   16.598534  
          45301010            2.074954  0.981030   0.841125   29.423161  
          45301020            2.621158  1.235087   1.550252   82.717147  

And indeed, this is much faster (30-60 seconds).

However, I am struggling to figure out how to map the NaNs to these means. And, indeed, is this the 'correct' way of performing this mapping? Speed actually isn't of paramount importance, but < 60 seconds would be nice.

You can use fillna with the result of the group-by, provided the dataframes have the same structure (which as_index=False gives, keeping the group keys as columns):

df.fillna(df.groupby(['period_id', 'gic_subindustry_id'], as_index=False).mean())

#In [60]: df
#Out[60]: 
#   period_id  gic_subindustry_id  operating_mgn_fym5  operating_mgn_fym4
#0     201508            25502020           27.688324           22.969760
#1     201508            45102020           17.956425           18.327724
#2     201508            45202020                 NaN           27.145216
#3     201509            20101010           10.228725           14.087442
#4     201509            25101010           13.348150           11.745965
#5     201509            25502020           15.598956           11.658813
#6     201509            50101010           27.858305           28.378040
#7     201508            45102020           17.956425           18.327724
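As a side note, a common idiom that avoids building a separate means table at all is groupby().transform('mean'), restricted to just the column(s) that need filling so the key columns aren't transformed too (a sketch with made-up numbers):

```python
import numpy as np
import pandas as pd

# Toy data (values invented for illustration).
df = pd.DataFrame({
    "period_id": [201508, 201508, 201509, 201509],
    "gic_subindustry_id": [45202020, 45202020, 20101010, 20101010],
    "operating_mgn_fym5": [20.0, np.nan, 10.0, 30.0],
})

keys = ["period_id", "gic_subindustry_id"]

# transform('mean') computes each group's mean once and broadcasts it
# back to every row of that group, keeping the original index.
group_means = df.groupby(keys)["operating_mgn_fym5"].transform("mean")
df["operating_mgn_fym5"] = df["operating_mgn_fym5"].fillna(group_means)
print(df["operating_mgn_fym5"].tolist())  # [20.0, 20.0, 10.0, 30.0]
```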
