繁体   English   中英

在groupby之后填充组中的缺失行

[英]Filling missing rows in groups after groupby

我有一些SQL数据,我正在分组并执行一些聚合。 它工作得很好:

grouped = df.groupby(['a', 'b'])
agged = grouped.aggregate({
    c: [numpy.sum, numpy.mean, numpy.size],
    d: [numpy.sum, numpy.mean, numpy.size]
})

         c                         d 
         sum      mean   size      sum          mean size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 23  126.0  0.792453  159.0  8795000  55314.465409  159

但是我希望用0填充a=25但不在a=26所有行。 换句话说,像:

         c                         d 
         sum      mean   size      sum          mean size
a  b
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 20      0         0      0        0             0    0
   21      0         0      0        0             0    0
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24      0         0      0        0             0    0
   25      0         0      0        0             0    0

我怎样才能做到这一点?

考虑数据帧df

df = pd.DataFrame(
    np.random.randint(10, size=(6, 6)),
    pd.MultiIndex.from_tuples(
        [(25, 20), (25, 21), (25, 23), (25, 24), (25, 25), (26, 23)],
        names=['a', 'b']
    ),
    pd.MultiIndex.from_product(
        [['c', 'd'], ['sum', 'mean', 'size']]
    )
)

        c             d          
      sum mean size sum mean size
a  b                             
25 20   8    3    5   5    0    2
   21   3    7    8   9    2    7
   23   2    1    3   2    5    4
   24   9    0    1   7    1    6
   25   1    9    3   5    8    8
26 23   8    8    4   8    0    5

您可以使用unstack(fill_value=0)然后stack从笛卡尔积中快速恢复所有缺失的行

df.unstack(fill_value=0).stack()

         c             d         
      mean size sum mean size sum
a  b                             
25 20    3    5   8    0    2   5
   21    7    8   3    2    7   9
   23    1    3   2    5    4   2
   24    0    1   9    1    6   7
   25    9    3   1    8    8   5
26 20    0    0   0    0    0   0
   21    0    0   0    0    0   0
   23    8    4   8    0    5   8
   24    0    0   0    0    0   0
   25    0    0   0    0    0   0

注意:使用fill_value=0保留了dtype int 如果没有它,当取消堆叠时,间隙会被NaN填充,而dtypes会转换为float

打印(DF)

           c                         d                   
         sum      mean   size      sum          mean size
a  b                                                     
25 20  107.0  0.804511  133.0  5328000  40060.150376  133
   21  110.0  0.774648  142.0  6031000  42471.830986  142
   23  126.0  0.792453  159.0  8795000  55314.465409  159
   24   72.0  0.947368   76.0  2920000  38421.052632   76
   25   54.0  0.818182   66.0  2570000  38939.393939   66
26 23  126.0  0.792453  159.0  8795000  55314.465409  159

我喜欢:

 df =  df.unstack().replace(np.nan,0).stack(-1)
 print(df)
                  c                           d                  
               mean   size    sum          mean   size        sum
    a  b                                                         
    25 20  0.804511  133.0  107.0  40060.150376  133.0  5328000.0
       21  0.774648  142.0  110.0  42471.830986  142.0  6031000.0
       23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
       24  0.947368   76.0   72.0  38421.052632   76.0  2920000.0
       25  0.818182   66.0   54.0  38939.393939   66.0  2570000.0
    26 20  0.000000    0.0    0.0      0.000000    0.0        0.0
       21  0.000000    0.0    0.0      0.000000    0.0        0.0
       23  0.792453  159.0  126.0  55314.465409  159.0  8795000.0
       24  0.000000    0.0    0.0      0.000000    0.0        0.0
       25  0.000000    0.0    0.0      0.000000    0.0        0.0

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM