Pandas GroupBy: how to keep rows up to a percentage of the cumulative sum?

Question

I have an unordered data frame:

df
     A   B  Moves
0   E1  E2     10
1   E1  E3     20
2   E1  E4     15
3   E2  E1      9
4   E2  E3      8
5   E2  E4      7
6   E3  E1     30
7   E3  E2     32
8   E3  E4     40
9   E4  E1      5
10  E4  E2     20
11  E4  E3      3
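
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame printed above. The values are copied from the printout; only the construction style is my own choice.

```python
import pandas as pd

# Rebuild the example frame shown above (values copied from the printout).
df = pd.DataFrame({
    "A": ["E1"] * 3 + ["E2"] * 3 + ["E3"] * 3 + ["E4"] * 3,
    "B": ["E2", "E3", "E4", "E1", "E3", "E4", "E1", "E2", "E4", "E1", "E2", "E3"],
    "Moves": [10, 20, 15, 9, 8, 7, 30, 32, 40, 5, 20, 3],
})
print(df)
```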

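I want to return rows of B, largest first, until their cumulative sum of Moves reaches a given minimum percentage of the total Moves for each group in A.

Once the percentage threshold is reached, I stop taking rows. The process must be "greedy": if a row takes the cumulative sum past the desired percentage, that row is still included.

If the minimum percentage of the total is 50%, I would first like to return:

Desired output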

     A   B  Moves
    E1  E3     20
    E1  E4     15
    E2  E1      9
    E2  E3      8
    E3  E4     40
    E3  E2     32
    E4  E2     20
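
To make the greedy rule concrete, here is a plain-Python sketch of it for a single group, independent of pandas. The helper name `greedy_prefix` is hypothetical and only serves to illustrate the rule as I read it from the description.

```python
def greedy_prefix(values, pct=50):
    """Take values largest first until `pct` percent of the total is reached."""
    ordered = sorted(values, reverse=True)
    target = sum(ordered) * pct / 100
    kept, running = [], 0
    for value in ordered:
        if running >= target:
            break  # threshold already reached by the rows kept so far
        kept.append(value)  # the row that crosses the threshold is kept too
        running += value
    return kept

print(greedy_prefix([10, 20, 15]))  # group E1 -> [20, 15]
print(greedy_prefix([5, 20, 3]))    # group E4 -> [20]
```

For E1 the target is 22.5 (half of 45): 20 alone is not enough, so 15 is taken as well, even though that overshoots.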

Then I would like to extract the row names for each group using df.groupby(...).apply(list), as in this question:

A     Most_Moved
E1      [E3, E4] 
E2      [E1, E3]
E3      [E4, E2]
E4          [E2]
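
A small sketch of that second step, assuming the desired rows are already available as a frame. The frame `selected` is typed in by hand from the desired output, and the names `selected` and `most_moved` are mine.

```python
import pandas as pd

# Desired rows, typed in by hand from the expected output above.
selected = pd.DataFrame({
    "A": ["E1", "E1", "E2", "E2", "E3", "E3", "E4"],
    "B": ["E3", "E4", "E1", "E3", "E4", "E2", "E2"],
    "Moves": [20, 15, 9, 8, 40, 32, 20],
})

# Collect the B labels of each group into a list, keeping row order.
most_moved = selected.groupby("A")["B"].apply(list).rename("Most_Moved")
print(most_moved)
```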

What I have tried

Following this question and this question, I can return the cumulative sum within each group using cumsum:

df.groupby(by=['A','B']).sum().groupby(level=[0]).cumsum()[::-1]

       Moves
A  B        
E4 E3     28
   E2     25
   E1      5
E3 E4    102
   E2     62
   E1     30
E2 E4     24
   E3     17
   E1      9
E1 E4     45
   E3     30
   E2     10

I can return the total moves (the sum) for each group separately:

df.groupby(by="A")[["Moves"]].sum()

    Moves
A        
E1     45
E2     24
E3    102
E4     28

From this question and this question, I can return each row as a percentage of its group's total:

df.groupby(by=["A"])["Moves"].apply(lambda x: 100 * x / float(x.sum()))

0     22.222222
1     44.444444
2     33.333333
3     37.500000
4     33.333333
5     29.166667
6     29.411765
7     31.372549
8     39.215686
9     17.857143
10    71.428571
11    10.714286
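
As a side note, the same per-group percentage can be sketched with `transform`, which keeps the original flat row index. Depending on the pandas version, the `apply` form above may add the group key to the index instead. The names `totals` and `pct_of_group` are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["E1"] * 3 + ["E2"] * 3 + ["E3"] * 3 + ["E4"] * 3,
    "B": ["E2", "E3", "E4", "E1", "E3", "E4", "E1", "E2", "E4", "E1", "E2", "E3"],
    "Moves": [10, 20, 15, 9, 8, 7, 30, 32, 40, 5, 20, 3],
})

# Broadcast each group's total back onto its rows, then divide row by row.
totals = df.groupby("A")["Moves"].transform("sum")
pct_of_group = 100 * df["Moves"] / totals
print(pct_of_group)
```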

What doesn't work

However, if I combine these, the percentage is evaluated against the total of all rows:

df.groupby(by=["A", "B"])["Moves"].agg(Total_Moves="sum").sort_values("Total_Moves", ascending=False).apply(lambda x: 100 * x / float(x.sum()))

       Total_Moves
A  B              
E3 E4    20.100503
   E2    16.080402
   E1    15.075377
E1 E3    10.050251
E4 E2    10.050251
E1 E4     7.537688
   E2     5.025126
E2 E1     4.522613
   E3     4.020101
   E4     3.517588
E4 E1     2.512563
   E3     1.507538

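This evaluates the percentage across the whole data frame (rather than within each group).

I just can't work out how to piece these together to get my output.

Any help is appreciated.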

Answer 1

You can use groupby.apply with a custom function:

def select(group, pct=50):
    # print(group)
    # sort this group's moves, largest first
    moves = group['Moves'].sort_values(ascending=False)
    # `cumsum` is the cumulative share (0 to 1) of the group total, in sorted order
    cumsum = moves.cumsum() / moves.sum()
    # print(cumsum)
    # rows to keep: every row still below `pct`, plus the row that reaches or crosses it
    n_keep = len(cumsum[cumsum < pct / 100]) + 1
    # print(n_keep)
    # `idx` holds the index labels of the rows to keep
    idx = moves.index[:n_keep]
    # print(idx)
    # print(group.loc[idx])
    # return a Series of Moves for the kept rows, with column `B` as index
    return group.loc[idx].set_index(['B'], drop=True)['Moves']

df.groupby('A').apply(select)

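Edit

I added some comments to the code. To make clearer what it does, I also added (commented-out) print statements for the intermediate variables. If you uncomment them, don't be surprised if the first group is printed twice: older pandas versions call the function an extra time on the first group.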
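
For comparison, here is a sketch of an alternative that avoids `apply` altogether. This is not the approach of the answer above but a different technique: sort once, then keep a row whenever the share accumulated before it is still below the threshold. It assumes all Moves are positive; the names `ordered`, `share_before` and `selected` are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["E1"] * 3 + ["E2"] * 3 + ["E3"] * 3 + ["E4"] * 3,
    "B": ["E2", "E3", "E4", "E1", "E3", "E4", "E1", "E2", "E4", "E1", "E2", "E3"],
    "Moves": [10, 20, 15, 9, 8, 7, 30, 32, 40, 5, 20, 3],
})
pct = 50

# Sort by group, then by Moves with the largest first.
ordered = df.sort_values(["A", "Moves"], ascending=[True, False])
grouped = ordered.groupby("A")["Moves"]

# Share of the group total accumulated *before* each row (0 for the first row).
share_before = (grouped.cumsum() - ordered["Moves"]) / grouped.transform("sum")

# A row is kept while the threshold had not yet been reached before it,
# so the row that crosses the threshold is included.
selected = ordered[share_before < pct / 100]
print(selected)
```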

Answer 1: accepted, 2017-11-28 10:31:03
