Pandas groupby based on values in multiple columns

Question

I have a dataframe that I am trying to use pandas.groupby on to get a cumulative sum. The values that I am grouping by appear in two different columns, and I am having trouble getting the groupby to work correctly. My starting dataframe is:

df = pd.DataFrame({'col_A': ['red', 'red', 'blue', 'red'], 'col_B': ['blue', 'red', 'blue', 'red'], 'col_A_qty': [1, 1, 1, 1], 'col_B_qty': [1, 1, 1, 1]})

col_A   col_B   col_A_qty   col_B_qty
red     blue    1           1
red     red     1           1
blue    blue    1           1
red     red     1           1

The result I am trying to get is:

col_A   col_B   col_A_qty   col_B_qty   red_cumsum   blue_cumsum
red     blue    1           1           1            1
red     red     1           1           3            1
blue    blue    1           1           3            3
red     red     1           1           5            3

I've tried:

df.groupby(['col_A', 'col_B'])['col_A_qty'].cumsum()

but this groups on the combination of col_A and col_B. How can I use pandas.groupby to calculate the cumulative sum of red and blue, regardless of whether the value is in col_A or col_B?

Answer 1

Try two pivots, one per column, and add their cumulative sums:

out = pd.pivot(df, columns='col_A', values='col_A_qty').fillna(0).cumsum().add(
    pd.pivot(df, columns='col_B', values='col_B_qty').fillna(0).cumsum(), fill_value=0)
Out[404]: 
col_A  blue  red
0       1.0  1.0
1       1.0  3.0
2       3.0  3.0
3       3.0  5.0
df = df.join(out)
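
The one-liner above can be hard to follow, and the joined columns end up named blue and red rather than red_cumsum and blue_cumsum. Here is a step-by-step sketch of the same two-pivot idea. The add_suffix and astype(int) calls are my additions to match the column names and integer values in the question, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col_A': ['red', 'red', 'blue', 'red'],
    'col_B': ['blue', 'red', 'blue', 'red'],
    'col_A_qty': [1, 1, 1, 1],
    'col_B_qty': [1, 1, 1, 1],
})

# One column per colour; each row holds that row's quantity, 0 where the colour is absent.
a_running = pd.pivot(df, columns='col_A', values='col_A_qty').fillna(0).cumsum()
b_running = pd.pivot(df, columns='col_B', values='col_B_qty').fillna(0).cumsum()

# Add the two running totals; fill_value=0 covers a colour that occurs in only one column.
out = a_running.add(b_running, fill_value=0)

# Assumed cosmetic step: integer values and the column names used in the question.
out = out.astype(int).add_suffix('_cumsum')

result = df.join(out)
print(result)
```

Because the pivots carry the quantity columns as values, this version sums col_A_qty and col_B_qty rather than just counting rows.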

Answer 2

A simple method is to define each cumsum column as the sum of two Series.cumsum results, as follows:

df['red_cumsum'] = df['col_A'].eq('red').cumsum() + df['col_B'].eq('red').cumsum()
df['blue_cumsum'] = df['col_A'].eq('blue').cumsum() + df['col_B'].eq('blue').cumsum()

In each of col_A and col_B, check which values equal 'red' / 'blue' (the results are boolean series). Then apply Series.cumsum to these boolean series to get the cumulative counts. You don't really need pandas.groupby in this use case.
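
Note that the cumulative sum of a boolean series counts occurrences, so it matches the desired output here only because every quantity is 1. If the quantities can differ, something like the sketch below may be closer to what is wanted. The data and the helper name qty_cumsum are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical data: same colours as the question, but with varying quantities.
df = pd.DataFrame({
    'col_A': ['red', 'red', 'blue', 'red'],
    'col_B': ['blue', 'red', 'blue', 'red'],
    'col_A_qty': [2, 1, 3, 1],
    'col_B_qty': [1, 4, 1, 2],
})

def qty_cumsum(frame, item):
    """Running total of the quantities attached to `item` in either column (hypothetical helper)."""
    # Keep the quantity where the colour matches, otherwise contribute 0.
    from_a = frame['col_A_qty'].where(frame['col_A'].eq(item), 0)
    from_b = frame['col_B_qty'].where(frame['col_B'].eq(item), 0)
    return (from_a + from_b).cumsum()

df['red_cumsum'] = qty_cumsum(df, 'red')
df['blue_cumsum'] = qty_cumsum(df, 'blue')
print(df)
```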

If you have many distinct items in col_A and col_B, you can also iterate through the list of unique items, as follows:

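import numpy as np

# np.unique flattens both columns and returns the distinct values in sorted order
for item in np.unique(df[['col_A', 'col_B']]):
    df[f'{item}_cumsum'] = df['col_A'].eq(item).cumsum() + df['col_B'].eq(item).cumsum()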

Result:

print(df)

  col_A col_B  col_A_qty  col_B_qty  red_cumsum  blue_cumsum
0   red  blue          1          1           1            1
1   red   red          1          1           3            1
2  blue  blue          1          1           3            3
3   red   red          1          1           5            3
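
If a groupby-based solution is still preferred, as the question asks, one possible sketch is to stack the two colour/quantity pairs into long form, sum per row and colour, and then take the cumulative sum. The intermediate names long, row, color and qty are my own choices for illustration, not something from either answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col_A': ['red', 'red', 'blue', 'red'],
    'col_B': ['blue', 'red', 'blue', 'red'],
    'col_A_qty': [1, 1, 1, 1],
    'col_B_qty': [1, 1, 1, 1],
})

# Stack (col_A, col_A_qty) and (col_B, col_B_qty) under common column names,
# keeping the original row label in a 'row' column.
long = pd.concat([
    df[['col_A', 'col_A_qty']].set_axis(['color', 'qty'], axis=1),
    df[['col_B', 'col_B_qty']].set_axis(['color', 'qty'], axis=1),
]).rename_axis('row').reset_index()

# Total quantity per original row and colour, one column per colour.
per_row = long.groupby(['row', 'color'])['qty'].sum().unstack(fill_value=0)

# Running total down the rows, renamed to match the question's column names.
out = per_row.cumsum().add_suffix('_cumsum')

result = df.join(out)
print(result)
```

This groups on the colour alone, so it does not matter which of the two columns a value came from.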
