
Pandas Groupby Based on Values in Multiple Columns

I have a DataFrame on which I am trying to use pandas.groupby to get a cumulative sum. The values I am grouping by appear in two different columns, and I am having trouble getting the groupby to work correctly. My starting DataFrame is:

import pandas as pd

df = pd.DataFrame({'col_A': ['red', 'red', 'blue', 'red'], 'col_B': ['blue', 'red', 'blue', 'red'], 'col_A_qty': [1, 1, 1, 1], 'col_B_qty': [1, 1, 1, 1]})

col_A   col_B   col_A_qty   col_B_qty
red      blue      1           1
red      red       1           1
blue    blue       1           1
red      red       1           1

The result I am trying to get is:

col_A   col_B   col_A_qty   col_B_qty   red_cumsum  blue_cumsum
red     blue       1            1           1           1
red     red        1            1           3           1
blue    blue       1            1           3           3
red     red        1            1           5           3

I've tried:

df.groupby(['col_A', 'col_B'])['col_A_qty'].cumsum()

but this groups on the combination of col_A and col_B. How can I use pandas.groupby to calculate the cumulative sum of red and blue, regardless of whether the value appears in col_A or col_B?

Try two pivots:

out = pd.pivot(df,columns='col_A',values='col_A_qty').fillna(0).cumsum().add(pd.pivot(df,columns='col_B',values='col_B_qty').fillna(0).cumsum(),fill_value=0)
Out[404]: 
col_A  blue  red
0       1.0  1.0
1       1.0  3.0
2       3.0  3.0
3       3.0  5.0
df = df.join(out)
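
For anyone unpacking the one-liner above, here is a minimal step-by-step sketch of the same two-pivot idea. The intermediate names (a_wide, b_wide, totals, result) and the final astype(int) / add_suffix step are illustrative additions, not part of the original answer; they are only there so the joined columns come out named red_cumsum and blue_cumsum, as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'col_A': ['red', 'red', 'blue', 'red'],
    'col_B': ['blue', 'red', 'blue', 'red'],
    'col_A_qty': [1, 1, 1, 1],
    'col_B_qty': [1, 1, 1, 1],
})

# One column per colour found in col_A, holding col_A_qty on the rows
# where that colour occurs and 0 everywhere else (same for col_B).
a_wide = df.pivot(columns='col_A', values='col_A_qty').fillna(0)
b_wide = df.pivot(columns='col_B', values='col_B_qty').fillna(0)

# Running total per colour in each frame, then add the two frames.
totals = a_wide.cumsum().add(b_wide.cumsum(), fill_value=0)

# Illustrative clean-up (an assumption about the desired output format):
# integer dtype and the column names used in the question.
totals = totals.astype(int).add_suffix('_cumsum')

result = df.join(totals)
print(result)
```

Keeping the two pivots separate makes it easier to inspect each intermediate frame if the totals look wrong.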

A simple method is to define each cumsum column as the sum of two Series.cumsum results, as follows:

df['red_cumsum'] = df['col_A'].eq('red').cumsum() + df['col_B'].eq('red').cumsum()
df['blue_cumsum'] = df['col_A'].eq('blue').cumsum() + df['col_B'].eq('blue').cumsum()

In each of col_A and col_B, check for values equal to 'red' / 'blue' (the results are boolean Series). Then use Series.cumsum on those boolean Series to get the cumulative counts. Note that this counts occurrences rather than summing col_A_qty and col_B_qty; the two agree here because every quantity is 1. You don't really need pandas.groupby for this use case.
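
If the quantities were not all 1, a variant along these lines might be closer to what the question asks for. It is only a sketch: the helper name weighted_cumsum and the sample quantities are hypothetical and do not come from the question or the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'col_A': ['red', 'red', 'blue', 'red'],
    'col_B': ['blue', 'red', 'blue', 'red'],
    'col_A_qty': [1, 2, 3, 4],       # hypothetical quantities for illustration
    'col_B_qty': [10, 20, 30, 40],   # hypothetical quantities for illustration
})

def weighted_cumsum(frame, item):
    """Running total of the quantities attached to `item` in either column."""
    # Keep the quantity where the colour matches, otherwise contribute 0.
    a = frame['col_A_qty'].where(frame['col_A'].eq(item), 0)
    b = frame['col_B_qty'].where(frame['col_B'].eq(item), 0)
    return (a + b).cumsum()

red = weighted_cumsum(df, 'red')
blue = weighted_cumsum(df, 'blue')
print(red.tolist())   # [1, 23, 23, 67]
print(blue.tolist())  # [10, 10, 43, 43]
```

With all quantities equal to 1, this gives the same numbers as the boolean cumsum above.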

If you have multiple distinct items in col_A and col_B, you can also iterate over the list of unique items, as follows:

import numpy as np

# np.unique flattens both columns and returns the sorted distinct values
for item in np.unique(df[['col_A', 'col_B']]):
    df[f'{item}_cumsum'] = df['col_A'].eq(item).cumsum() + df['col_B'].eq(item).cumsum()

Result:

print(df)

  col_A col_B  col_A_qty  col_B_qty  red_cumsum  blue_cumsum
0   red  blue          1          1           1            1
1   red   red          1          1           3            1
2  blue  blue          1          1           3            3
3   red   red          1          1           5            3
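
Since the question asks specifically about pandas.groupby, here is one possible groupby-based sketch. It is a different technique from either answer above: it reshapes the two (colour, quantity) column pairs into a single long table first. The names pairs, long, per_row, cumsums, result, 'row', 'color' and 'qty' are all illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'col_A': ['red', 'red', 'blue', 'red'],
    'col_B': ['blue', 'red', 'blue', 'red'],
    'col_A_qty': [1, 1, 1, 1],
    'col_B_qty': [1, 1, 1, 1],
})

# Stack the (colour, quantity) pairs from both column pairs into one long
# table, keeping the original row label in a 'row' column.
pairs = [('col_A', 'col_A_qty'), ('col_B', 'col_B_qty')]
long = pd.concat(
    [df[[c, q]].rename(columns={c: 'color', q: 'qty'}) for c, q in pairs]
).rename_axis('row').reset_index()

# Total quantity per original row and colour, one column per colour.
per_row = long.groupby(['row', 'color'])['qty'].sum().unstack(fill_value=0)

# Running total down the rows, named as in the question.
cumsums = per_row.cumsum().add_suffix('_cumsum')
result = df.join(cumsums)
print(result)
```

This sums the quantities rather than counting occurrences, and it handles any number of distinct colours without a loop.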
