
Pandas cumulative sum of partial elements with groupby

Apologies if this question has already been asked, and thank you in advance for your help.

In this "unpivoted" dataset, there are Orders composed of several Lots. Each Lot has a given Point value, as below:

CustID     Date         OrderNum   LotNum   PtsPerLot
A123       1/1/2015     1234       A        2            
A123       1/1/2015     1234       B        10
A123       1/1/2015     5678       A        7

My objective is to create a CUMULATIVE_POINTS_PER_YEAR column (CumPtsPerYear) representing the cumulative sum of POINTS_PER_ORDER (PtsPerOrder), which is itself the sum of PtsPerLot within an order, repeated at each Lot level. So, for a given lot, CumPtsPerYear would show the running total of PtsPerOrder for an account in a given year, up to and including that lot's order.

CustID     Date         OrderNum   LotNum   PtsPerLot    *PtsPerOrder*    *CumPtsPerYear*
A123       1/1/2015     1234       A        2            12              12
A123       1/1/2015     1234       B        10           12              12
A123       1/1/2015     5678       A        7            7               19

Any ideas? I've tried groupby.cumsum on PtsPerLot and another groupby.cumsum on PtsPerOrder, but neither produces what I need.

First, calculate PtsPerOrder. Use transform to broadcast the result of the per-group calculation back along the index of your dataframe:

df['PtsPerOrder'] = df.groupby('OrderNum')['PtsPerLot'].transform('sum')

Then take the first element of that new column in each group. Because assignment aligns on the index, every other row of the order is left as NaN:

df['CumPtsPerYear'] = df.groupby('OrderNum')['PtsPerOrder'].head(1)

df
Out[27]: 
  CustID      Date  OrderNum LotNum  PtsPerLot  PtsPerOrder  CumPtsPerYear
0   A123  1/1/2015      1234      A          2           12           12.0
1   A123  1/1/2015      1234      B         10           12            NaN
2   A123  1/1/2015      5678      A          7            7            7.0
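
The NaN in row 1 comes from index alignment rather than from head itself. A minimal sketch of that behaviour, using a cut-down copy of the frame above (the variable name first_rows is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "OrderNum": [1234, 1234, 5678],
    "PtsPerOrder": [12, 12, 7],
})

# head(1) keeps only the first row of each group, with its original index label
first_rows = df.groupby("OrderNum")["PtsPerOrder"].head(1)
print(first_rows.index.tolist())            # [0, 2]

# Assignment aligns on the index, so the missing label 1 becomes NaN
df["CumPtsPerYear"] = first_rows
print(df["CumPtsPerYear"].isna().tolist())  # [False, True, False]
```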

Finish by taking the cumulative sum you are after. cumsum skips the NaN values (they stay NaN in its result), and a forward fill then completes the column. Assign the result back to df['CumPtsPerYear'] if you want to keep it:

df['CumPtsPerYear'].cumsum().ffill()

0    12.0
1    12.0
2    19.0
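
The steps above group only by OrderNum, while the question asks for a total per account per year. One possible extension, as a sketch under some assumptions: the extra rows (order 9012, customer B456) are made up for illustration, the Year column is derived here and is not in the original data, rows are assumed to be in chronological order, and the lots of one order are assumed to sit on adjacent rows.

```python
import pandas as pd

# Sample data from the question plus two hypothetical rows
df = pd.DataFrame({
    "CustID": ["A123", "A123", "A123", "A123", "B456"],
    "Date": ["1/1/2015", "1/1/2015", "1/1/2015", "3/1/2016", "2/1/2015"],
    "OrderNum": [1234, 1234, 5678, 9012, 3456],
    "LotNum": ["A", "B", "A", "A", "A"],
    "PtsPerLot": [2, 10, 7, 5, 4],
})
df["Year"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.year
keys = ["CustID", "Year"]

# Order total, repeated on every lot of the order
df["PtsPerOrder"] = df.groupby("OrderNum")["PtsPerLot"].transform("sum")

# Keep the order total on the first lot only; other lots become NaN
df["CumPtsPerYear"] = df.groupby("OrderNum")["PtsPerOrder"].head(1)

# Running total within each customer/year, then fill the remaining lots
df["CumPtsPerYear"] = df.groupby(keys)["CumPtsPerYear"].cumsum()
df["CumPtsPerYear"] = df.groupby(keys)["CumPtsPerYear"].ffill()

print(df["CumPtsPerYear"].tolist())  # [12.0, 12.0, 19.0, 5.0, 4.0]
```

The running total restarts for A123 in 2016 and for B456, because both the cumsum and the forward fill are done per (CustID, Year) group.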

First you need to use a transformation:

df['*PtsPerOrder*'] = df.groupby('OrderNum')['PtsPerLot'].transform('sum')

Then, to create the other column, I didn't find any other way than to take the max of each group, do a cumsum on that, and merge the result back in:

weird_cumsum = df.groupby('OrderNum')['*PtsPerOrder*'].max().cumsum().to_frame()
weird_cumsum.columns = ['*CumPtsPerYear*']
weird_cumsum

          *CumPtsPerYear*
OrderNum                 
1234                   12
5678                   19

df.merge(weird_cumsum, left_on='OrderNum', right_index=True, how='left')

Result is as expected:

  CustID       Date  OrderNum LotNum  PtsPerLot  *PtsPerOrder*  *CumPtsPerYear* 
0   A123 2015-01-01      1234      A          2             12             12  
1   A123 2015-01-01      1234      B         10             12             12   
2   A123 2015-01-01      5678      A          7              7             19   
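
For reference, here is the same merge idea as one self-contained sketch. It uses plain column names, and it passes sort=False, which is my addition and not part of the snippet above: by default groupby sorts by OrderNum, so the running total would follow order-number order rather than the order in which rows appear.

```python
import pandas as pd

df = pd.DataFrame({
    "CustID": ["A123", "A123", "A123"],
    "Date": ["1/1/2015", "1/1/2015", "1/1/2015"],
    "OrderNum": [1234, 1234, 5678],
    "LotNum": ["A", "B", "A"],
    "PtsPerLot": [2, 10, 7],
})

df["PtsPerOrder"] = df.groupby("OrderNum")["PtsPerLot"].transform("sum")

# One total per order; sort=False keeps orders in order of first appearance
per_order = df.groupby("OrderNum", sort=False)["PtsPerOrder"].max()
running = per_order.cumsum().rename("CumPtsPerYear")

# merge returns a new frame, so the result has to be assigned to be kept
result = df.merge(running, left_on="OrderNum", right_index=True, how="left")
print(result["CumPtsPerYear"].tolist())  # [12, 12, 19]
```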

To get to the first part of your question, PtsPerOrder, you need a transformation. sum is an aggregation, so use .transform:

In [10]: df
Out[10]:
            Date  OrderNum LotNum  PtsPerLot
CustID
A123    1/1/2015      1234      A          2
A123    1/1/2015      1234      B         10
A123    1/1/2015      5678      A          7

In [11]: df.groupby('OrderNum')['PtsPerLot'].transform('sum')
Out[11]:
CustID
A123    12
A123    12
A123     7
dtype: int64

And use that to create a new column...

In [13]: df['PtsPerOrder'] = df.groupby('OrderNum')['PtsPerLot'].transform('sum')

In [14]: df
Out[14]:
            Date  OrderNum LotNum  PtsPerLot  PtsPerOrder
CustID
A123    1/1/2015      1234      A          2           12
A123    1/1/2015      1234      B         10           12
A123    1/1/2015      5678      A          7            7
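
If the aggregation/transformation distinction is unclear, a small side-by-side sketch may help. It uses a default integer index instead of CustID, and the names per_group and per_row are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "OrderNum": [1234, 1234, 5678],
    "PtsPerLot": [2, 10, 7],
})

grouped = df.groupby("OrderNum")["PtsPerLot"]

# Aggregation: one value per group, indexed by the group key
per_group = grouped.sum()
print(per_group.to_dict())   # {1234: 12, 5678: 7}

# Transformation: one value per original row, indexed like df
per_row = grouped.transform("sum")
print(per_row.tolist())      # [12, 12, 7]
```

Only the transformed result has the same length and index as df, which is why it can be assigned directly as a new column.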

I'm still not grokking your specification for CumPtsPerYear ...
