Pandas cumulative sum of partial elements with groupby
Apologies if this question has already been asked, and thank you in advance for your help.
In this "unpivoted" dataset, there are Orders composed of several Lots. Each Lot has a given Point value, as below:
CustID Date OrderNum LotNum PtsPerLot
A123 1/1/2015 1234 A 2
A123 1/1/2015 1234 B 10
A123 1/1/2015 5678 A 7
My objective is to create a CUMULATIVE_POINTS_PER_YEAR column (CumPtsPerYear in the table below) representing the cumulative sum of POINTS_PER_ORDER (PtsPerOrder below), which is itself the sum of PtsPerLot over all Lots in an Order, repeated at each Lot level. So, for a given lot, CumPtsPerYear would show the cumulative total of all POINTS_PER_ORDER for an account in a given year.
CustID Date OrderNum LotNum PtsPerLot *PtsPerOrder* *CumPtsPerYear*
A123 1/1/2015 1234 A 2 12 12
A123 1/1/2015 1234 B 10 12 12
A123 1/1/2015 5678 A 7 7 19
Any ideas? I've tried groupby.cumsum on PtsPerLot and another groupby.cumsum on PtsPerOrder, but neither produces what I need.
First, calculate PtsPerOrder. Use transform to broadcast the result of each group's calculation back along the original index of your dataframe:
df['PtsPerOrder'] = df.groupby('OrderNum')['PtsPerLot'].transform('sum')
Then take the first element of that new column in each group:
df['CumPtsPerYear'] = df.groupby('OrderNum')['PtsPerOrder'].head(1)
df
Out[27]:
CustID Date OrderNum LotNum PtsPerLot PtsPerOrder CumPtsPerYear
0 A123 1/1/2015 1234 A 2 12 12.0
1 A123 1/1/2015 1234 B 10 12 NaN
2 A123 1/1/2015 5678 A 7 7 7.0
Finish by taking the cumulative sum you are looking for. It skips the NA values, so you then complete the column with a forward fill:
df['CumPtsPerYear'].cumsum().ffill()
0 12.0
1 12.0
2 19.0
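Putting the three steps above together, a minimal sketch on the question's sample data might look like this. It assumes that the lots of one order sit on consecutive rows (otherwise the forward fill would copy the wrong total), and the intermediate name `first_per_order` is only illustrative. Like the answer, it does not yet split the running total by customer or year.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "CustID": ["A123", "A123", "A123"],
    "Date": ["1/1/2015", "1/1/2015", "1/1/2015"],
    "OrderNum": [1234, 1234, 5678],
    "LotNum": ["A", "B", "A"],
    "PtsPerLot": [2, 10, 7],
})

# Step 1: order total, repeated on every lot of the order.
df["PtsPerOrder"] = df.groupby("OrderNum")["PtsPerLot"].transform("sum")

# Step 2: keep the total only on the first lot of each order (NaN elsewhere).
first_per_order = df.groupby("OrderNum")["PtsPerOrder"].head(1).reindex(df.index)

# Step 3: cumsum skips the NaN rows; ffill copies the running total down.
# Assumption: rows of the same order are contiguous.
df["CumPtsPerYear"] = first_per_order.cumsum().ffill().astype(int)

print(df)
```

Assigning the result back to the column, rather than only displaying it, keeps the dataframe in the shape the question asks for.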
First you need to use a transformation:
df['*PtsPerOrder*'] = df.groupby('OrderNum')['PtsPerLot'].transform('sum')
Then, to create the other column, I didn't find a way other than to take the max of each group, do a cumsum on that, and merge it back in:
weird_cumsum = df.groupby('OrderNum')['*PtsPerOrder*'].max().cumsum().to_frame()
weird_cumsum.columns = ['*CumPtsPerYear*']
weird_cumsum
*CumPtsPerYear*
OrderNum
1234 12
5678 19
df.merge(weird_cumsum, left_on='OrderNum', right_index=True, how='left')
The result is as expected:
CustID Date OrderNum LotNum PtsPerLot *PtsPerOrder* *CumPtsPerYear*
0 A123 2015-01-01 1234 A 2 12 12
1 A123 2015-01-01 1234 B 10 12 12
2 A123 2015-01-01 5678 A 7 7 19
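As a variation on the merge above, the per-order totals can be looked up with `Series.map` instead of merged. This is a sketch of a swapped-in technique, not the answer's own method: it sums `PtsPerLot` per order directly rather than taking the max of the transformed column, and the names `per_order` and `cum_by_order` are illustrative. Orders are accumulated in sorted `OrderNum` order, which is the default `groupby` behaviour.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "CustID": ["A123", "A123", "A123"],
    "Date": ["1/1/2015", "1/1/2015", "1/1/2015"],
    "OrderNum": [1234, 1234, 5678],
    "LotNum": ["A", "B", "A"],
    "PtsPerLot": [2, 10, 7],
})

# One value per order, indexed by OrderNum (sorted by key).
per_order = df.groupby("OrderNum")["PtsPerLot"].sum()

# Running total across orders.
cum_by_order = per_order.cumsum()

# Look each row's OrderNum up in the two per-order Series.
df["PtsPerOrder"] = df["OrderNum"].map(per_order)
df["CumPtsPerYear"] = df["OrderNum"].map(cum_by_order)

print(df)
```

Because the lookup is keyed on `OrderNum`, this version does not depend on the lots of an order being on adjacent rows.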
To get to the first part of your question, PtsPerOrder, you need a transformation. sum is an aggregation, which returns one value per group, so use .transform to broadcast that value back to every row:
In [10]: df
Out[10]:
Date OrderNum LotNum PtsPerLot
CustID
A123 1/1/2015 1234 A 2
A123 1/1/2015 1234 B 10
A123 1/1/2015 5678 A 7
In [11]: df.groupby('OrderNum')['PtsPerLot'].transform('sum')
Out[11]:
CustID
A123 12
A123 12
A123 7
dtype: int64
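To make the difference between the aggregation and the transformation concrete, here is a small sketch that computes both on the same frame, with CustID as the index as in the session above. The variable names `aggregated` and `transformed` are illustrative.

```python
import pandas as pd

# Same data as above, with CustID as the index.
df = pd.DataFrame(
    {
        "Date": ["1/1/2015", "1/1/2015", "1/1/2015"],
        "OrderNum": [1234, 1234, 5678],
        "LotNum": ["A", "B", "A"],
        "PtsPerLot": [2, 10, 7],
    },
    index=pd.Index(["A123", "A123", "A123"], name="CustID"),
)

grouped = df.groupby("OrderNum")["PtsPerLot"]

# Aggregation: one row per group, indexed by OrderNum.
aggregated = grouped.sum()

# Transformation: one row per original row, same index as df.
transformed = grouped.transform("sum")

print(aggregated)
print(transformed)
```

Only the transformed result has the same length and index as `df`, which is why it can be assigned directly as a new column.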
And use that to create a new column...
In [13]: df['PtsPerOrder'] = df.groupby('OrderNum')['PtsPerLot'].transform('sum')
In [14]: df
Out[14]:
Date OrderNum LotNum PtsPerLot PtsPerOrder
CustID
A123 1/1/2015 1234 A 2 12
A123 1/1/2015 1234 B 10 12
A123 1/1/2015 5678 A 7 7
I'm still not grokking your specification for CumPtsPerYear ...
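One possible reading of the specification, offered only as a sketch: count each order's total once, on its first lot, and accumulate per customer and per calendar year. The last two rows of the sample (customer B456 and the 2016 order) are hypothetical additions to show the grouping, the `Year` column and `first_lot` mask are illustrative names, and the dates are assumed to be month/day/year.

```python
import pandas as pd

# First three rows are from the question; the last two are hypothetical.
df = pd.DataFrame({
    "CustID": ["A123", "A123", "A123", "B456", "A123"],
    "Date": ["1/1/2015", "1/1/2015", "1/1/2015", "3/5/2015", "2/1/2016"],
    "OrderNum": [1234, 1234, 5678, 9012, 3456],
    "LotNum": ["A", "B", "A", "A", "A"],
    "PtsPerLot": [2, 10, 7, 4, 5],
})

# Assumption: dates are month/day/year.
df["Year"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.year

keys = ["CustID", "Year", "OrderNum"]
df["PtsPerOrder"] = df.groupby(keys)["PtsPerLot"].transform("sum")

# True on the first lot of each order, False on the remaining lots.
first_lot = ~df.duplicated(keys)

# Count each order once (zero on its other lots), then accumulate
# within each customer and year in row order.
df["CumPtsPerYear"] = (
    df["PtsPerOrder"]
    .where(first_lot, 0)
    .groupby([df["CustID"], df["Year"]])
    .cumsum()
)

print(df)
```

The running total follows the row order of the frame, so sorting by date (or whatever defines "earlier" for an order) beforehand would be needed if the rows are not already in that order.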