Pandas groupby transform cumulative with conditions
I have a large table with many product ids and iso_codes: 2 million rows in total. So the answer should, if possible, also take memory into account; I have 16 GB of memory.
I would like to see, for every (id, iso_code) combination, the number of items returned before the buy_date in the row (so, cumulatively), but there's a catch: I only want to count returns from previous sales whose return_date falls before the buy_date I'm looking at.
I've added the column items_returned as an example: this is the column that should be calculated.
The idea is as follows: at the moment of the sale, I can only count returns that have already happened, not ones that will happen in the future.
I tried a combination of df.groupby(['id', 'iso_code']).transform(np.cumsum) and .transform(lambda x: only count returns that happened before my buy_date), but couldn't figure out how to do a .groupby.transform(np.cumsum) with these special conditions applied.
A similar question applies to items bought, where I only count items cumulatively for days before my buy_date.
Hope you can help me.
Example resulting table:
+-------+------+------------+----------+------------+---------------+----------------+------------------+
| row | id | iso_code | return | buy_date | return_date | items_bought | items_returned |
|-------+------+------------+----------+------------+---------------+----------------+------------------|
| 0 | 177 | DE | 1 | 2019-05-16 | 2019-05-24 | 0 | 0 |
| 1 | 177 | DE | 1 | 2019-05-29 | 2019-06-03 | 1 | 1 |
| 2 | 177 | DE | 1 | 2019-10-27 | 2019-11-06 | 2 | 2 |
| 3 | 177 | DE | 0 | 2019-11-06 | None | 3 | 2 |
| 4 | 177 | DE | 1 | 2019-11-18 | 2019-11-28 | 4 | 3 |
| 5 | 177 | DE | 1 | 2019-11-21 | 2019-12-11 | 5 | 3 |
| 6 | 177 | DE | 1 | 2019-11-25 | 2019-12-06 | 6 | 3 |
| 7 | 177 | DE | 0 | 2019-11-30 | None | 7 | 4 |
| 8 | 177 | DE | 1 | 2020-04-30 | 2020-05-27 | 8 | 6 |
| 9 | 177 | DE | 1 | 2020-04-30 | 2020-09-18 | 8 | 6 |
+-------+------+------------+----------+------------+---------------+----------------+------------------+
Sample code:
import pandas as pd
from io import StringIO
df_text = """
row id iso_code return buy_date return_date
0 177 DE 1 2019-05-16 2019-05-24
1 177 DE 1 2019-05-29 2019-06-03
2 177 DE 1 2019-10-27 2019-11-06
3 177 DE 0 2019-11-06 None
4 177 DE 1 2019-11-18 2019-11-28
5 177 DE 1 2019-11-21 2019-12-11
6 177 DE 1 2019-11-25 2019-12-06
7 177 DE 0 2019-11-30 None
8 177 DE 1 2020-04-30 2020-05-27
9 177 DE 1 2020-04-30 2020-09-18
"""
df = pd.read_csv(StringIO(df_text), sep=r'\s+', index_col=0)  # data above is space-separated, not tab-separated
df['items_bought'] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 8]
df['items_returned'] = [0, 1, 2, 2, 3, 3, 3, 4, 6, 6]
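For reference, a naive per-row scan (a sketch added here for clarity, not part of the original question) reproduces the two target columns directly from their definition. It's O(n²) per group and far too slow for 2 million rows, but it makes the condition explicit:

```python
import pandas as pd
from io import StringIO

df_text = """
row  id   iso_code  return  buy_date    return_date
0    177  DE        1       2019-05-16  2019-05-24
1    177  DE        1       2019-05-29  2019-06-03
2    177  DE        1       2019-10-27  2019-11-06
3    177  DE        0       2019-11-06  None
4    177  DE        1       2019-11-18  2019-11-28
5    177  DE        1       2019-11-21  2019-12-11
6    177  DE        1       2019-11-25  2019-12-06
7    177  DE        0       2019-11-30  None
8    177  DE        1       2020-04-30  2020-05-27
9    177  DE        1       2020-04-30  2020-09-18
"""
df = pd.read_csv(StringIO(df_text), sep=r'\s+', index_col=0)
for col in ('buy_date', 'return_date'):
    df[col] = pd.to_datetime(df[col])   # 'None' is read as NaN, becoming NaT

def naive_counts(group):
    # items_bought: sales in the group whose buy_date is strictly before this row's
    # items_returned: returns completed (return_date) strictly before this row's buy_date
    bought = [(group['buy_date'] < b).sum() for b in group['buy_date']]
    returned = [((group['return'] == 1) & (group['return_date'] < b)).sum()
                for b in group['buy_date']]
    return pd.DataFrame({'items_bought': bought, 'items_returned': returned},
                        index=group.index)

result = df.groupby(['id', 'iso_code'], group_keys=False).apply(naive_counts)
```

Comparisons against NaT are False, so open (not-yet-returned) sales are never counted as returns.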
This seems to require a cross merge:
(df[['id','iso_code', 'buy_date']].reset_index()
.merge(df[['id','iso_code', 'return','return_date','buy_date']], on=['id','iso_code'])
.assign(items_returned=lambda x: x['return_date'].lt(x['buy_date_x'])*x['return'],
items_bought=lambda x: x['buy_date_y'].lt(x['buy_date_x']))
.groupby('row')[['items_bought','items_returned']].sum()
)
Output:
items_bought items_returned
row
0 0 0
1 1 1
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 4
8 8 6
9 8 6
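A self-merge on (id, iso_code) produces, for a group of n rows, n² merged rows, so it can be worth estimating the blow-up before running it on the full table. A small sketch (toy group sizes standing in for the real data):

```python
import pandas as pd

# Toy frame: one group of 10 rows and one of 3, standing in for real (id, iso_code) groups.
df = pd.DataFrame({'id': [177] * 10 + [200] * 3,
                   'iso_code': ['DE'] * 10 + ['FR'] * 3})

sizes = df.groupby(['id', 'iso_code']).size()
merged_rows = int((sizes ** 2).sum())   # rows a self-merge on the keys would produce
print(merged_rows)  # 10**2 + 3**2 = 109
```

With many small groups the total stays manageable; a few very large groups dominate the memory cost.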
Update: for larger data, the cross merge is not ideal because of its memory requirements. We can instead do a groupby() first, so that we only merge within the smaller groups:
def myfunc(df):
    return (df[['id','iso_code', 'buy_date']].reset_index()
            .merge(df[['id','iso_code', 'return','return_date','buy_date']], on=['id','iso_code'])
            .assign(items_returned=lambda x: x['return_date'].lt(x['buy_date_x'])*x['return'],
                    items_bought=lambda x: x['buy_date_y'].lt(x['buy_date_x']))
            .groupby('row')[['items_bought','items_returned']].sum()
           )
df.groupby(['id','iso_code']).apply(myfunc).reset_index(level=[0,1], drop=True)
And you would get the same output:
items_bought items_returned
row
0 0 0
1 1 1
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 4
8 8 6
9 8 6
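If even the per-group merge is too heavy, an alternative sketch (my own addition, not part of the answer above) avoids the quadratic merge entirely: within each group, sort the buy dates and the completed return dates once, then use np.searchsorted to count how many fall strictly before each row's buy_date. That is O(n log n) per group instead of O(n²):

```python
import numpy as np
import pandas as pd
from io import StringIO

df_text = """
row  id   iso_code  return  buy_date    return_date
0    177  DE        1       2019-05-16  2019-05-24
1    177  DE        1       2019-05-29  2019-06-03
2    177  DE        1       2019-10-27  2019-11-06
3    177  DE        0       2019-11-06  None
4    177  DE        1       2019-11-18  2019-11-28
5    177  DE        1       2019-11-21  2019-12-11
6    177  DE        1       2019-11-25  2019-12-06
7    177  DE        0       2019-11-30  None
8    177  DE        1       2020-04-30  2020-05-27
9    177  DE        1       2020-04-30  2020-09-18
"""
df = pd.read_csv(StringIO(df_text), sep=r'\s+', index_col=0)
for col in ('buy_date', 'return_date'):
    df[col] = pd.to_datetime(df[col])

def counts_sorted(group):
    buys = np.sort(group['buy_date'].to_numpy())
    # completed returns only; drop NaT (sales never returned)
    rets = np.sort(group.loc[group['return'] == 1, 'return_date'].dropna().to_numpy())
    target = group['buy_date'].to_numpy()
    # side='left' counts elements strictly smaller than each target date
    return pd.DataFrame({'items_bought': np.searchsorted(buys, target, side='left'),
                         'items_returned': np.searchsorted(rets, target, side='left')},
                        index=group.index)

out = df.groupby(['id', 'iso_code'], group_keys=False).apply(counts_sorted)
```

On the sample data this reproduces the items_bought and items_returned columns from the question, and memory use stays linear in the group size.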