Pandas groupby transform cumulative with conditions
I have a large table with many product ids and iso_codes: 2 million rows in total. So the answer should, if possible, also take memory into account; I have 16 GB of memory.
I would like to see, for every (id, iso_code) combination, the number of items returned before the buy_date in the row (so, cumulatively), but there's a catch: I only want to count returns from previous sales whose return_date falls before the buy_date I'm looking at.
I've added the column items_returned as an example: this is the column that should be calculated.
The idea is as follows: at the moment of the sale, I can only count returns that have already happened, not ones that will happen in the future.
I tried a combination of df.groupby(['id', 'iso_code']).transform(np.cumsum) and .transform(lambda x: only count returns that happened before my buy_date), but couldn't figure out how to do a .groupby.transform(np.cumsum) with these special conditions applied.
A similar question applies to items bought, where I only count items cumulatively for days before my buy_date.
Hope you can help me.
Example resulting table:
+-------+------+------------+----------+------------+---------------+----------------+------------------+
| row | id | iso_code | return | buy_date | return_date | items_bought | items_returned |
|-------+------+------------+----------+------------+---------------+----------------+------------------|
| 0 | 177 | DE | 1 | 2019-05-16 | 2019-05-24 | 0 | 0 |
| 1 | 177 | DE | 1 | 2019-05-29 | 2019-06-03 | 1 | 1 |
| 2 | 177 | DE | 1 | 2019-10-27 | 2019-11-06 | 2 | 2 |
| 3 | 177 | DE | 0 | 2019-11-06 | None | 3 | 2 |
| 4 | 177 | DE | 1 | 2019-11-18 | 2019-11-28 | 4 | 3 |
| 5 | 177 | DE | 1 | 2019-11-21 | 2019-12-11 | 5 | 3 |
| 6 | 177 | DE | 1 | 2019-11-25 | 2019-12-06 | 6 | 3 |
| 7 | 177 | DE | 0 | 2019-11-30 | None | 7 | 4 |
| 8 | 177 | DE | 1 | 2020-04-30 | 2020-05-27 | 8 | 6 |
| 9 | 177 | DE | 1 | 2020-04-30 | 2020-09-18 | 8 | 6 |
+-------+------+------------+----------+------------+---------------+----------------+------------------+
Sample code:
import pandas as pd
from io import StringIO
df_text = """
row id iso_code return buy_date return_date
0 177 DE 1 2019-05-16 2019-05-24
1 177 DE 1 2019-05-29 2019-06-03
2 177 DE 1 2019-10-27 2019-11-06
3 177 DE 0 2019-11-06 None
4 177 DE 1 2019-11-18 2019-11-28
5 177 DE 1 2019-11-21 2019-12-11
6 177 DE 1 2019-11-25 2019-12-06
7 177 DE 0 2019-11-30 None
8 177 DE 1 2020-04-30 2020-05-27
9 177 DE 1 2020-04-30 2020-09-18
"""
df = pd.read_csv(StringIO(df_text), sep=r'\s+', index_col=0)  # data above is space-separated, not tab-separated
df['items_bought'] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 8]
df['items_returned'] = [0, 1, 2, 2, 3, 3, 3, 4, 6, 6]
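For reference, a naive per-row scan (a sketch added here for clarity, not part of the original question) reproduces the two target columns directly from their definition. It's O(n²) per group and far too slow for 2 million rows, but it makes the condition explicit:

```python
import pandas as pd
from io import StringIO

df_text = """
row  id   iso_code  return  buy_date    return_date
0    177  DE        1       2019-05-16  2019-05-24
1    177  DE        1       2019-05-29  2019-06-03
2    177  DE        1       2019-10-27  2019-11-06
3    177  DE        0       2019-11-06  None
4    177  DE        1       2019-11-18  2019-11-28
5    177  DE        1       2019-11-21  2019-12-11
6    177  DE        1       2019-11-25  2019-12-06
7    177  DE        0       2019-11-30  None
8    177  DE        1       2020-04-30  2020-05-27
9    177  DE        1       2020-04-30  2020-09-18
"""
df = pd.read_csv(StringIO(df_text), sep=r'\s+', index_col=0)
for col in ('buy_date', 'return_date'):
    df[col] = pd.to_datetime(df[col])   # 'None' is read as NaN, becoming NaT

def naive_counts(group):
    # items_bought: sales in the group whose buy_date is strictly before this row's
    # items_returned: returns completed (return_date) strictly before this row's buy_date
    bought = [(group['buy_date'] < b).sum() for b in group['buy_date']]
    returned = [((group['return'] == 1) & (group['return_date'] < b)).sum()
                for b in group['buy_date']]
    return pd.DataFrame({'items_bought': bought, 'items_returned': returned},
                        index=group.index)

result = df.groupby(['id', 'iso_code'], group_keys=False).apply(naive_counts)
```

Comparisons against NaT are False, so open (not-yet-returned) sales are never counted as returns.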
This seems to require a cross merge:
(df[['id','iso_code', 'buy_date']].reset_index()
.merge(df[['id','iso_code', 'return','return_date','buy_date']], on=['id','iso_code'])
.assign(items_returned=lambda x: x['return_date'].lt(x['buy_date_x'])*x['return'],
items_bought=lambda x: x['buy_date_y'].lt(x['buy_date_x']))
.groupby('row')[['items_bought','items_returned']].sum()
)
Output:
items_bought items_returned
row
0 0 0
1 1 1
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 4
8 8 6
9 8 6
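A self-merge on (id, iso_code) produces, for a group of n rows, n² merged rows, so it can be worth estimating the blow-up before running it on the full table. A small sketch (toy group sizes standing in for the real data):

```python
import pandas as pd

# Toy frame: one group of 10 rows and one of 3, standing in for real (id, iso_code) groups.
df = pd.DataFrame({'id': [177] * 10 + [200] * 3,
                   'iso_code': ['DE'] * 10 + ['FR'] * 3})

sizes = df.groupby(['id', 'iso_code']).size()
merged_rows = int((sizes ** 2).sum())   # rows a self-merge on the keys would produce
print(merged_rows)  # 10**2 + 3**2 = 109
```

With many small groups the total stays manageable; a few very large groups dominate the memory cost.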
Update: for larger data, the cross merge is not ideal because of its memory requirements. We can instead do a groupby() first, so that we only merge within the smaller groups:
def myfunc(df):
    return (df[['id','iso_code', 'buy_date']].reset_index()
            .merge(df[['id','iso_code', 'return','return_date','buy_date']], on=['id','iso_code'])
            .assign(items_returned=lambda x: x['return_date'].lt(x['buy_date_x'])*x['return'],
                    items_bought=lambda x: x['buy_date_y'].lt(x['buy_date_x']))
            .groupby('row')[['items_bought','items_returned']].sum()
           )
df.groupby(['id','iso_code']).apply(myfunc).reset_index(level=[0,1], drop=True)
And you would get the same output:
items_bought items_returned
row
0 0 0
1 1 1
2 2 2
3 3 2
4 4 3
5 5 3
6 6 3
7 7 4
8 8 6
9 8 6
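If even the per-group merge is too heavy, an alternative sketch (my own addition, not part of the answer above) avoids the quadratic merge entirely: within each group, sort the buy dates and the completed return dates once, then use np.searchsorted to count how many fall strictly before each row's buy_date. That is O(n log n) per group instead of O(n²):

```python
import numpy as np
import pandas as pd
from io import StringIO

df_text = """
row  id   iso_code  return  buy_date    return_date
0    177  DE        1       2019-05-16  2019-05-24
1    177  DE        1       2019-05-29  2019-06-03
2    177  DE        1       2019-10-27  2019-11-06
3    177  DE        0       2019-11-06  None
4    177  DE        1       2019-11-18  2019-11-28
5    177  DE        1       2019-11-21  2019-12-11
6    177  DE        1       2019-11-25  2019-12-06
7    177  DE        0       2019-11-30  None
8    177  DE        1       2020-04-30  2020-05-27
9    177  DE        1       2020-04-30  2020-09-18
"""
df = pd.read_csv(StringIO(df_text), sep=r'\s+', index_col=0)
for col in ('buy_date', 'return_date'):
    df[col] = pd.to_datetime(df[col])

def counts_sorted(group):
    buys = np.sort(group['buy_date'].to_numpy())
    # completed returns only; drop NaT (sales never returned)
    rets = np.sort(group.loc[group['return'] == 1, 'return_date'].dropna().to_numpy())
    target = group['buy_date'].to_numpy()
    # side='left' counts elements strictly smaller than each target date
    return pd.DataFrame({'items_bought': np.searchsorted(buys, target, side='left'),
                         'items_returned': np.searchsorted(rets, target, side='left')},
                        index=group.index)

out = df.groupby(['id', 'iso_code'], group_keys=False).apply(counts_sorted)
```

On the sample data this reproduces the items_bought and items_returned columns from the question, and memory use stays linear in the group size.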