How to speed up a conditional groupby sum in pandas

Question

I have a dataframe with a large number of rows, and I want to compute a conditional groupby sum on it.

Here is an example of my dataframe and code:

import pandas as pd

data = {'Case': [1, 1, 1, 1, 1, 1],
        'Id': [1, 1, 1, 1, 2, 2],
        'Date1': ['2020-01-01', '2020-01-01', '2020-02-01', '2020-02-01', '2020-01-01', '2020-01-01'],
        'Date2': ['2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01', '2020-01-01', '2020-02-01'],
        'Quantity': [50,100,150,20,30,35]
        }

df = pd.DataFrame(data)

df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])

sum_list = []


for d in df['Date1'].unique():
    temp = df.groupby(['Case','Id']).apply(lambda x: x[(x['Date2'] == d) & (x['Date1']<d)]['Quantity'].sum()).rename('sum').to_frame()
    temp['Date'] = d
    sum_list.append(temp)
    

output = pd.concat(sum_list, axis=0).reset_index()

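When I run this for loop on my real dataframe, it is very slow. I would like to find a better way to perform this conditional groupby sum. Here are my questions:

1. Is a for loop a good way to do what I need?
2. Is there a better way to replace line 1 inside the for loop?
3. I suspect line 2 inside the for loop is also time-consuming. How should I improve it?

Thanks for your help.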

Answer 1

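apply is slow. Avoid it whenever possible.

I tested this with your small snippet and it gives the correct answer. You should test it more thoroughly with your real data: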

case = df["Case"].unique()
id_ = df["Id"].unique()
d = df["Date1"].unique()
index = pd.MultiIndex.from_product([case, id_, d], names=["Case", "Id", "Date"])

# Sum only rows whose Date2 belongs to the list of Date1 values.
# This is equivalent to `x['Date2'] == d` in your original code.
cond = df["Date2"].isin(d)
tmp = df[cond].groupby(["Case", "Id", "Date1", "Date2"], as_index=False).sum()

# Keep only those sums where Date1 < Date2 and sum again.
# This takes care of the `x['Date1'] < d` condition.
# Selecting "Quantity" keeps the datetime column Date1 out of the second sum,
# which recent pandas versions refuse to add up.
cond = tmp["Date1"] < tmp["Date2"]
output = tmp[cond].groupby(["Case", "Id", "Date2"])[["Quantity"]].sum().reindex(index, fill_value=0).reset_index()
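
One way to do that more thorough testing is to wrap both versions in functions and compare their results. The sketch below is only an illustration: `loop_sum`, `vectorized_sum` and `as_dict` are hypothetical helper names, the two conditions from this answer are combined into a single mask before one groupby, and the sample dataframe is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Case": [1, 1, 1, 1, 1, 1],
    "Id": [1, 1, 1, 1, 2, 2],
    "Date1": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01",
                             "2020-02-01", "2020-01-01", "2020-01-01"]),
    "Date2": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-01",
                             "2020-02-01", "2020-01-01", "2020-02-01"]),
    "Quantity": [50, 100, 150, 20, 30, 35],
})


def loop_sum(df):
    """Reference result: the loop from the question, one groupby per date."""
    frames = []
    cols = ["Date1", "Date2", "Quantity"]  # non-key columns the lambda reads
    for d in df["Date1"].unique():
        temp = (
            df.groupby(["Case", "Id"])[cols]
            .apply(lambda x: x.loc[(x["Date2"] == d) & (x["Date1"] < d), "Quantity"].sum())
            .rename("sum")
            .to_frame()
        )
        temp["Date"] = d
        frames.append(temp)
    return pd.concat(frames).reset_index()


def vectorized_sum(df):
    """Same sums without apply: one boolean mask, one groupby, one reindex."""
    dates = df["Date1"].unique()
    full_index = pd.MultiIndex.from_product(
        [df["Case"].unique(), df["Id"].unique(), dates],
        names=["Case", "Id", "Date"],
    )
    # Both conditions from the loop, evaluated once over the whole frame
    keep = df["Date2"].isin(dates) & (df["Date1"] < df["Date2"])
    sums = df[keep].groupby(["Case", "Id", "Date2"])["Quantity"].sum()
    # Combinations with no qualifying rows get an explicit 0
    return sums.reindex(full_index, fill_value=0).rename("sum").reset_index()


def as_dict(result):
    """Map (Case, Id, Date) -> sum so row and column order do not matter."""
    return result.set_index(["Case", "Id", "Date"])["sum"].to_dict()


assert as_dict(vectorized_sum(df)) == as_dict(loop_sum(df))
print(vectorized_sum(df))
```

One difference to watch for on real data: `MultiIndex.from_product` creates every Case/Id combination, including pairs that never occur together in the dataframe, whereas the loop only reports pairs that exist. With a single Case, as in the sample, the two agree.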

Answer 2

One option is a double merge plus a groupby:

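# Unique Date1 values: the dates to match against Date2
date = pd.Series(df.Date1.unique(), name='Date')
# First merge: attach the matching date to each row via Date2
step1 = df.merge(date, left_on = 'Date2', right_on = 'Date', how = 'outer')
# Keep rows whose Date1 is earlier than the matched date, then sum per group
step2 = step1.loc[step1.Date1 < step1.Date]
step2 = step2.groupby(['Case', 'Id', 'Date']).agg(sum=('Quantity','sum'))
# Second merge: bring back the groups with no qualifying rows and fill them with 0
(df
.loc[:, ['Case', 'Id', 'Date2']]
.drop_duplicates()
.rename(columns={'Date2':'Date'})
.merge(step2, how = 'left', on = ['Case', 'Id', 'Date'])
.fillna({'sum': 0}, downcast='infer')
)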

   Case  Id       Date  sum
0     1   1 2020-01-01    0
1     1   1 2020-02-01  100
2     1   2 2020-01-01    0
3     1   2 2020-02-01   35
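
The `downcast` argument of `fillna` is deprecated in recent pandas releases, so the last step may warn or fail depending on the installed version. The following is a minimal sketch of the same double-merge pipeline wrapped in a function, filling and casting explicitly instead. `merge_sum` is a hypothetical name, and the cast to `int64` assumes that `Quantity` holds whole numbers, as it does in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Case": [1, 1, 1, 1, 1, 1],
    "Id": [1, 1, 1, 1, 2, 2],
    "Date1": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01",
                             "2020-02-01", "2020-01-01", "2020-01-01"]),
    "Date2": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-01",
                             "2020-02-01", "2020-01-01", "2020-02-01"]),
    "Quantity": [50, 100, 150, 20, 30, 35],
})


def merge_sum(df):
    """Double merge plus groupby, without the deprecated downcast argument."""
    dates = pd.Series(df["Date1"].unique(), name="Date")
    # First merge: tag each row with the date that equals its Date2
    step1 = df.merge(dates, left_on="Date2", right_on="Date", how="outer")
    # Keep rows with Date1 before the matched date and sum them per group
    step2 = (
        step1.loc[step1["Date1"] < step1["Date"]]
        .groupby(["Case", "Id", "Date"], as_index=False)
        .agg(sum=("Quantity", "sum"))
    )
    # Second merge: one output row per (Case, Id, Date2) seen in the data
    skeleton = (
        df[["Case", "Id", "Date2"]]
        .drop_duplicates()
        .rename(columns={"Date2": "Date"})
    )
    out = skeleton.merge(step2, how="left", on=["Case", "Id", "Date"])
    # Assumption: Quantity is integer-valued, so the filled sums can be int64
    out["sum"] = out["sum"].fillna(0).astype("int64")
    return out


result = merge_sum(df)
print(result)
```

Note that the output rows are built from the distinct `Date2` values, while the loop in the question iterates over the distinct `Date1` values. The two sets are identical in the sample data; if they differ in the real data, the reported dates will differ as well.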

Answer 3

Another solution:

x = df.groupby(["Case", "Id", "Date1"], as_index=False).apply(
    lambda x: x.loc[x["Date1"] < x["Date2"], "Quantity"].sum()
)

print(
    x.pivot(index=["Case", "Id"], columns="Date1", values=None)
    .fillna(0)
    .melt(ignore_index=False)
    .drop(columns=[None])
    .reset_index()
    .rename(columns={"Date1": "Date", "value":"sum"})
)

Prints:

   Case  Id       Date    sum
0     1   1 2020-01-01  100.0
1     1   2 2020-01-01   35.0
2     1   1 2020-02-01    0.0
3     1   2 2020-02-01    0.0
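
Since the first answer advises against apply, here is a hedged sketch of how the same table might be produced without it. It is not the answerer's code: it swaps the groupby/apply step for `Series.where` followed by `pivot_table`, and it assumes the sample dataframe from the question. Quantities that fail the `Date1 < Date2` test are set to 0 rather than dropped, so dates with no qualifying rows still get a column.

```python
import pandas as pd

df = pd.DataFrame({
    "Case": [1, 1, 1, 1, 1, 1],
    "Id": [1, 1, 1, 1, 2, 2],
    "Date1": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01",
                             "2020-02-01", "2020-01-01", "2020-01-01"]),
    "Date2": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-01",
                             "2020-02-01", "2020-01-01", "2020-02-01"]),
    "Quantity": [50, 100, 150, 20, 30, 35],
})

# Zero out quantities that fail the condition instead of dropping the rows,
# so every Date1 value present in the data survives the pivot.
masked = df.assign(Quantity=df["Quantity"].where(df["Date1"] < df["Date2"], 0))

# One row per (Case, Id), one column per Date1; missing combinations become 0
wide = masked.pivot_table(
    index=["Case", "Id"],
    columns="Date1",
    values="Quantity",
    aggfunc="sum",
    fill_value=0,
)

# Back to long form with the column names used elsewhere in this thread
result = (
    wide.melt(ignore_index=False, var_name="Date", value_name="sum")
    .reset_index()
)
print(result)
```

Like the answer above, this labels each sum by `Date1`. The loop in the question labels the same quantities by the matching `Date2` value instead, which is why 100 and 35 appear under 2020-01-01 here and under 2020-02-01 in the other answers.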

3 answers

Answer 1: score 2, posted 2022-05-02 21:14:51
Answer 2: score 2, accepted, posted 2022-05-02 23:36:00
Answer 3: score 1, posted 2022-05-02 21:33:25
