Calculation of sales with a dataframe takes too long
I have a problem. I would like to calculate the turnover for a customer in the last 6 months. The methods work on my dummy data, but unfortunately they do not work on my real data because they are too slow. How can I rewrite this so that it performs faster?
Dataframe
customerId fromDate sales
0 1 2022-06-01 100
1 1 2022-05-25 20
2 1 2022-05-25 50
3 1 2022-05-20 30
4 1 2021-09-05 40
5 2 2022-06-02 80
6 3 2021-03-01 50
7 3 2021-02-01 20
Code
from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd

def find_last_date(date_: datetime) -> datetime:
    six_months = date_ + relativedelta(months=-6)
    return six_months

def sum_func(row: pd.DataFrame, df: pd.DataFrame) -> int:
    return df[
        (df["customerId"] == row["customerId"])
        & (row["fromDate"] + relativedelta(months=-6) <= df["fromDate"])
        & (df["fromDate"] <= row["fromDate"])
    ]["sales"].sum()

d = {
    "customerId": [1, 1, 1, 1, 1, 2, 3, 3],
    "fromDate": [
        "2022-06-01",
        "2022-05-25",
        "2022-05-25",
        "2022-05-20",
        "2021-09-05",
        "2022-06-02",
        "2021-03-01",
        "2021-02-01",
    ],
    "sales": [100, 20, 50, 30, 40, 80, 50, 20],
}
df = pd.DataFrame(data=d)
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"].apply(find_last_date)
df["total_sales"] = df[["customerId", "fromDate"]].apply(lambda x: sum_func(x, df), axis=1)
print(df)
What I want
customerId fromDate sales last_month total_sales
0 1 2022-06-01 100 2022-03-01 200 # 100 + 20 + 50 + 30
1 1 2022-05-25 20 2022-02-25 100 # 20 + 50 + 30
2 1 2022-05-25 50 2022-02-25 100 # 50 + 20 + 30
3 1 2022-05-20 30 2022-02-20 30 # 30
4 1 2021-09-05 40 2021-06-05 40 # 40
5 2 2022-06-02 80 2022-03-02 80 # 80
6 3 2021-03-01 50 2020-12-01 70 # 50 + 20
7 3 2021-02-01 20 2020-11-01 20 # 20
print(df['customerId'].value_counts().describe())
count 53979.000
mean 87.404
std 1588.450
min 1.000
25% 2.000
50% 6.000
75% 22.000
max 205284.000
print(df['fromDate'].agg((min, max)))
min 2021-02-22
max 2022-03-26
Use numpy broadcasting per group: build a boolean mask marking, for each row, which rows of the same customer fall inside its 6-month window, then take the Sales values where the mask is True (and 0 where it is not) and sum them into a new column:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
d1 = x["fromDate"].to_numpy()
d2 = x["last_month"].to_numpy()
mask = (d2[:, None]<=d1) & (d1<=d1[:, None])
x['total_sales'] = np.dot(mask, x['sales'].to_numpy())
return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate sales last_month total_sales
0 1 2022-06-01 100 2021-12-01 200
1 1 2022-05-25 20 2021-11-25 100
2 1 2022-05-25 50 2021-11-25 100
3 1 2022-05-20 30 2021-11-20 30
4 1 2021-09-05 40 2021-03-05 40
5 2 2022-06-02 80 2021-12-02 80
6 3 2021-03-01 50 2020-09-01 70
7 3 2021-02-01 20 2020-08-01 20
EDIT: for large groups, compute the matrix product in chunks to reduce peak memory usage:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
#https://stackoverflow.com/a/27670190/2901002
def chunking_dot(big_matrix, small_matrix, chunk_size=10000):
# Make a copy if the array is not already contiguous
small_matrix = np.ascontiguousarray(small_matrix)
R = np.empty((big_matrix.shape[0], small_matrix.shape[1]))
for i in range(0, R.shape[0], chunk_size):
end = i + chunk_size
R[i:end] = np.dot(big_matrix[i:end], small_matrix)
return R
def f(x):
d1 = x["fromDate"].to_numpy()
d2 = x["last_month"].to_numpy()
mask = (d2[:, None]<=d1) & (d1<=d1[:, None])
# print (mask)
x['total_sales'] = chunking_dot(mask, x[['sales']].to_numpy())
return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate sales last_month total_sales
0 1 2022-06-01 100 2021-12-01 200.0
1 1 2022-05-25 20 2021-11-25 100.0
2 1 2022-05-25 50 2021-11-25 100.0
3 1 2022-05-20 30 2021-11-20 30.0
4 1 2021-09-05 40 2021-03-05 40.0
5 2 2022-06-02 80 2021-12-02 80.0
6 3 2021-03-01 50 2020-09-01 70.0
7 3 2021-02-01 20 2020-08-01 20.0
Using multiprocessing, and treating 6 months as 180 days to reduce the memory footprint and the computation time. Copy the following code to a Python file and run it from the console (not from a Jupyter Notebook).
import pandas as pd
import numpy as np
import multiprocessing as mp
import time
def sum_sales(customer, df):
    # 1st pass: sum sales of same days, reduce the row numbers
    df1 = df.groupby('fromDate')['sales'].sum()
    # Generate all missing dates
    df1 = df1.reindex(pd.date_range(df1.index.min(), df1.index.max(), freq='D'), fill_value=0)
    # 2nd pass: use a sliding window of 180 days to sum
    df1 = df1.rolling(180, min_periods=0).sum().astype(int)
    # Restore original index for the group
    df1 = df1.reindex(df['fromDate']).reset_index(drop=True).to_frame().set_index(df.index)
    return df1
if __name__ == '__main__':  # Do not remove this line! Mandatory
    # Setup a minimal reproducible example
    N = 3_000_000
    D = pd.to_datetime('2021-1-1')
    rng = np.random.default_rng(2022)
    dti = D + pd.to_timedelta(rng.integers(0, 365*2, N), unit='D')
    cust = rng.integers(0, 75000, N)
    sales = rng.integers(1, 200, N)
    df = pd.DataFrame({'customerId': cust, 'fromDate': dti, 'sales': sales})

    # Ensure your dataframe is sorted by fromDate for the rolling window
    df = df.sort_values(['customerId', 'fromDate'], ignore_index=True)

    start = time.time()
    with mp.Pool(mp.cpu_count() - 1) as p:
        results = p.starmap(sum_sales, df.groupby('customerId'))
    df['total_sales'] = pd.concat(results)
    end = time.time()

    print(f"Elapsed time: {end - start:.2f} seconds")
For 3 million records and 75k different customers over 2 years (730 days):
[...]$ python mp.py
Elapsed time: 24.36 seconds
However, the number of sales per customer is much better balanced than yours:
>>> df['customerId'].value_counts().describe(percentiles=np.linspace(0, 1, 11))
count 75000.000000
mean 40.000000
std 6.349157
min 15.000000
0% 15.000000
10% 32.000000
20% 35.000000
30% 37.000000
40% 38.000000
50% 40.000000
60% 41.000000
70% 43.000000
80% 45.000000
90% 48.000000 # <- check the 90th percentile of your data
100% 73.000000
max 73.000000 # <- max transactions for a single customer
Name: customerId, dtype: float64
Because the sales are evenly distributed per customer, my sample takes full advantage of multiprocessing. In your case, I don't think that will be true (check the 90th percentile).
The check with your dataframe:
>>> df
customerId fromDate sales total_sales
0 1 2022-06-01 100 200
1 1 2022-05-25 20 100
2 1 2022-05-25 50 100
3 1 2022-05-20 30 30
4 1 2021-09-05 40 40
5 2 2022-06-02 80 80
6 3 2021-03-01 50 70
7 3 2021-02-01 20 20
If you decide to keep a variable moving window of 6 months instead of a fixed moving window of 180 days, the algorithm will be the same. The important point in the code is to reduce the number of rows per customer. In your sample, you can group the sales for the same (customer, date): customer 1 has 2 rows for 2022-05-25, so you can sum them immediately.
IIUC, in your real data you have a customer with 205,284 sales between 2021-02-22 and 2022-03-26 (397 days), so this user has an average of 517 transactions per day (?). If you sum the sales of same days, you reduce the number of records from 205,284 to 397...
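A minimal sketch of that idea, on the question's sample data and without the multiprocessing part: pre-aggregate duplicate (customerId, fromDate) rows, then apply a time-based 180-day rolling sum per customer on the reduced data. The 180-day window is an approximation of "6 months", and the variable names daily/out are just for this illustration:

import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 1, 2, 3, 3],
    "fromDate": pd.to_datetime([
        "2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20",
        "2021-09-05", "2022-06-02", "2021-03-01", "2021-02-01",
    ]),
    "sales": [100, 20, 50, 30, 40, 80, 50, 20],
})

# 1st pass: one row per (customer, day) -- customer 1's two 2022-05-25 rows collapse into one
daily = df.groupby(["customerId", "fromDate"], as_index=False)["sales"].sum()

# 2nd pass: 180-day time-based rolling sum per customer on the reduced data
daily = daily.sort_values(["customerId", "fromDate"])
daily["total_sales"] = (
    daily.set_index("fromDate")
         .groupby("customerId")["sales"]
         .rolling("180D")
         .sum()
         .to_numpy()
)

# map the per-day totals back onto the original rows
out = df.merge(daily[["customerId", "fromDate", "total_sales"]],
               on=["customerId", "fromDate"], how="left")
print(out)

The point of the first pass is exactly the reduction described above: each customer ends up with at most one row per day before any windowing is done.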