Calculation of sales with a dataframe takes too long
I have a problem. I would like to calculate the turnover for a customer in the last 6 months. The methods work on my dummy data, but unfortunately they do not work on my real data because they are too slow. How can I rewrite this so that it performs faster?
Dataframe
customerId fromDate sales
0 1 2022-06-01 100
1 1 2022-05-25 20
2 1 2022-05-25 50
3 1 2022-05-20 30
4 1 2021-09-05 40
5 2 2022-06-02 80
6 3 2021-03-01 50
7 3 2021-02-01 20
Code
from datetime import datetime
from dateutil.relativedelta import relativedelta
import pandas as pd

def find_last_date(date_: datetime) -> datetime:
    six_months = date_ + relativedelta(months=-6)
    return six_months

def sum_func(row: pd.DataFrame, df: pd.DataFrame) -> int:
    return df[
        (df["customerId"] == row["customerId"])
        & (row["fromDate"] + relativedelta(months=-6) <= df["fromDate"])
        & (df["fromDate"] <= row["fromDate"])
    ]["sales"].sum()

d = {
    "customerId": [1, 1, 1, 1, 1, 2, 3, 3],
    "fromDate": [
        "2022-06-01",
        "2022-05-25",
        "2022-05-25",
        "2022-05-20",
        "2021-09-05",
        "2022-06-02",
        "2021-03-01",
        "2021-02-01",
    ],
    "sales": [100, 20, 50, 30, 40, 80, 50, 20],
}
df = pd.DataFrame(data=d)
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"].apply(find_last_date)
df["total_sales"] = df[["customerId", "fromDate"]].apply(lambda x: sum_func(x, df), axis=1)
print(df)
What I want
customerId fromDate sales last_month total_sales
0 1 2022-06-01 100 2022-03-01 200 # 100 + 20 + 50 + 30
1 1 2022-05-25 20 2022-02-25 100 # 20 + 50 + 30
2 1 2022-05-25 50 2022-02-25 100 # 50 + 20 + 30
3 1 2022-05-20 30 2022-02-20 30 # 30
4 1 2021-09-05 40 2021-06-05 40 # 40
5 2 2022-06-02 80 2022-03-02 80 # 80
6 3 2021-03-01 50 2020-12-01 70 # 50 + 20
7 3 2021-02-01 20 2020-11-01 20 # 20
print(df['customerId'].value_counts().describe())
count 53979.000
mean 87.404
std 1588.450
min 1.000
25% 2.000
50% 6.000
75% 22.000
max 205284.000
print(df['fromDate'].agg((min, max)))
min 2021-02-22
max 2022-03-26
Use numpy broadcasting per group: build a boolean mask marking, for each row, which rows of the same customer fall inside its 6-month window, then take the Sales values where the mask is True (and 0 where it is not) and sum them into a new column:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
def f(x):
d1 = x["fromDate"].to_numpy()
d2 = x["last_month"].to_numpy()
mask = (d2[:, None]<=d1) & (d1<=d1[:, None])
x['total_sales'] = np.dot(mask, x['sales'].to_numpy())
return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate sales last_month total_sales
0 1 2022-06-01 100 2021-12-01 200
1 1 2022-05-25 20 2021-11-25 100
2 1 2022-05-25 50 2021-11-25 100
3 1 2022-05-20 30 2021-11-20 30
4 1 2021-09-05 40 2021-03-05 40
5 2 2022-06-02 80 2021-12-02 80
6 3 2021-03-01 50 2020-09-01 70
7 3 2021-02-01 20 2020-08-01 20
EDIT: for large groups, compute the matrix product in chunks to reduce peak memory usage:
df["fromDate"] = pd.to_datetime(df["fromDate"], errors="coerce")
df["last_month"] = df["fromDate"] - pd.offsets.DateOffset(months=6)
#https://stackoverflow.com/a/27670190/2901002
def chunking_dot(big_matrix, small_matrix, chunk_size=10000):
# Make a copy if the array is not already contiguous
small_matrix = np.ascontiguousarray(small_matrix)
R = np.empty((big_matrix.shape[0], small_matrix.shape[1]))
for i in range(0, R.shape[0], chunk_size):
end = i + chunk_size
R[i:end] = np.dot(big_matrix[i:end], small_matrix)
return R
def f(x):
d1 = x["fromDate"].to_numpy()
d2 = x["last_month"].to_numpy()
mask = (d2[:, None]<=d1) & (d1<=d1[:, None])
# print (mask)
x['total_sales'] = chunking_dot(mask, x[['sales']].to_numpy())
return x
df = df.groupby('customerId').apply(f)
print(df)
customerId fromDate sales last_month total_sales
0 1 2022-06-01 100 2021-12-01 200.0
1 1 2022-05-25 20 2021-11-25 100.0
2 1 2022-05-25 50 2021-11-25 100.0
3 1 2022-05-20 30 2021-11-20 30.0
4 1 2021-09-05 40 2021-03-05 40.0
5 2 2022-06-02 80 2021-12-02 80.0
6 3 2021-03-01 50 2020-09-01 70.0
7 3 2021-02-01 20 2020-08-01 20.0
Using multiprocessing, and treating 6 months as 180 days to reduce the memory footprint and the computation time. Copy the following code to a Python file and run it from the console (not from a Jupyter Notebook).
import pandas as pd
import numpy as np
import multiprocessing as mp
import time
def sum_sales(customer, df):
    # 1st pass: sum sales of same days, reduce the row numbers
    df1 = df.groupby('fromDate')['sales'].sum()
    # Generate all missing dates
    df1 = df1.reindex(pd.date_range(df1.index.min(), df1.index.max(), freq='D'), fill_value=0)
    # 2nd pass: use a sliding window of 180 days to sum
    df1 = df1.rolling(180, min_periods=0).sum().astype(int)
    # Restore original index for the group
    df1 = df1.reindex(df['fromDate']).reset_index(drop=True).to_frame().set_index(df.index)
    return df1
if __name__ == '__main__':  # Do not remove this line! Mandatory
    # Setup a minimal reproducible example
    N = 3_000_000
    D = pd.to_datetime('2021-1-1')
    rng = np.random.default_rng(2022)
    dti = D + pd.to_timedelta(rng.integers(0, 365*2, N), unit='D')
    cust = rng.integers(0, 75000, N)
    sales = rng.integers(1, 200, N)
    df = pd.DataFrame({'customerId': cust, 'fromDate': dti, 'sales': sales})

    # Ensure your dataframe is sorted by fromDate for the rolling window
    df = df.sort_values(['customerId', 'fromDate'], ignore_index=True)

    start = time.time()
    with mp.Pool(mp.cpu_count() - 1) as p:
        results = p.starmap(sum_sales, df.groupby('customerId'))
    df['total_sales'] = pd.concat(results)
    end = time.time()

    print(f"Elapsed time: {end - start:.2f} seconds")
For 3 million records and 75k different customers over 2 years (730 days):
[...]$ python mp.py
Elapsed time: 24.36 seconds
However, the number of sales per customer is much better balanced than yours:
>>> df['customerId'].value_counts().describe(percentiles=np.linspace(0, 1, 11))
count 75000.000000
mean 40.000000
std 6.349157
min 15.000000
0% 15.000000
10% 32.000000
20% 35.000000
30% 37.000000
40% 38.000000
50% 40.000000
60% 41.000000
70% 43.000000
80% 45.000000
90% 48.000000 # <- check the 90th percentile of your data
100% 73.000000
max 73.000000 # <- max transactions for a single customer
Name: customerId, dtype: float64
Because the sales are evenly distributed per customer, my sample takes full advantage of multiprocessing. In your case, I don't think that will be true (check the 90th percentile).
The check with your dataframe:
>>> df
customerId fromDate sales total_sales
0 1 2022-06-01 100 200
1 1 2022-05-25 20 100
2 1 2022-05-25 50 100
3 1 2022-05-20 30 30
4 1 2021-09-05 40 40
5 2 2022-06-02 80 80
6 3 2021-03-01 50 70
7 3 2021-02-01 20 20
If you decide to keep a variable moving window of 6 months instead of a fixed moving window of 180 days, the algorithm will be the same. The important point in the code is to reduce the number of rows per customer. In your sample, you can group the sales for the same (customer, date): customer 1 has 2 rows for 2022-05-25, so you can sum them immediately.
IIUC, in your real data you have a customer with 205,284 sales between 2021-02-22 and 2022-03-26 (397 days), so this user has an average of 517 transactions per day (?). If you sum the sales of same days, you reduce the number of records from 205,284 to 397...
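A minimal sketch of that idea, on the question's sample data and without the multiprocessing part: pre-aggregate duplicate (customerId, fromDate) rows, then apply a time-based 180-day rolling sum per customer on the reduced data. The 180-day window is an approximation of "6 months", and the variable names daily/out are just for this illustration:

import pandas as pd

df = pd.DataFrame({
    "customerId": [1, 1, 1, 1, 1, 2, 3, 3],
    "fromDate": pd.to_datetime([
        "2022-06-01", "2022-05-25", "2022-05-25", "2022-05-20",
        "2021-09-05", "2022-06-02", "2021-03-01", "2021-02-01",
    ]),
    "sales": [100, 20, 50, 30, 40, 80, 50, 20],
})

# 1st pass: one row per (customer, day) -- customer 1's two 2022-05-25 rows collapse into one
daily = df.groupby(["customerId", "fromDate"], as_index=False)["sales"].sum()

# 2nd pass: 180-day time-based rolling sum per customer on the reduced data
daily = daily.sort_values(["customerId", "fromDate"])
daily["total_sales"] = (
    daily.set_index("fromDate")
         .groupby("customerId")["sales"]
         .rolling("180D")
         .sum()
         .to_numpy()
)

# map the per-day totals back onto the original rows
out = df.merge(daily[["customerId", "fromDate", "total_sales"]],
               on=["customerId", "fromDate"], how="left")
print(out)

The point of the first pass is exactly the reduction described above: each customer ends up with at most one row per day before any windowing is done.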