Dask (delayed) vs pandas/function returns
I am studying Dask as a solution for parallel computing over some big data I have.
I have code that checks a list of transactions and extracts the number of active customers for every period (an active customer is one with any transaction within the last 90 days).
This is the code for the sample data:
import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import dask.dataframe as dd
import dask

num_variables = 10000
rng = np.random.default_rng()
df = pd.DataFrame({
    'id' : np.random.randint(1,999999999,num_variables),
    'date' : [np.random.choice(pd.date_range(datetime(2021,6,1),datetime(2022,12,31))) for i in range(num_variables)],
    'product' : [np.random.choice(['giftcards', 'afiliates']) for i in range(num_variables)],
    'brand' : [np.random.choice(['brand_1', 'brand_2', 'brand_4', 'brand_6']) for i in range(num_variables)],
    'gmv': rng.random(num_variables) * 100,
    'revenue': rng.random(num_variables) * 100})
This is "way 1" (using pandas and plain functions):
def active_clients(df : pd.DataFrame , date : date):
    # Count distinct customer ids with a transaction in the 90 days up to `date`
    date1 = (date - timedelta(days=90))
    date2 = date
    clients_base = df.loc[(df['date'].dt.date >= date1) & (df['date'].dt.date <= date2),'id'].nunique()
    return (date, clients_base)

months = []
results = []
dates = df.date.dt.to_period('M').drop_duplicates()
# Last calendar day of every month present in the data
for i in dates:
    test = pd.Period(i,freq='M').end_time.date()
    months.append(test)
for i in months:
    test = active_clients(df,i)
    results.append(test)
results
The result here is a list of tuples:
[(datetime.date(2022, 7, 31), 24),
(datetime.date(2022, 10, 31), 48),
(datetime.date(2022, 12, 31), 43),
(datetime.date(2022, 8, 31), 42),
(datetime.date(2022, 9, 30), 46),
(datetime.date(2022, 11, 30), 46),
(datetime.date(2022, 6, 30), 11)]
This is "way 2" (using dask delayed and functions):
Now I am trying to do exactly the same thing using dask delayed as a way to parallelize the calculation.
@dask.delayed
def active_clients(df : pd.DataFrame , date : date):
    date1 = (date - timedelta(days=90))
    date2 = date
    clients_base = df.loc[(df['date'].dt.date >= date1) & (df['date'].dt.date <= date2),'id'].nunique()
    return (date, clients_base)

months = []
results = []
dates = df.date.dt.to_period('M').drop_duplicates()
for i in dates:
    test = dask.delayed(pd.Period(i,freq='M').end_time.date())
    months.append(test)
for i in months:
    test = dask.delayed(active_clients(df,i))
    results.append(test)
resultados = dask.compute(results)
resultados:
([(datetime.date(2022, 7, 31), 24),
(datetime.date(2022, 10, 31), 48),
(datetime.date(2022, 12, 31), 43),
(datetime.date(2022, 8, 31), 42),
(datetime.date(2022, 9, 30), 46),
(datetime.date(2022, 11, 30), 46),
(datetime.date(2022, 6, 30), 11)],)
The issues here are:
Thanks
One quick fix to your code is to remove the nested delayed calls: active_clients is already decorated with @dask.delayed, so there is no need to wrap its calls in another dask.delayed. The month-end dates are cheap to build, so they can be computed eagerly as well:
@dask.delayed
def active_clients(df : pd.DataFrame , date : date):
    date1 = (date - timedelta(days=90))
    date2 = date
    clients_base = df.loc[(df['date'].dt.date >= date1) & (df['date'].dt.date <= date2),'id'].nunique()
    return (date, clients_base)

results = []
dates = df.date.dt.to_period('M').drop_duplicates()
months = [pd.Period(i,freq='M').end_time.date() for i in dates]
for i in months:
    test = active_clients(df,i)  # note this will be delayed due to decoration of active_clients
    results.append(test)
resultados = dask.compute(*results)  # this returns a tuple with one result per delayed task
dask.compute always returns a tuple, because it is designed to accept multiple delayed values as separate arguments. If you unpack the list of delayed objects with *results, the computed results are placed in resultados as a flat tuple, one element per task. Passing the list itself, as in dask.compute(results), returns a one-element tuple that contains the list, which is the output shown in the question.
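To make the difference in return shapes concrete, here is a minimal sketch. The function square is a hypothetical stand-in for active_clients (it is not from the question), chosen so the example runs without any data:

```python
import dask

@dask.delayed
def square(x):
    # Hypothetical stand-in for active_clients: returns an (input, value) tuple
    return (x, x * x)

tasks = [square(i) for i in range(3)]  # list of three Delayed objects

# One argument (the list) -> a 1-tuple whose only element is a list of results
packed = dask.compute(tasks)

# Three arguments (the list unpacked) -> a 3-tuple, one result per task
unpacked = dask.compute(*tasks)

# Unpacking the 1-tuple is another way to get a plain list back
(as_list,) = dask.compute(tasks)

print(packed)
print(unpacked)
print(as_list)
```

If you want a list like the one "way 1" produces, either the (as_list,) unpacking or list(dask.compute(*tasks)) should do.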