
Dask (delayed) vs pandas/function returns

I am studying dask as a way to parallelize computations over some large datasets I have.

I have code that goes through a list of transactions and extracts the number of active customers for each period (an active customer is one with at least one transaction in the last 90 days).

This is the code for the sample data:

import pandas as pd
import numpy as np
from datetime import date, timedelta, datetime
import dask.dataframe as dd
import dask 

num_variables = 10000
rng = np.random.default_rng()

df = pd.DataFrame({
    'id' :  np.random.randint(1,999999999,num_variables),
    'date' : [np.random.choice(pd.date_range(datetime(2021,6,1),datetime(2022,12,31))) for i in range(num_variables)],
    'product' : [np.random.choice(['giftcards', 'afiliates']) for i in range(num_variables)],
    'brand' : [np.random.choice(['brand_1', 'brand_2', 'brand_4', 'brand_6']) for i in range(num_variables)],
    'gmv': rng.random(num_variables) * 100,
    'revenue': rng.random(num_variables) * 100})

This is "way 1" (using pandas and plain functions):

def active_clients(df : pd.DataFrame , date : date):
    date1 = (date - timedelta(days=90))
    date2 = date
    clients_base = df.loc[(df['date'].dt.date >= date1) & (df['date'].dt.date <= date2),'id'].nunique()
    return (date, clients_base)

months = []
results = []

dates = df.date.dt.to_period('M').drop_duplicates()
for i in dates:
    test = pd.Period(i,freq='M').end_time.date()
    months.append(test)

for i in months:
    test = active_clients(df,i)
    results.append(test)

results

The result here is a list of tuples:

[(datetime.date(2022, 7, 31), 24),
 (datetime.date(2022, 10, 31), 48),
 (datetime.date(2022, 12, 31), 43),
 (datetime.date(2022, 8, 31), 42),
 (datetime.date(2022, 9, 30), 46),
 (datetime.date(2022, 11, 30), 46),
 (datetime.date(2022, 6, 30), 11)]

This is "way 2" (using dask delayed and functions):

Now I am trying to do exactly the same thing, using dask delayed to parallelize the calculation.

@dask.delayed
def active_clients(df : pd.DataFrame , date : date):
    date1 = (date - timedelta(days=90))
    date2 = date
    clients_base = df.loc[(df['date'].dt.date >= date1) & (df['date'].dt.date <= date2),'id'].nunique()
    return (date, clients_base)

months = []
results = []

dates = df.date.dt.to_period('M').drop_duplicates()
for i in dates:
    test = dask.delayed(pd.Period(i,freq='M').end_time.date())
    months.append(test)

for i in months:
    test = dask.delayed(active_clients(df,i))
    results.append(test)

resultados = dask.compute(results)

resultados:

([(datetime.date(2022, 7, 31), 24),
  (datetime.date(2022, 10, 31), 48),
  (datetime.date(2022, 12, 31), 43),
  (datetime.date(2022, 8, 31), 42),
  (datetime.date(2022, 9, 30), 46),
  (datetime.date(2022, 11, 30), 46),
  (datetime.date(2022, 6, 30), 11)],)

The issues here are:

  1. The code above returns a tuple containing a list of tuples (unlike the pandas version).
  2. It does not seem to run in parallel, as only one core appears to be busy. What am I doing wrong?

Thanks

One quick fix is to remove the nested delayed calls: active_clients is already decorated with delayed, so there is no need to wrap its result in another delayed:

@dask.delayed
def active_clients(df : pd.DataFrame , date : date):
    date1 = (date - timedelta(days=90))
    date2 = date
    clients_base = df.loc[(df['date'].dt.date >= date1) & (df['date'].dt.date <= date2),'id'].nunique()
    return (date, clients_base)

months = []
results = []

dates = df.date.dt.to_period('M').drop_duplicates()
months = [pd.Period(i,freq='M').end_time.date() for i in dates]

for i in months:
    test = active_clients(df,i)  # note this will be delayed due to decoration of active_clients
    results.append(test)

resultados = dask.compute(*results)  # unpacking the list returns a flat tuple of results
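
To see why the extra wrapper is unnecessary, it may help to inspect what a decorated function returns. The following is a minimal sketch; `add_one` is a hypothetical stand-in for active_clients and is not part of the original code:

```python
import dask
from dask.delayed import Delayed


@dask.delayed
def add_one(x):
    # Hypothetical stand-in for active_clients.
    return x + 1


# Calling the decorated function does no work yet; it only builds a lazy task.
task = add_one(1)
is_lazy = isinstance(task, Delayed)

# The work happens only when the task is computed.
value = task.compute()

print(is_lazy, value)
```

Because the call already produces a Delayed object, the loop only needs to collect those objects and hand them to dask.compute.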

dask.compute always returns a tuple, because it is designed to accept multiple delayed values. If you unpack the list of delayeds, resultados will be a flat tuple containing one result per delayed call.
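
The difference in return shape can be sketched on a small example. Here `square` is a hypothetical function used only for illustration; like active_clients, it returns an (input, result) tuple:

```python
import dask


@dask.delayed
def square(x):
    # Hypothetical stand-in for active_clients: returns an (input, result) tuple.
    return (x, x * x)


tasks = [square(i) for i in range(3)]

# One argument (the whole list) in, so a 1-tuple wrapping a list comes out.
nested = dask.compute(tasks)

# Each Delayed passed as its own argument, so the results form a flat tuple.
flat = dask.compute(*tasks)

# Convert to a list to match the shape produced by the plain pandas loop.
as_list = list(flat)

print(nested)
print(flat)
print(as_list)
```

If the original call style is preferred, taking the first element, as in `dask.compute(tasks)[0]`, should give the same list.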
