
How to use dask.delayed correctly

I did a timing experiment and I don't believe I'm using dask.delayed correctly. Here is the code:

import pandas as pd
import dask
import time

def my_operation(row_str: str):
    text_to_add = 'Five Michigan State University students—Ash Williams, his girlfriend, Linda; his sister, Cheryl; their friend Scott; and Scott's girlfriend Shelly—vacation at an isolated cabin in rural Tennessee. Approaching the cabin, the group notices the porch swing move on its own but suddenly stop as Scott grabs the doorknob. While Cheryl draws a picture of a clock, the clock stops, and she hears a faint, demonic voice tell her to "join us". Her hand becomes possessed, turns pale and draws a picture of a book with a demonic face on its cover. Although shaken, she does not mention the incident.'
    new_str = row_str + ' ' + text_to_add
    return new_str

def gen_sequential(n_rows: int):
    df = pd.read_csv('path/to/myfile.csv', nrows=n_rows)
    results_list = []
    tic = time.perf_counter()
    for ii in range(df.shape[0]):
        my_new_str = my_operation(df.iloc[ii, 0])
        results_list.append(my_new_str)
    toc = time.perf_counter()
    task_time = toc - tic
    return results_list, task_time

def gen_pandas_apply(n_rows: int):
    df = pd.read_csv('path/to/myfile.csv', nrows=n_rows)
    tic = time.perf_counter()
    df['gen'] = df['text'].apply(my_operation)
    toc = time.perf_counter()
    task_time = toc - tic
    return df, task_time

def gen_dask_compute(n_rows: int):
    df = pd.read_csv('path/to/myfile.csv', nrows=n_rows)
    results_list = []
    tic = time.perf_counter()
    for ii in range(df.shape[0]):
        my_new_str = dask.delayed(my_operation)(df.iloc[ii, 0])
        results_list.append(my_new_str)

    results_list = dask.compute(*results_list)
    toc = time.perf_counter()
    task_time = toc - tic
    return results_list, task_time

n_rows = 16
times = []
for ii in range(100):
    #_, t_dask_task = gen_sequential(n_rows)
    #_, t_dask_task = gen_pandas_apply(n_rows)
    _, t_dask_task = gen_dask_compute(n_rows)
    times.append(t_dask_task)
t_mean = sum(times)/len(times)
print('average time for 100 iterations: {}'.format(t_mean))

I ran the test for 8, 64, 256, 1024, 32768, 262144, and 1048576 rows of my file (which has about 2 million rows of text) and compared the results to gen_sequential() and gen_pandas_apply(). Here are the results:

n_rows    sequential[s]        pandas_apply[s]       dask_compute[s]
===========================================================================
8         0.000288928459959    0.001460871489944     0.002077747459807
---------------------------------------------------------------------------
64        0.001723313619877    0.001805401749916     0.011105699519758
---------------------------------------------------------------------------
256       0.006383508619801    0.00198456062968      0.046899785500136
---------------------------------------------------------------------------
1024      0.022589521310038    0.002799118410258     0.197301750000333
---------------------------------------------------------------------------
32768     0.63460024946984     0.035047864249209     5.91377260136054
---------------------------------------------------------------------------
262144    5.28406698709983     0.254192861450574     50.5853837806704
---------------------------------------------------------------------------
1048576   21.1142608421401     0.967728560800169     195.71797474096
---------------------------------------------------------------------------

I don't think I'm using dask.delayed properly, since for larger n_rows the average compute time is longer than with the other methods. I would expect the big advantage of dask.delayed to become apparent as the data set grows. Does anyone know where I am going wrong? Here is my setup:

  • python: 3.7.6
  • dask: 2.11.0
  • pandas: 1.0.5
  • OS: Pop!_OS 20.04 LTS
  • Virtual machine with 3 cores and 32GB memory

I'm currently reading up on Vaex, but at the moment I am confined to using dask for this project. Thanks in advance for your help!

The time it takes for my_operation to run on a single row is minuscule. Even with the "threaded" scheduler, Dask adds overhead per task, and Python's GIL means that non-vectorised pure-Python operations like this cannot actually run in parallel on threads anyway.
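You can see that per-task overhead directly by timing a batch of trivial delayed tasks (a minimal sketch; the printed figure depends entirely on your machine and scheduler):

```python
import time
import dask

def noop(x):
    # A task that does essentially no work, so any time measured
    # is Dask's scheduling overhead rather than computation.
    return x

n_tasks = 1000
tic = time.perf_counter()
results = dask.compute(*[dask.delayed(noop)(i) for i in range(n_tasks)])
toc = time.perf_counter()
print('overhead per task: {:.0f} us'.format((toc - tic) / n_tasks * 1e6))
```

If that per-task cost is larger than my_operation itself, scheduling one task per row can only make things slower.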

And just as you should avoid iterating over a pandas dataframe row by row, you should avoid looping over it and dispatching every single row as its own task for Dask to work on.
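If you do want to stick with dask.delayed, a common workaround is to batch many rows into each task so the scheduling overhead is amortised over thousands of calls. This is only a sketch: the chunk size, the process_chunk helper, and the stand-in my_operation are my own choices, not part of any Dask API.

```python
import dask
import pandas as pd

def my_operation(row_str: str) -> str:
    # Simplified stand-in for the original string-appending function.
    return row_str + ' some appended text'

def process_chunk(rows):
    # One task handles a whole block of rows, amortising Dask's
    # per-task scheduling cost over many calls.
    return [my_operation(r) for r in rows]

def gen_dask_chunked(df: pd.DataFrame, chunk_size: int = 10_000,
                     scheduler: str = 'processes'):
    col = df.iloc[:, 0].tolist()
    tasks = [
        dask.delayed(process_chunk)(col[i:i + chunk_size])
        for i in range(0, len(col), chunk_size)
    ]
    chunks = dask.compute(*tasks, scheduler=scheduler)
    # Flatten the per-chunk lists back into one result list.
    return [s for chunk in chunks for s in chunk]
```

With scheduler='processes' each chunk can run on a separate core, sidestepping the GIL, though for work this cheap plain pandas will still likely win.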

Did you know that Dask has a pandas-like dataframe API? You could do:

import dask.dataframe as dd
df = dd.read_csv('path/to/myfile.csv')
out = df['text'].map(my_operation)
result = out.compute()  # Dask is lazy; nothing runs until you call compute()

But remember: pandas is fast and efficient, so breaking your work into blocks for Dask will generally not be faster for something that fits in memory, especially if you are outputting data as big as the input (as opposed to aggregating).
