
Passing dask DataFrame as argument to task

Is it best practice to pass a dd.DataFrame as an argument to a task via Client.submit, in order to move work that requires a concrete dataframe onto a worker instead of the client? The following seems to work, though it's not clear whether it is the best choice:

import dask.dataframe as dd  # needed for the dd.DataFrame annotation

def my_task(ddf: dd.DataFrame) -> None:
    df = ddf.compute()  # materialize the full pandas DataFrame on the worker
    ...  # work requiring the concrete pd.DataFrame

f = client.submit(my_task, ddf)

The only other alternative I can think of would be to repartition to a single partition and then operate on that.
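One way to sketch that alternative without calling compute() on the client (this is illustrative only; it assumes the same client and ddf as above, and the names delayed_block and block_future are mine):

def my_task(df) -> None:
    ...  # df is already a concrete pd.DataFrame here

# Collapse to one partition and take its single delayed pandas block.
delayed_block = ddf.repartition(npartitions=1).to_delayed()[0]

# Turn the delayed block into a future; the pandas DataFrame is built on a worker.
block_future = client.compute(delayed_block)

# The worker running my_task receives the concrete DataFrame, not a dask graph.
f = client.submit(my_task, block_future)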

While this is not explicitly mentioned in the list of best practices, doing it the way you describe is probably preferable.

Imagine a scenario where your client has very few resources (e.g. a laptop), while the workers have large resources (e.g. they are on an HPC cluster). In this case, bringing the result of a computation to the client might not even be feasible (e.g. the computed dataframe is too large for the laptop, even though the workers can compute it).

Here's the relevant bit from the Dask best-practices docs:

Avoid repeatedly putting large inputs into delayed calls

Every time you pass a concrete result (anything that isn't delayed), Dask will hash it by default to give it a name. This is fairly fast (around 500 MB/s) but can be slow if you do it over and over again. Instead, it is better to delay your data as well. This is especially important when using a distributed cluster to avoid sending your data separately for each function call.

So passing around dask objects (dask dataframes, delayed objects, futures, etc.) minimizes the amount of data transfer and avoids potential problems when client resources are limited. For example, see the sketch below.
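A hypothetical illustration of that advice (the names process, big_df, and big_future are mine, not from the docs): send a large concrete object to the cluster once and pass the resulting future, instead of re-shipping the data with every call.

import pandas as pd
from dask.distributed import Client

client = Client()  # assumes a running cluster, or starts a local one

def process(df: pd.DataFrame, i: int) -> float:
    # Some per-call work on the shared DataFrame.
    return df["x"].sum() + i

big_df = pd.DataFrame({"x": range(1_000_000)})

# Wasteful: big_df is hashed and shipped for every submit call.
# futures = [client.submit(process, big_df, i) for i in range(10)]

# Better: scatter it once, then pass the lightweight future around.
big_future = client.scatter(big_df, broadcast=True)
futures = [client.submit(process, big_future, i) for i in range(10)]
results = client.gather(futures)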
