
Passing dask DataFrame as argument to task

Is it best practice to pass a dd.DataFrame as an argument to a task via Client.submit, in order to move work that requires a concrete dataframe onto a worker instead of the client? The following seems to work, though it's not clear whether it is the best choice:

import dask.dataframe as dd  # needed for the dd.DataFrame annotation

def my_task(ddf: dd.DataFrame) -> None:
    df = ddf.compute()  # materialize the full pandas DataFrame on the worker
    ...  # work requiring the concrete pd.DataFrame

f = client.submit(my_task, ddf)

The only other alternative I can think of would be to repartition to a single partition and then operate on that.
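One way to sketch that alternative without calling compute() on the client (this is illustrative only; it assumes the same client and ddf as above, and the names delayed_block and block_future are mine):

def my_task(df) -> None:
    ...  # df is already a concrete pd.DataFrame here

# Collapse to one partition and take its single delayed pandas block.
delayed_block = ddf.repartition(npartitions=1).to_delayed()[0]

# Turn the delayed block into a future; the pandas DataFrame is built on a worker.
block_future = client.compute(delayed_block)

# The worker running my_task receives the concrete DataFrame, not a dask graph.
f = client.submit(my_task, block_future)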

While this is not explicitly mentioned in the list of best practices, doing it the way you describe is probably preferable.

Imagine a scenario where your client has very few resources (e.g. a laptop), while the workers have large resources (e.g. they are on an HPC cluster). In this case, bringing the result of a computation to the client might not even be feasible (e.g. the computed dataframe is too large for the laptop, even though the workers can compute it).

Here's the relevant bit from the Dask best-practices docs:

Avoid repeatedly putting large inputs into delayed calls

Every time you pass a concrete result (anything that isn't delayed), Dask will hash it by default to give it a name. This is fairly fast (around 500 MB/s) but can be slow if you do it over and over again. Instead, it is better to delay your data as well. This is especially important when using a distributed cluster to avoid sending your data separately for each function call.

So passing around dask objects (dask dataframes, delayed objects, futures, etc.) minimizes the amount of data transfer and avoids potential problems when client resources are limited. For example, see the sketch below.
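A hypothetical illustration of that advice (the names process, big_df, and big_future are mine, not from the docs): send a large concrete object to the cluster once and pass the resulting future, instead of re-shipping the data with every call.

import pandas as pd
from dask.distributed import Client

client = Client()  # assumes a running cluster, or starts a local one

def process(df: pd.DataFrame, i: int) -> float:
    # Some per-call work on the shared DataFrame.
    return df["x"].sum() + i

big_df = pd.DataFrame({"x": range(1_000_000)})

# Wasteful: big_df is hashed and shipped for every submit call.
# futures = [client.submit(process, big_df, i) for i in range(10)]

# Better: scatter it once, then pass the lightweight future around.
big_future = client.scatter(big_df, broadcast=True)
futures = [client.submit(process, big_future, i) for i in range(10)]
results = client.gather(futures)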
