简体   繁体   English

将 dask DataFrame 作为参数传递给任务

[英]Passing dask DataFrame as argument to task

Is it best practice to pass a dd.DataFrame as an argument to a task via Client.submit to move work requiring a concretized dataframe to a worker instead of on the client?最佳实践是通过Client.submitdd.DataFrame作为参数传递给任务,以将需要具体化 dataframe 的工作转移给工作人员而不是客户端? The following seems to work, though its not clear if this is the best choice:以下似乎可行,但尚不清楚这是否是最佳选择:

def my_task(ddf: dd.DataFrame) -> None:
    df = ddf.compute()
    ...  # Work requiring the concrete pd.DataFrame

f = client.submit(my_task, ddf)

The only other alterative I can think of would be to repartition with a single partition and then operate.我能想到的唯一其他替代方法是使用单个分区重新分区,然后运行。

While this is not explicitly mentioned on the list of best practices, it is probably preferable to do it as you specified.虽然最佳实践列表中没有明确提及这一点,但最好按照您指定的方式进行。

Imagine a scenario where your client has very few resources (eg a laptop), while workers have large resources (eg they are on an HPC cluster).想象一个场景,您的客户端资源很少(例如笔记本电脑),而工作人员拥有大量资源(例如,它们位于 HPC 集群上)。 In this case, bringing the result of a computation to the client might not even be feasible (eg the computed dataframe is too large for the laptop, although workers can compute it).在这种情况下,将计算结果带到客户端甚至可能是不可行的(例如,计算的 dataframe 对于笔记本电脑来说太大了,尽管工作人员可以计算它)。

Here's a relevant bit from the docs :这是文档中的相关内容:

Avoid repeatedly putting large inputs into delayed calls Every time you pass a concrete result (anything that isn't delayed) Dask will hash it by default to give it a name.避免重复将大量输入放入延迟调用中 每次传递具体结果(任何未延迟的内容)时,Dask 都会默认为它命名。 This is fairly fast (around 500 MB/s) but can be slow if you do it over and over again.这是相当快的(大约 500 MB/s),但如果你一遍又一遍地这样做可能会很慢。 Instead, it is better to delay your data as well.相反,最好也延迟您的数据。 This is especially important when using a distributed cluster to avoid sending your data separately for each function call.这在使用分布式集群避免为每个 function 调用单独发送数据时尤其重要。

So passing around the dask objects (dask dataframes, delayed, future, etc.) will minimize the amount of data transfer and avoid the potential problems when client resources are minimal.因此,传递 dask 对象(dask 数据帧、延迟、未来等)将最大限度地减少数据传输量并避免客户端资源最少时的潜在问题。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM