简体   繁体   English

LocalCluster() 如何影响任务数量?

[英]How does LocalCluster() affect the number of tasks?

Do the calculations (like dask method dd.merge) need to be done inside or outside the LocalCluster?计算(如dask方法dd.merge)需要在LocalCluster内部还是外部完成? Do final calculations (like .compute) need to be done inside or outside the LocalCluster?最终计算(如 .compute)需要在 LocalCluster 内部还是外部完成?

My main question is - how does LocalCluster() affect the number of tasks?我的主要问题是 - LocalCluster() 如何影响任务数量?

I and my colleague noticed that placing dd.merge outside of LocalCLuster() downgraded the number of tasks significantly (like 10x or smth like that).我和我的同事注意到将 dd.merge 放在 LocalCLuster() 之外会显着降低任务数量(例如 10x 或类似的 smth)。 What is the reason for that?这是什么原因?

pseudo example伪例子

many tasks:许多任务:

dd.read_parquet(somewhere, index=False)

with LocalCluster(
        n_workers=8,
        processes=True,
        threads_per_worker=1,
        memory_limit="10GB",
        ip="tcp://localhost:9895",
    ) as cluster, Client(cluster) as client:
 dd.merge(smth)
 smth..to_parquet(
            somewhere, engine="fastparquet", compression="snappy"
        )

few tasks:几个任务:

dd.read_parquet(somewhere, index=False)
dd.merge(smth)

with LocalCluster(
        n_workers=8,
        processes=True,
        threads_per_worker=1,
        memory_limit="10GB",
        ip="tcp://localhost:9895",
    ) as cluster, Client(cluster) as client:
 
 smth..to_parquet(
            somewhere, engine="fastparquet", compression="snappy"
        )

The performance difference is due to the difference in the schedulers being used.性能差异是由于使用的调度程序不同。

According the the dask docs :根据 dask 文档

The dask collections each have a default scheduler每个 dask 集合都有一个默认的调度程序

dask.dataframe use the threaded scheduler by default dask.dataframe 默认使用线程调度器

The default scheduler is what is used when there is not another scheduler registered.默认调度程序是在没有其他调度程序注册时使用的。

Additionally, according to the dask distributed docs :此外,根据 dask 分发的文档

When we create a Client object it registers itself as the default Dask scheduler.当我们创建一个 Client 对象时,它会将自己注册为默认的 Dask 调度程序。 All .compute() methods will automatically start using the distributed system.所有 .compute() 方法将自动开始使用分布式系统。

So when operating within the context manager for the cluster, computations implicitly use that scheduler.因此,当在集群的上下文管理器中运行时,计算会隐式地使用该调度程序。

A couple of additional notes: It may be the case that the default scheduler is using more threads than the local cluster you are defining.一些额外的注意事项:默认调度程序使用的线程可能多于您定义的本地集群。 It is also possible that a significant difference in performance is due to the overhead of inter-process communication that is not incurred by the threaded scheduler.性能的显着差异也可能是由于线程调度程序未引起的进程间通信开销。 More information about the schedulers is available here .此处提供有关调度程序的更多信息。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM