Difference between dask.distributed LocalCluster with threads vs. processes

What is the difference between the following LocalCluster configurations for dask.distributed?

Client(n_workers=4, processes=False, threads_per_worker=1)

versus

Client(n_workers=1, processes=True, threads_per_worker=4)

They both have four threads working on the task graph, but the first has four workers. What, then, would be the benefit of having multiple workers acting as threads as opposed to a single worker with multiple threads?

Edit: just a clarification — I'm aware of the difference between processes, threads and shared memory, so this question is oriented more towards the configuration differences between these two Clients.

When you use processes=False you constrain your cluster to run entirely within your own machine.

from dask.distributed import Client

# With processes=False the scheduler address is an in-process transport address.
# Communication happens between threads: the scheduler and workers
# all live inside the client's process on the same machine.
client = Client(processes=False)
client
<Client: scheduler='inproc://10.0.0.168/31904/1' processes=1 cores=4>

# With processes=True the scheduler address uses the TCP protocol.
# This is a network address: you can start workers on other machines
# simply by pointing them at this scheduler address
# (all machines must be on the same network).
client = Client(processes=True)
client
<Client: scheduler='tcp://127.0.0.1:53592' processes=4 cores=4>
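To sketch that last point: with a TCP scheduler address you could (hypothetically, assuming dask is installed on both machines and they can reach each other) attach extra workers from the command line. Note that the 127.0.0.1 address printed by a default LocalCluster is loopback-only, so for a genuinely remote worker the scheduler would have to listen on a reachable interface.

```shell
# On the scheduler machine: prints a tcp:// address to connect to
dask-scheduler

# On another machine on the same network, point a worker at that address.
# The address below is just the placeholder from the example output above.
dask-worker tcp://127.0.0.1:53592
```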

You are conflating a couple of different things here:

  • the balance between the number of processes and threads, with different mixtures favouring different workloads. More threads per worker mean better sharing of memory resources and less serialisation; fewer threads and more processes mean less contention on the GIL

  • with processes=False, both the scheduler and workers are run as threads within the same process as the client. Again, they will share memory resources, and you will not even have to serialise objects between the client and scheduler. However, you will have many threads, and the client may become less responsive. This is commonly used for testing, as the scheduler and worker objects can be directly introspected.
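The threads-vs-processes trade-off in the first bullet can be sketched without dask at all, using the standard library's concurrent.futures — a minimal illustration of the general principle, not dask's actual worker machinery:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count(n):
    # A pure-Python loop holds the GIL, so threads cannot run it in parallel.
    total = 0
    for i in range(n):
        total += i
    return total

def run(pool_cls):
    with pool_cls(max_workers=4) as pool:
        return list(pool.map(count, [200_000] * 4))

if __name__ == "__main__":
    # Threads share one interpreter: no serialisation of arguments/results,
    # but CPU-bound pure-Python work is GIL-bound.
    print(run(ThreadPoolExecutor))
    # Processes: arguments and results are pickled between processes,
    # but the loops genuinely run in parallel, one per core.
    print(run(ProcessPoolExecutor))
```

Both calls return the same values; the difference is only in where the time goes (GIL contention vs. serialisation overhead), which is exactly the balance dask lets you tune with n_workers and threads_per_worker.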

I was inspired by both Victor's and Martin's answers to dig a little deeper, so here's an in-depth summary of my understanding. (Couldn't fit it in a comment.)

First, note that the scheduler printout in this version of dask isn't quite intuitive: processes is actually the number of workers, and cores is actually the total number of threads across all workers.

Secondly, Victor's points about the TCP address and adding/connecting more workers are worth highlighting. I'm not sure whether more workers can be added to a cluster with processes=False, but I think the answer is probably yes.

Now, consider the following script:

from dask.distributed import Client

if __name__ == '__main__':
    with Client(processes=False) as client:  # Config 1
        print(client)
    with Client(processes=False, n_workers=4) as client:  # Config 2
        print(client)
    with Client(processes=False, n_workers=3) as client:  # Config 3
        print(client)
    with Client(processes=True) as client:  # Config 4
        print(client)
    with Client(processes=True, n_workers=3) as client:  # Config 5
        print(client)
    with Client(processes=True, n_workers=3,
                threads_per_worker=1) as client:  # Config 6
        print(client)

This produces the following output in dask version 2.3.0 on my laptop (4 cores):

<Client: scheduler='inproc://90.147.106.86/14980/1' processes=1 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/9' processes=4 cores=4>
<Client: scheduler='inproc://90.147.106.86/14980/26' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51744' processes=4 cores=4>
<Client: scheduler='tcp://127.0.0.1:51788' processes=3 cores=6>
<Client: scheduler='tcp://127.0.0.1:51818' processes=3 cores=3>

Here's my understanding of the differences between the configurations:

  1. The scheduler and all workers run as threads within the Client process. (As Martin said, this is useful for introspection.) Because neither the number of workers nor the number of threads per worker is given, dask calls its function nprocesses_nthreads() to set the defaults (with processes=False: 1 process and as many threads as available cores).
  2. Same as 1, but since n_workers was given, dask chooses the threads per worker so that the total number of threads equals the number of cores (i.e., 1 thread per worker). Again, processes in the printed output is not exactly correct — it's actually the number of workers (which in this case are really threads).
  3. Same as 2, but since n_workers doesn't divide the number of cores evenly, dask chooses 2 threads per worker, overcommitting rather than undercommitting.
  4. The Client, Scheduler and all workers are separate processes. Dask chooses the default number of workers (equal to cores, because cores <= 4) and the default number of threads per worker (1).
  5. Same as 4, but with n_workers=3 given; the total thread count is overcommitted for the same reason as in 3.
  6. This behaves as expected: 3 workers with 1 thread each.
