What is the "right" way to close a Dask LocalCluster?

Question

I am trying to use dask-distributed on my laptop with a LocalCluster, but I have not yet found a way to let my application exit without raising warnings or triggering strange interactions with matplotlib (I am using the tkAgg backend).

For example, if I close the client first and then the cluster, tk cannot properly remove the image from memory and I get the following error:

Traceback (most recent call last):
  File "/opt/Python-3.6.0/lib/python3.6/tkinter/__init__.py", line 3501, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop

The following code reproduces the error:

from time import sleep
import numpy as np
import matplotlib.pyplot as plt
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    cluster = LocalCluster(
        n_workers=2,
        processes=True,
        threads_per_worker=1
    )
    client = Client(cluster)

    x = np.linspace(0, 1, 100)
    y = x * x
    plt.plot(x, y)

    print('Computation complete! Stopping workers...')
    client.close()
    sleep(1)
    cluster.close()

    print('Execution complete!')

The sleep(1) line makes the problem more likely to appear; it does not occur on every execution.

Every other combination I tried for stopping the execution (not closing the client, not closing the cluster, or closing neither) produces problems with tornado instead. Usually the following:

tornado.application - ERROR - Exception in Future <Future cancelled> after timeout

What is the right combination to stop the local cluster and the client? Am I missing something?

These are the libraries that I am using:

python 3.[6,7].0
tornado 5.1.1
dask 0.20.0
distributed 1.24.0
matplotlib 3.0.1

Thank you for your help!

Answer 1

In our experience, the best way is to use a context manager, for example:

import numpy as np
import matplotlib.pyplot as plt
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    cluster = LocalCluster(
        n_workers=2,
        processes=True,
        threads_per_worker=1
    )
    with Client(cluster) as client:
        x = np.linspace(0, 1, 100)
        y = x * x
        plt.plot(x, y)
        print('Computation complete! Stopping workers...')

    print('Execution complete!')

Answer 2

Expanding on skibee's answer, here is a pattern I use. It sets up a temporary LocalCluster and then shuts it down. This is very useful when different parts of your code must be parallelized in different ways (e.g. one needs threads and the other needs processes).

from dask.distributed import Client, LocalCluster
import multiprocessing as mp

with LocalCluster(n_workers=int(0.9 * mp.cpu_count()),
    processes=True,
    threads_per_worker=1,
    memory_limit='2GB',
    ip='tcp://localhost:9895',
) as cluster, Client(cluster) as client:
    # Do something using 'client'
    pass  # placeholder so the with block has a body

What's happening above:

A cluster is spun up on your local machine (i.e. the one running the Python interpreter). The scheduler of this cluster listens on port 9895.
The cluster is created and a number of workers are started. Each worker is a process, since I specified processes=True.
The number of workers is 90% of the number of CPU cores, rounded down. So an 8-core machine will spawn 7 worker processes. This leaves at least one core free for SSH / Notebook server / other applications.
Each worker is initialized with 2GB of RAM. Having a temporary cluster allows you to spin up workers with different amounts of RAM for different workloads.
Once the with block exits, both cluster.close() and client.close() are called. The first closes the cluster, scheduler, nanny and all workers; the second disconnects the client (created in your Python interpreter) from the cluster. Because a with statement exits its context managers in reverse order, the client is closed before the cluster.
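The exit order described in the last point can be checked without Dask. The sketch below uses a hypothetical stand-in class, Resource (not part of Dask), that records when it is entered and exited. It only illustrates how a combined with statement unwinds; it says nothing about what Dask does internally while closing.

```python
class Resource:
    """Stand-in for a cluster or client; records when it is opened and closed."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __enter__(self):
        self.log.append(f"open {self.name}")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.log.append(f"close {self.name}")
        return False  # do not swallow exceptions raised in the block


def run_demo():
    log = []
    # Same shape as: with LocalCluster(...) as cluster, Client(cluster) as client
    with Resource("cluster", log) as cluster, Resource("client", log) as client:
        log.append("work")
    return log


events = run_demo()
print(events)
```

The resource listed last in the with statement is released first, which is why putting the client after the cluster disconnects the client before the scheduler and workers go away.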

While the workers are processing, you can check whether the cluster is active by running lsof -i :9895. If there is no output, the cluster has closed.
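If lsof is not available, a similar check can be approximated from Python with the standard library. This is a minimal sketch: port_is_open is a hypothetical helper name, and it only tells you that something accepts TCP connections on the port, not that it is a Dask scheduler.

```python
import socket


def port_is_open(port, host="localhost", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection tries every address the host name resolves to.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# 9895 is the scheduler port used in the example above.
print(port_is_open(9895))
```

Calling this inside the with block and again after it exits should, under these assumptions, show the port going from open to closed.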

Sample use-case: suppose you want to use a pre-trained ML model to predict on 1,000,000 examples.

The model is optimized/vectorized such that it can predict on 10K examples pretty fast, but 1M is slow. In such a case, a setup that works is to load multiple copies of the model from disk and then use them to predict on chunks of the 1M examples.

Dask allows you to do this pretty easily and achieve a good speedup:
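The chunking idea itself does not depend on any library. Here is a minimal sequential sketch in plain Python, where predict_chunk is a hypothetical stand-in for a real model's predict call and the chunk size is shrunk to 3 for readability:

```python
def split_into_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def predict_chunk(chunk):
    # Hypothetical stand-in for model.predict: label each example by its sign.
    return [1 if value >= 0 else 0 for value in chunk]


def predict_in_chunks(items, chunk_size):
    """Predict chunk by chunk and stitch the results back in the original order."""
    results = []
    for chunk in split_into_chunks(items, chunk_size):
        results.extend(predict_chunk(chunk))
    return results


examples = [-2, -1, 0, 1, 2, 3, -4]
labels = predict_in_chunks(examples, chunk_size=3)
print(labels)
```

Each chunk is independent of the others, which is what makes it possible to hand the chunks to separate worker processes instead of looping over them.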

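import numpy as np

def load_and_predict(input_data_chunk):
    # Runs inside a worker process: each call loads its own copy of the model.
    model_path = '...' # On your disk, so accessible by all processes.
    model = some_library.load_model(model_path)  # some_library is a placeholder
    labels, scores = model.predict(input_data_chunk, ...)
    return np.array([labels, scores])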

# (not shown) Load `input_data`, a list of your 1M examples.

import dask.array as DaskArray

da_input_data = DaskArray.from_array(input_data, chunks=(10_000,))

prediction_results = None
with LocalCluster(n_workers=int(0.9 * mp.cpu_count()),
    processes=True,
    threads_per_worker=1,
    memory_limit='2GB',
    ip='tcp://localhost:9895',
) as cluster, Client(cluster) as client:
    prediction_results = da_input_data.map_blocks(load_and_predict).compute()

# prediction_results is a NumPy array assembled from the per-chunk outputs
# (labels and scores for each block of 10,000 examples).

References:

Setting up a local cluster: https://distributed.dask.org/en/latest/local-cluster.html
Client close method: https://distributed.dask.org/en/latest/api.html#distributed.Client.close
Scheduler close method, which from my understanding is what cluster.close() invokes: https://distributed.dask.org/en/latest/scheduling-state.html#distributed.scheduler.Scheduler.close
with statement having multiple variables: https://stackoverflow.com/a/1073814/4900327

Answer 1
2 2019-04-24 06:53:24

Answer 2
0 2019-11-24 07:59:57
