What is the “right” way to close a Dask LocalCluster?

I am trying to use dask-distributed on my laptop with a LocalCluster, but I have not yet found a way to let my application exit without raising warnings or triggering strange interactions with matplotlib (I am using the TkAgg backend).

For example, if I close the client and then the cluster, in that order, Tk cannot properly remove the image from memory and I get the following error:

Traceback (most recent call last):
  File "/opt/Python-3.6.0/lib/python3.6/tkinter/__init__.py", line 3501, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop

The following code generates this error:

from time import sleep
import numpy as np
import matplotlib.pyplot as plt
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    cluster = LocalCluster(
        n_workers=2,
        processes=True,
        threads_per_worker=1
    )
    client = Client(cluster)

    x = np.linspace(0, 1, 100)
    y = x * x
    plt.plot(x, y)

    print('Computation complete! Stopping workers...')
    client.close()
    sleep(1)
    cluster.close()

    print('Execution complete!')

The sleep(1) line makes the problem more likely to appear; it does not occur on every execution.

Every other combination I tried for stopping execution (not closing the client, not closing the cluster, or closing neither) causes problems with tornado instead, usually the following:

tornado.application - ERROR - Exception in Future <Future cancelled> after timeout

What is the right combination to stop the local cluster and the client? Am I missing something?

These are the libraries that I am using:

  • python 3.[6,7].0
  • tornado 5.1.1
  • dask 0.20.0
  • distributed 1.24.0
  • matplotlib 3.0.1

Thank you for your help!

In our experience, the best way is to use a context manager, for example:

import numpy as np
import matplotlib.pyplot as plt
from dask.distributed import Client, LocalCluster 

if __name__ == '__main__':
    cluster = LocalCluster(
        n_workers=2,
        processes=True,
        threads_per_worker=1
    )
    with Client(cluster) as client:
        x = np.linspace(0, 1, 100)
        y = x * x
        plt.plot(x, y)
        print('Computation complete! Stopping workers...')

    print('Execution complete!')

Expanding on skibee's answer, here is a pattern I use. It sets up a temporary LocalCluster and then shuts it down. This is very useful when different parts of your code must be parallelized in different ways (e.g. one needs threads and the other needs processes).

from dask.distributed import Client, LocalCluster
import multiprocessing as mp

with LocalCluster(
    n_workers=int(0.9 * mp.cpu_count()),
    processes=True,
    threads_per_worker=1,
    memory_limit='2GB',
    ip='tcp://localhost:9895',
) as cluster, Client(cluster) as client:
    # Do something using 'client'
    pass

What's happening above:

  • A cluster is spun up on your local machine (i.e. the one running the Python interpreter). The scheduler of this cluster listens on port 9895.

  • The cluster is created and a number of workers are spun up. Each worker is a process, since I specified processes=True.

  • The number of workers spun up is 90% of the number of CPU cores, rounded down. So an 8-core machine will spawn 7 worker processes. This leaves at least one core free for SSH / Notebook server / other applications.

  • Each worker is given a memory limit of 2GB of RAM. Having a temporary cluster lets you spin up workers with different amounts of RAM for different workloads.

  • Once the with block exits, both client.close() and cluster.close() are called (the client first, since context managers in a single with statement exit in reverse order). cluster.close() shuts down the cluster, scheduler, nanny and all workers, and client.close() disconnects the client (created in your Python interpreter) from the cluster.
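The shutdown order in the last bullet follows from how Python handles a with statement that lists several context managers: they are entered left to right and exited right to left. Below is a minimal sketch that uses stand-in context managers rather than real Dask objects; the resource helper and the events list are hypothetical names introduced only for this illustration.

```python
from contextlib import contextmanager

# Records the order in which the stand-in resources are opened and closed.
events = []


@contextmanager
def resource(name):
    """Hypothetical stand-in for LocalCluster / Client: logs open and close."""
    events.append(f"open {name}")
    try:
        yield name
    finally:
        events.append(f"close {name}")


# Same shape as `with LocalCluster(...) as cluster, Client(cluster) as client:`
with resource("cluster") as cluster, resource("client") as client:
    events.append("work")

print(events)
```

The last resource opened is the first one closed, which is why the client disconnects before the cluster it depends on is torn down.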

While the workers are processing, you can check whether the cluster is active by running lsof -i :9895. If there is no output, the cluster has closed.
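If lsof is not available, a similar check can be done from Python by trying to open a TCP connection to the scheduler port. This is only a sketch using the standard socket module; the port_is_open helper is a hypothetical name, and 9895 is the port chosen in the example above.

```python
import socket


def port_is_open(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumed port: the one passed to LocalCluster in the example above.
    print(port_is_open(9895))
```

A result of False after the with block has exited suggests the scheduler is no longer listening.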


Sample use-case: suppose you want to use a pre-trained ML model to predict on 1,000,000 examples.

The model is optimized/vectorized such that it can predict on 10K examples pretty fast, but 1M is slow. In such a case, a setup that works is to load multiple copies of the model from disk, and then use them to predict on chunks of the 1M examples.

Dask allows you to do this pretty easily and achieve a good speedup:

import multiprocessing as mp

import dask.array as DaskArray
import numpy as np
from dask.distributed import Client, LocalCluster

def load_and_predict(input_data_chunk):
    # Runs once per chunk, inside a worker process.
    model_path = '...'  # On your disk, so accessible by all processes.
    model = some_library.load_model(model_path)
    labels, scores = model.predict(input_data_chunk, ...)
    return np.array([labels, scores])

# (not shown) Load `input_data`, a list of your 1M examples.

# Split the input into chunks of 10,000 examples each.
da_input_data = DaskArray.from_array(input_data, chunks=(10_000,))

prediction_results = None
with LocalCluster(
    n_workers=int(0.9 * mp.cpu_count()),
    processes=True,
    threads_per_worker=1,
    memory_limit='2GB',
    ip='tcp://localhost:9895',
) as cluster, Client(cluster) as client:
    prediction_results = da_input_data.map_blocks(load_and_predict).compute()

# Combine prediction_results, which holds the labels and scores
# computed for each chunk of 10,000 examples.
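To make the chunk-then-combine bookkeeping concrete, here is a small sketch that does not use Dask at all: it splits a list into fixed-size chunks, runs a stand-in predictor on each chunk sequentially, and concatenates the per-chunk labels and scores. The chunked and fake_predict helpers are hypothetical and only mimic the shape of load_and_predict; the chunk size of 10 is chosen to keep the example tiny.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_predict(chunk):
    """Hypothetical stand-in for model.predict: returns (labels, scores)."""
    labels = [x % 2 for x in chunk]
    scores = [x / 10 for x in chunk]
    return labels, scores


input_data = list(range(25))  # stand-in for the 1M examples

all_labels, all_scores = [], []
for chunk in chunked(input_data, 10):
    labels, scores = fake_predict(chunk)
    # Combine step: append each chunk's results in input order.
    all_labels.extend(labels)
    all_scores.extend(scores)

print(len(all_labels), len(all_scores))
```

With Dask, the loop body is what runs in parallel on the workers; the combine step still has to preserve the order of the chunks.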
