What is the “right” way to close a Dask LocalCluster?
I am trying to use dask-distributed on my laptop with a LocalCluster, but I have not yet found a way to let my application close without raising warnings or triggering strange interactions with matplotlib (I am using the tkAgg backend).
For example, if I close the client and then the cluster, in that order, tk cannot cleanly remove the image from memory and I get the following error:
Traceback (most recent call last):
  File "/opt/Python-3.6.0/lib/python3.6/tkinter/__init__.py", line 3501, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
The following code generates this error:
from time import sleep

import numpy as np
import matplotlib.pyplot as plt
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    cluster = LocalCluster(
        n_workers=2,
        processes=True,
        threads_per_worker=1
    )
    client = Client(cluster)

    x = np.linspace(0, 1, 100)
    y = x * x
    plt.plot(x, y)

    print('Computation complete! Stopping workers...')
    client.close()
    sleep(1)
    cluster.close()

    print('Execution complete!')
The sleep(1) line makes the problem more likely to appear, as it does not occur at every execution.
Any other combination I tried to stop the execution (not closing the client, not closing the cluster, or closing neither) produces problems with tornado instead, usually the following:
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
What is the right combination to stop the local cluster and the client? Am I missing something?
These are the libraries that I am using:
Thank you for your help!
In our experience, the best way is to use a context manager, for example:
import numpy as np
import matplotlib.pyplot as plt
from dask.distributed import Client, LocalCluster

if __name__ == '__main__':
    cluster = LocalCluster(
        n_workers=2,
        processes=True,
        threads_per_worker=1
    )
    # The client is closed automatically when the with block exits.
    with Client(cluster) as client:
        x = np.linspace(0, 1, 100)
        y = x * x
        plt.plot(x, y)
        print('Computation complete! Stopping workers...')

    print('Execution complete!')
Expanding on skibee's answer, here is a pattern I use. It sets up a temporary LocalCluster and then shuts it down. This is very useful when different parts of your code must be parallelized in different ways (e.g. one needs threads and the other needs processes).
from dask.distributed import Client, LocalCluster
import multiprocessing as mp

with LocalCluster(n_workers=int(0.9 * mp.cpu_count()),
                  processes=True,
                  threads_per_worker=1,
                  memory_limit='2GB',
                  ip='tcp://localhost:9895',
                  ) as cluster, Client(cluster) as client:
    # Do something using 'client'
    pass
What's happening above:
A cluster is being spun up on your local machine (i.e. the one running the Python interpreter). The scheduler of this cluster is listening on port 9895.
The cluster is created and a number of workers are spun up. Each worker is a process, since I specified processes=True.
The number of workers spun up is 90% of the number of CPU cores, rounded down. So an 8-core machine will spawn 7 worker processes. This leaves at least one core free for SSH / Notebook server / other applications.
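The rounding in that expression is easy to check by hand. Below is a minimal sketch; the helper name worker_count and the floor of one worker are my own additions, not part of the pattern above. I added the floor because int(0.9 * 1) evaluates to 0 on a single-core machine, which is presumably not what you want to pass as n_workers.

```python
import multiprocessing as mp


def worker_count(n_cores=None, fraction=0.9):
    """Return `fraction` of the cores, rounded down, but never less than 1.

    `worker_count` is a hypothetical helper, not a Dask function.
    """
    if n_cores is None:
        n_cores = mp.cpu_count()
    return max(1, int(fraction * n_cores))


print(worker_count(8))  # 7, as in the 8-core example above
print(worker_count(1))  # 1 (the bare expression would give 0)
```

The result could then be passed as n_workers=worker_count() instead of the inline expression.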
Each worker is initialized with a 2GB memory limit (memory_limit='2GB'). Having a temporary cluster allows you to spin up workers with different amounts of RAM for different workloads.
Once the with block exits, both cluster.close() and client.close() are called. cluster.close() closes the cluster, scheduler, nanny and all workers, and client.close() disconnects the client (created in your Python interpreter) from the cluster.
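One detail worth knowing is the order of teardown: a with statement that lists several context managers exits them in reverse order, so the client is closed before the cluster. The sketch below shows this with a stand-in class (Resource is hypothetical and only records calls; it does not start any Dask processes).

```python
class Resource:
    """Stand-in for LocalCluster / Client that only logs enter and exit."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __enter__(self):
        self.log.append(f"enter {self.name}")
        return self

    def __exit__(self, exc_type, exc, tb):
        self.log.append(f"exit {self.name}")
        return False  # do not swallow exceptions


def run_demo():
    log = []
    with Resource("cluster", log) as cluster, Resource("client", log) as client:
        log.append("work")
    return log


print(run_demo())
# ['enter cluster', 'enter client', 'work', 'exit client', 'exit cluster']
```

This is the same order as calling client.close() and then cluster.close() by hand, but it also runs if the body raises.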
While the workers are processing, you can check whether the cluster is active by running lsof -i :9895. If there is no output, the cluster has closed.
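If lsof is not available (e.g. on Windows), a rough equivalent from Python is to try a TCP connection to the scheduler port. This is only a sketch: is_port_open is a hypothetical helper of mine, and it tells you that something accepts connections on the port, not that it is a Dask scheduler.

```python
import socket


def is_port_open(port, host="localhost", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


# 9895 is the scheduler port used in the example above.
print(is_port_open(9895))
```

While the with block is running this should report True; after the block exits it should report False.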
Sample use-case: suppose you want to use a pre-trained ML model to predict on 1,000,000 examples.
The model is optimized/vectorized such that it can predict on 10K examples pretty fast, but 1M is slow. In such a case, a setup that works is to load multiple copies of the model from disk, and then use them to predict on chunks of the 1M examples.
Dask allows you to do this pretty easily and achieve a good speedup:
import multiprocessing as mp

import numpy as np
import dask.array as DaskArray
from dask.distributed import Client, LocalCluster

def load_and_predict(input_data_chunk):
    model_path = '...'  # On your disk, so accessible by all processes.
    # `some_library` stands for whichever library your model comes from.
    model = some_library.load_model(model_path)
    labels, scores = model.predict(input_data_chunk, ...)
    return np.array([labels, scores])

# (not shown) Load `input_data`, a list of your 1M examples.
da_input_data = DaskArray.from_array(input_data, chunks=(10_000,))

prediction_results = None
with LocalCluster(n_workers=int(0.9 * mp.cpu_count()),
                  processes=True,
                  threads_per_worker=1,
                  memory_limit='2GB',
                  ip='tcp://localhost:9895',
                  ) as cluster, Client(cluster) as client:
    prediction_results = da_input_data.map_blocks(load_and_predict).compute()

# Combine prediction_results, which will be a list of Numpy arrays,
# each with labels, scores for 10,000 examples.
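To make the chunking concrete: with chunks=(10_000,), the 1M examples are split into consecutive blocks of 10,000, and load_and_predict runs once per block. The sketch below reproduces just that index arithmetic in plain Python; chunk_bounds is a hypothetical helper for illustration and is not how Dask implements it internally.

```python
def chunk_bounds(n_items, chunk_size):
    """Return (start, stop) index pairs that cover range(n_items) in order.

    The last pair is shorter when n_items is not a multiple of chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, chunk_size)]


bounds = chunk_bounds(1_000_000, 10_000)
print(len(bounds))            # 100
print(bounds[0], bounds[-1])  # (0, 10000) (990000, 1000000)
```

So in this use-case the model is loaded 100 times in total, spread across the worker processes.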
References:
Client close method: https://distributed.dask.org/en/latest/api.html#distributed.Client.close
Scheduler close method, which from my understanding is what is invoked by cluster.close(): https://distributed.dask.org/en/latest/scheduling-state.html#distributed.scheduler.Scheduler.close
with statement having multiple variables: https://stackoverflow.com/a/1073814/4900327