
Local Dask scheduler failing to connect to workers on remote resource

Question

How do I specify the correct address of Dask workers on a remote resource to a Dask scheduler running locally?

Situation

I have a remote resource I can ssh into. There, I have a docker container that runs an image containing all the dependencies I need to run Dask and Distributed.

When run, the container executes the following:

dask-worker --nprocs 14 --nthreads 1 {inet_addr_local}:878

In the same network, but on my laptop, I run another container of the same image. In this container, I run the Dask scheduler, like so:

dask-scheduler --port 8786

When I start up the scheduler, everything is fine. When I start up the container of workers, it seems to connect to the scheduler. In the status I see the following:

Waiting to connect to: tcp://{this_matches_inet_address_of_local}:8786

On the scheduler, I see the following logged repeatedly, in a loop, as it continually tries to contact/respond to each of the workers:

distributed.scheduler - INFO - Remove worker tcp://172.18.0.10:41508
distributed.scheduler - INFO - Removed worker tcp://172.18.0.10:41508
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://172.18.0.10:44590': Timed out trying to connect to 'tcp://172.18.0.10:44590' after 3 s: OSError: [Errno 113] No route to host

The issue (I think) can be seen here: tcp://172.18.0.10 is incorrect. The workers are running on a resource, db.foo.net, that I can ssh into via me@db.foo.net.

From the scheduler container, I can ping db.foo.net successfully. I think that the workers are assuming their address is the local address of the container they are in, and not db.foo.net. I need to override this default with some sort of configuration for the workers. I thought the --host option would do it, but that causes Tornado to throw the following error: OSError: [Errno 99] Cannot assign requested address.

Dask workers need to be able to contact the scheduler with the address given to them. It sounds like this isn't happening for you. This could be for many reasons related to your network. A couple of possibilities:

  1. You've mis-typed the address (for example, I noticed that you used port 878 in one place in your question and port 8786 in another)
  2. Your network doesn't allow communication on certain ports (check with your system administrator)
  3. Your docker containers aren't set up to publish ports externally (you may need to do some docker-wiring or use the host network explicitly)
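
As a rough illustration of the third possibility: the worker address in the question's log, tcp://172.18.0.10:44590, is a private address, which is consistent with (though not proof of) the worker advertising its container-internal IP. The sketch below uses only the Python standard library; the function name is_private_address is made up for this example and is not part of Dask.

```python
import ipaddress
from urllib.parse import urlsplit


def is_private_address(address: str) -> bool:
    """Return True if an address like 'tcp://172.18.0.10:44590' uses a private IP.

    Hypothetical helper for illustration; not a Dask API.
    """
    host = urlsplit(address).hostname
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        # The host is a name (e.g. db.foo.net) rather than an IP literal,
        # so this check cannot say anything about it.
        return False


# Worker address copied from the scheduler log in the question.
print(is_private_address("tcp://172.18.0.10:44590"))  # True
```

A private address is only reachable from machines that have a route to that private network, so a scheduler on a different host would typically see "No route to host" unless the container's ports are published or the container uses the host network.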

Unfortunately there isn't much that Dask itself can do to help you identify these network issues. You might try running other services on the relevant ports and seeing if you can recreate the lack of connectivity with common tools like ping or python -m http.server 8786.
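
One way to run that kind of check without Dask in the picture is a plain TCP connection attempt. This is a minimal sketch using the standard library; can_connect is a name invented here, and the 3-second default simply mirrors the timeout shown in the scheduler log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; not a Dask API.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures and "No route to host".
        return False
```

Run it in both directions: from inside the worker container, call can_connect with the scheduler's address and port 8786; from inside the scheduler container, call it with the worker address and port reported in the log (for example "172.18.0.10" and 44590). If either call returns False while a listener is known to be running on the other side, the problem lies in the network or Docker port setup rather than in Dask.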
