Local Dask scheduler failing to connect to workers on remote resource

Question

How do I specify the correct address of Dask workers on a remote resource to a Dask scheduler running locally?

Situation

I have a remote resource I can ssh into. There, I have a Docker container that runs an image containing all the dependencies I need to run Dask and Distributed.

When run, the container executes the following:

dask-worker --nprocs 14 --nthreads 1 {inet_addr_local}:878

In the same network, but on my laptop, I run another container of the same image. In this container, I run the Dask scheduler, like so:

dask-scheduler --port 8786

When I start up the scheduler, everything is fine. When I start up the container of workers, it seems to connect to the scheduler. In the status I see the following:

Waiting to connect to: tcp://{this_matches_inet_address_of_local}:8786

On the scheduler, I see the following logged repeatedly, in a loop as it continually tries to contact/respond to each of the workers:

distributed.scheduler - INFO - Remove worker tcp://172.18.0.10:41508
distributed.scheduler - INFO - Removed worker tcp://172.18.0.10:41508
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://172.18.0.10:44590': Timed out trying to connect to 'tcp://172.18.0.10:44590' after 3 s: OSError: [Errno 113] No route to host

The issue (I think) can be seen here: tcp://172.18.0.10 is incorrect. The workers are running on a resource, db.foo.net, that I can ssh into via me@db.foo.net.
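A quick way to sanity-check that suspicion is to test whether the address in the scheduler log falls in a private range, which is what one would expect from a container-internal bridge address. The following is a minimal sketch using only the Python standard library; the helper name worker_host_is_private is hypothetical and not part of Dask.

```python
import ipaddress
from urllib.parse import urlparse


def worker_host_is_private(address: str) -> bool:
    """Return True if the host part of a tcp://host:port address is a private IP.

    Hypothetical helper for illustration only; it is not part of Dask.
    """
    host = urlparse(address).hostname
    return ipaddress.ip_address(host).is_private


# Address copied from the scheduler log above.
print(worker_host_is_private("tcp://172.18.0.10:44590"))
```

A private address here means a scheduler on another machine cannot route to it unless the two share that network, which is consistent with the "No route to host" error.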

From the scheduler container, I can ping db.foo.net successfully. I think the workers assume their address is the local address of the container they run in, rather than db.foo.net. I need to override this default through some worker configuration. I thought the --host option would do it, but that causes Tornado to throw the following error: OSError: [Errno 99] Cannot assign requested address.

Answer

Dask workers need to be able to contact the scheduler at the address given to them. It sounds like this isn't happening for you. There are many possible reasons, most of them related to your network. A few possibilities:

  1. You've mistyped the address (for example, I noticed that you used port 878 in one place in your question and port 8786 in another).
  2. Your network doesn't allow communication on certain ports (check with your system administrator).
  3. Your Docker containers aren't set up to publish ports externally (you may need to do some Docker wiring or use the host network explicitly).

Unfortunately, there isn't much that Dask itself can do to help you identify these network issues. You might try running other services on the relevant ports and seeing whether you can reproduce the lack of connectivity with common tools such as ping, or by serving on the port with python -m http.server 8786 and connecting to it from the other machine.
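If you would rather test from Python, a plain TCP connection attempt gives the same kind of signal as those tools. This is only a sketch using the standard library, not anything Dask provides; the function name can_connect and the 3-second default timeout are assumptions chosen for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration only; it is not part of Dask.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, DNS failures and
        # "No route to host" (errno 113).
        return False
```

For example, calling can_connect("db.foo.net", 8786) from the scheduler container, and the reverse check from the worker container against the scheduler's address, would show which direction is blocked.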
