
How to use distributed TensorFlow on remote machines?

I am trying to run a distributed TensorFlow script across three machines: my local machine running the parameter server, and two remote machines I have access to running the worker jobs. I am following this example from the TensorFlow documentation, passing the IP addresses and unique port numbers for each worker job, and setting the protocol option in tf.train.Server to 'grpc'. However, when I run the script, all three processes are started on my localhost, and none of the jobs run on the remote machines. Is there a step I am missing?

My (abridged) code:

import tensorflow as tf

# Define flags
tf.app.flags.DEFINE_string("ps_hosts", "localhost:2223",
                           "comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts",
                           "server1.com:2224,server2.com:2225",
                           "comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("job_name", "worker", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")

FLAGS = tf.app.flags.FLAGS

ps_hosts = FLAGS.ps_hosts.split(",")
worker_hosts = FLAGS.worker_hosts.split(",")

# Build the cluster spec and start a gRPC server for this task
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index, protocol='grpc')

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    # Between-graph replication
    with tf.device(tf.train.replica_device_setter(
            cluster=cluster,
            worker_device="/job:worker/task:{}".format(FLAGS.task_index))):
        # Create model...
        sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                                 logdir="./checkpoint",
                                 init_op=init_op,
                                 summary_op=summary,
                                 saver=saver,
                                 global_step=global_step,
                                 save_model_secs=600)

        with sv.managed_session(server.target,
                                config=config_proto) as sess:
            # Train model...

This code causes two problems:

  1. Both of the worker jobs give errors about not getting a response from the other:

From worker0:

2018-04-09 23:48:39.749679: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1

From worker1:

2018-04-09 23:49:30.439166: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
  2. I can get rid of the first problem by using a device_filter (see the sketch below), but all of the jobs still start on my local machine, not on the remote servers.
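
For reference, the device filter goes into the tf.ConfigProto that is passed to managed_session. A minimal sketch of what this looks like (the exact filter strings are my best guess for the cluster above):

# Sketch: limit each task's session to the ps devices and its own worker task,
# so CreateSession no longer waits for the other worker to respond.
config_proto = tf.ConfigProto(
    device_filters=["/job:ps",
                    "/job:worker/task:{}".format(FLAGS.task_index)])
# config_proto is then passed to sv.managed_session(server.target, config=config_proto)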

How do I get the two worker jobs to run on the remote servers?

My understanding is that you have to run this script yourself on every host of your cluster:

with "--job_name=ps" on the parameter server, and

with "--job_name=worker --task_index=0" and "--job_name=worker --task_index=1" on the two workers.

TensorFlow does not launch the remote processes for you; tf.train.Server only starts a server on the machine where the script is run, which is why all of your jobs ended up on localhost.
