
Distributed Tensorflow, Master stuck while training, workers do not start training, while using SyncReplicasOptimizer and MonitoredTrainingSession?

I am trying to write synchronous training code in distributed TensorFlow using SyncReplicasOptimizer and MonitoredTrainingSession.

The problem I am facing is that the master (chief) hangs after a few training steps, and none of the workers ever start training. Has anyone encountered this before?

This is the code I have written. Data is read from TFRecord files. I have followed the approach described in the TensorFlow documentation.

def build(self):
    self.modelObj = Model(self.imagesize, self.targetSize)
    self.modelObj.model()
    self.global_step = tf.contrib.framework.get_or_create_global_step()
    self.opt = tf.train.AdamOptimizer(self.learningrate)
    if self.syncTraining:
        # Wrap the optimizer so gradients from all replicas are aggregated
        # before a single update is applied to the shared variables.
        self.trainer = tf.train.SyncReplicasOptimizer(self.opt,
                                                      replicas_to_aggregate=self.num_workers,
                                                      total_num_replicas=self.num_workers)
    else:
        self.trainer = self.opt
    self.trainstep = self.trainer.minimize(self.modelObj.loss, global_step=self.global_step)
    self.saver = tf.train.Saver(max_to_keep=1)
    self.summary_op = tf.summary.merge_all()
    self.init_op = tf.global_variables_initializer()
    if self.syncTraining:
        # This hook manages the sync token queue; is_chief must be True on exactly one task.
        self.sync_replicas_hook = self.trainer.make_session_run_hook(is_chief=(self.task_index == 0))


def train(self):
    if self.syncTraining:
        # MonitoredTrainingSession handles variable initialization, checkpointing
        # and the SyncReplicasOptimizer queues via the supplied hook.
        with tf.train.MonitoredTrainingSession(master=self.server.target,
                                               is_chief=(self.task_index == 0),
                                               checkpoint_dir=self.logdir,
                                               hooks=[self.sync_replicas_hook]) as self.session:
            step = 0
            try:
                while not self.session.should_stop():
                    # Fetch a batch from the input pipeline, then run one training step.
                    [trainx, trainy_] = self.session.run([self.trainx, self.trainy_])
                    feed = {self.modelObj.x: trainx, self.modelObj.y_: trainy_,
                            self.modelObj.batch: self.batch_size, self.modelObj.keep_prob: 0.7}
                    _, trainloss = self.session.run([self.trainstep, self.modelObj.loss], feed_dict=feed)

                    print("step: %d, training loss %f" % (step, trainloss))

                    step += 1

            except tf.errors.OutOfRangeError:
                print('training finished, number of epochs reached')

Yes, the ps should not be placed on a GPU. I had this problem too, and I solved it by explicitly setting ps_device="/job:ps/cpu:0" in tf.train.replica_device_setter. The whole code looks like:

with tf.device(tf.train.replica_device_setter(
        ps_device="/job:ps/cpu:0",
        worker_device="/job:worker/task:%d" % worker_index,
        cluster=cluster_spec)):
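
For context, here is a minimal sketch of how the device setter above is typically wired into the rest of the cluster setup. The host/port values, the hard-coded job_name/task_index, and the placeholder graph line are illustrative assumptions, not taken from the original post:

import tensorflow as tf

# Assumed cluster layout, for illustration only.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
job_name = "worker"   # normally passed in via command-line flags
task_index = 0

server = tf.train.Server(cluster_spec, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # the parameter server only serves variables
else:
    # Pin variables to the ps job on CPU; all other ops run on this worker.
    with tf.device(tf.train.replica_device_setter(
            ps_device="/job:ps/cpu:0",
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster_spec)):
        # Build the model graph here, e.g. the build() method from the question.
        global_step = tf.train.get_or_create_global_step()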

Many thanks to @prateek agrawal

Found a solution.

Delay the start of the chief worker by adding

time.sleep(5)

Also do the same for the parameter server, and try running the parameter server on CPU instead of GPU.
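
Putting the two suggestions together, a minimal sketch of the process startup might look like the following. The cluster addresses and the hard-coded job_name/task_index are illustrative assumptions; in the original setup they would come from flags:

import os
import time
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})
job_name = "worker"   # or "ps"; hard-coded here only for brevity
task_index = 0

if job_name == "ps":
    # Per the answer: keep the parameter server on CPU and also delay it slightly.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
    time.sleep(5)
    server = tf.train.Server(cluster_spec, job_name="ps", task_index=task_index,
                             config=tf.ConfigProto(device_count={"GPU": 0}))
    server.join()
else:
    server = tf.train.Server(cluster_spec, job_name="worker", task_index=task_index)
    if task_index == 0:
        # Delay the chief so the ps and the other workers are up before it
        # creates the MonitoredTrainingSession and the sync queues.
        time.sleep(5)
    # ... build the graph and call train() as in the question ...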
