
Tensorflow: Using Parameter Servers in Distributed Training

It's not totally clear how parameter servers know what to do in distributed TensorFlow training.

For example, in this SO question, the following code is used to configure parameter server and worker tasks:

# `server` is a tf.train.Server built from the cluster spec, and
# FLAGS.job_name says which role this process plays.
if FLAGS.job_name == "ps":
    server.join()  # block and serve variable requests
elif FLAGS.job_name == "worker":
    ##some training code

How does server.join() indicate that the given task should be a parameter server? Is parameter serving a kind of default behavior for tasks? Is there anything else you can or should tell a parameter-serving task to do?

Edit: This SO question addresses part of my question: "The logic there makes sure that Variable objects are assigned evenly to workers that act as parameter servers." But how does a parameter server know it is a parameter server? Is server.join() enough?

TL;DR: TensorFlow doesn't know anything about "parameter servers", but instead it supports running graphs across multiple devices in different processes. Some of these processes have devices whose names start with "/job:ps", and these hold the variables. The workers drive the training process, and when they run the train_op they will cause work to happen on the "/job:ps" devices, which will update the shared variables.
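For concreteness, the cluster is described by a mapping from job names to task addresses, and each task's devices are named after its job and index. A minimal sketch of that naming scheme (the hosts and ports below are invented for illustration; in TF 1.x the dict would be handed to tf.train.ClusterSpec):

```python
# Hypothetical cluster layout: one "ps" task that holds the variables and
# two "worker" tasks that drive training. Addresses are made up.
cluster = {
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
}

def device_prefix(job_name, task_index):
    # A task's devices are named from its job name and task index,
    # e.g. "/job:ps/task:0" -- this is where its variables and ops live.
    return "/job:%s/task:%d" % (job_name, task_index)

print(device_prefix("ps", 0))      # /job:ps/task:0
print(device_prefix("worker", 1))  # /job:worker/task:1
```

Nothing in the name "/job:ps" is special to the runtime; it only becomes a "parameter server" because the variables get placed on it.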

The server.join() method simply tells TensorFlow to block and listen for requests until the server shuts down (which currently means it blocks forever, or until you kill the process, since clean shutdown isn't currently implemented).

In the example in my previous answer, the PS tasks are passive, and everything is controlled by the worker tasks... in ## some training code. If you split your code across multiple devices, TensorFlow will add the appropriate communication, and this extends to devices in different processes. The with tf.device(tf.train.replica_device_setter(...)): block tells TensorFlow to put each variable on a different PS task by setting its device to "/job:ps/task:{i}" (for different values of {i}, chosen in a round-robin fashion).
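The round-robin placement can be sketched in plain Python. This is a simulation of what replica_device_setter's default strategy does, not TensorFlow's actual implementation:

```python
def round_robin_ps_device(var_index, num_ps_tasks):
    # The default strategy cycles through the PS tasks, so the i-th
    # variable lands on task (i mod num_ps_tasks).
    return "/job:ps/task:%d" % (var_index % num_ps_tasks)

# With 2 PS tasks, four variables alternate between task 0 and task 1.
devices = [round_robin_ps_device(i, 2) for i in range(4)]
print(devices)
# ['/job:ps/task:0', '/job:ps/task:1', '/job:ps/task:0', '/job:ps/task:1']
```

Because the placement is deterministic, every worker that builds the same graph with the same replica_device_setter arguments agrees on where each variable lives.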

When you call sess.run(train_op), TensorFlow will run a graph that depends on the variables and includes the operations that update them. This part of the computation will happen on the "/job:ps" devices, so those devices will act like a parameter server.
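Conceptually, each training step is therefore a remote update of state held by the PS task. A toy single-process simulation of that flow (plain Python, with a fabricated gradient; this is not the TensorFlow runtime, just the shape of the interaction):

```python
class ParameterServerTask:
    """Toy stand-in for a PS task: it only holds and updates variables."""
    def __init__(self, initial_params):
        self.params = dict(initial_params)

    def apply_gradients(self, grads, learning_rate=0.1):
        # The update itself runs "on the PS", mirroring how the update ops
        # are placed on /job:ps devices in the real graph.
        for name, g in grads.items():
            self.params[name] -= learning_rate * g

ps = ParameterServerTask({"w": 1.0})

def worker_step(ps_task):
    # The worker reads the current value, computes a gradient, and sends
    # it back to the PS; here the gradient is for the toy loss w**2.
    w = ps_task.params["w"]
    grad = 2.0 * w  # d(loss)/dw for loss = w**2
    ps_task.apply_gradients({"w": grad})

worker_step(ps)
print(ps.params["w"])  # approximately 0.8 (1.0 - 0.1 * 2.0)
```

The key point the answer makes is that the worker never "is" a parameter server and the PS never "is" a trainer; the roles fall out of where the graph's variables and update ops are placed.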

