
Tensorflow: Using Parameter Servers in Distributed Training

It's not totally clear how parameter servers know what to do in distributed TensorFlow training.

For example, in this SO question, the following code is used to configure parameter server and worker tasks:

# `server` is a tf.train.Server built from the cluster spec, and
# FLAGS.job_name says which role this process plays.
if FLAGS.job_name == "ps":
    server.join()  # block and serve variable requests
elif FLAGS.job_name == "worker":
    ##some training code

How does server.join() indicate that the given task should be a parameter server? Is parameter serving a kind of default behavior for tasks? Is there anything else you can or should tell a parameter-serving task to do?

Edit: This SO question addresses part of my question: "The logic there makes sure that Variable objects are assigned evenly to workers that act as parameter servers." But how does a parameter server know it is a parameter server? Is server.join() enough?

TL;DR: TensorFlow doesn't know anything about "parameter servers", but instead it supports running graphs across multiple devices in different processes. Some of these processes have devices whose names start with "/job:ps", and these hold the variables. The workers drive the training process, and when they run the train_op they will cause work to happen on the "/job:ps" devices, which will update the shared variables.
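For concreteness, the cluster is described by a mapping from job names to task addresses, and each task's devices are named after its job and index. A minimal sketch of that naming scheme (the hosts and ports below are invented for illustration; in TF 1.x the dict would be handed to tf.train.ClusterSpec):

```python
# Hypothetical cluster layout: one "ps" task that holds the variables and
# two "worker" tasks that drive training. Addresses are made up.
cluster = {
    "ps":     ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
}

def device_prefix(job_name, task_index):
    # A task's devices are named from its job name and task index,
    # e.g. "/job:ps/task:0" -- this is where its variables and ops live.
    return "/job:%s/task:%d" % (job_name, task_index)

print(device_prefix("ps", 0))      # /job:ps/task:0
print(device_prefix("worker", 1))  # /job:worker/task:1
```

Nothing in the name "/job:ps" is special to the runtime; it only becomes a "parameter server" because the variables get placed on it.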

The server.join() method simply tells TensorFlow to block and listen for requests until the server shuts down (which currently means it blocks forever, or until you kill the process, since clean shutdown isn't currently implemented).

In the example in my previous answer, the PS tasks are passive, and everything is controlled by the worker tasks... in ## some training code. If you split your code across multiple devices, TensorFlow will add the appropriate communication, and this extends to devices in different processes. The with tf.device(tf.train.replica_device_setter(...)): block tells TensorFlow to put each variable on a different PS task by setting its device to "/job:ps/task:{i}" (for different values of {i}, chosen in a round-robin fashion).
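The round-robin placement can be sketched in plain Python. This is a simulation of what replica_device_setter's default strategy does, not TensorFlow's actual implementation:

```python
def round_robin_ps_device(var_index, num_ps_tasks):
    # The default strategy cycles through the PS tasks, so the i-th
    # variable lands on task (i mod num_ps_tasks).
    return "/job:ps/task:%d" % (var_index % num_ps_tasks)

# With 2 PS tasks, four variables alternate between task 0 and task 1.
devices = [round_robin_ps_device(i, 2) for i in range(4)]
print(devices)
# ['/job:ps/task:0', '/job:ps/task:1', '/job:ps/task:0', '/job:ps/task:1']
```

Because the placement is deterministic, every worker that builds the same graph with the same replica_device_setter arguments agrees on where each variable lives.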

When you call sess.run(train_op), TensorFlow will run a graph that depends on the variables and includes the operations that update them. This part of the computation will happen on the "/job:ps" devices, so those devices will act like a parameter server.
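Conceptually, each training step is therefore a remote update of state held by the PS task. A toy single-process simulation of that flow (plain Python, with a fabricated gradient; this is not the TensorFlow runtime, just the shape of the interaction):

```python
class ParameterServerTask:
    """Toy stand-in for a PS task: it only holds and updates variables."""
    def __init__(self, initial_params):
        self.params = dict(initial_params)

    def apply_gradients(self, grads, learning_rate=0.1):
        # The update itself runs "on the PS", mirroring how the update ops
        # are placed on /job:ps devices in the real graph.
        for name, g in grads.items():
            self.params[name] -= learning_rate * g

ps = ParameterServerTask({"w": 1.0})

def worker_step(ps_task):
    # The worker reads the current value, computes a gradient, and sends
    # it back to the PS; here the gradient is for the toy loss w**2.
    w = ps_task.params["w"]
    grad = 2.0 * w  # d(loss)/dw for loss = w**2
    ps_task.apply_gradients({"w": grad})

worker_step(ps)
print(ps.params["w"])  # approximately 0.8 (1.0 - 0.1 * 2.0)
```

The key point the answer makes is that the worker never "is" a parameter server and the PS never "is" a trainer; the roles fall out of where the graph's variables and update ops are placed.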

