FLAGS and parsers in tensorflow distributed training

So I was trying to learn about distributed training in TensorFlow. To practice, I was working through the code from https://github.com/hn826/distributed-tensorflow/blob/master/distributed-deep-mnist.py

import argparse
import sys

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

FLAGS = None

def deepnn(x):
  """Builds a two-conv-layer CNN for MNIST; returns (logits, keep_prob)."""
  # Reshape the flat 784-pixel input into 28x28 grayscale images.
  x_image = tf.reshape(x, [-1, 28, 28, 1])

  # First convolutional layer: 32 feature maps.
  W_conv1 = weight_variable([5, 5, 1, 32])
  b_conv1 = bias_variable([32])
  h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

  # First pooling layer: downsample 28x28 -> 14x14.
  h_pool1 = max_pool_2x2(h_conv1)

  # Second convolutional layer: 64 feature maps.
  W_conv2 = weight_variable([5, 5, 32, 64])
  b_conv2 = bias_variable([64])
  h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)

  # Second pooling layer: downsample 14x14 -> 7x7.
  h_pool2 = max_pool_2x2(h_conv2)

  # Fully connected layer: map 7*7*64 features to 1024 units.
  W_fc1 = weight_variable([7 * 7 * 64, 1024])
  b_fc1 = bias_variable([1024])

  h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
  h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  # Dropout, controlled by a placeholder so it can be disabled at eval time.
  keep_prob = tf.placeholder(tf.float32)
  h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  # Output layer: map 1024 features to 10 class logits.
  W_fc2 = weight_variable([1024, 10])
  b_fc2 = bias_variable([10])

  y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
  return y_conv, keep_prob

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

def main(_):
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      # Import data
      mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)

      # Build Deep MNIST model...
      x = tf.placeholder(tf.float32, [None, 784])
      y_ = tf.placeholder(tf.float32, [None, 10])
      y_conv, keep_prob = deepnn(x)

      cross_entropy = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))

      global_step = tf.contrib.framework.get_or_create_global_step()

      train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy, global_step=global_step)
      correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
      accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # The StopAtStepHook handles stopping after running given steps.
    hooks=[tf.train.StopAtStepHook(last_step=1000)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           checkpoint_dir=FLAGS.log_dir,
                                           hooks=hooks) as mon_sess:
      i = 0
      while not mon_sess.should_stop():
        # Run a training step asynchronously.
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
          train_accuracy = mon_sess.run(accuracy, feed_dict={
              x: batch[0], y_: batch[1], keep_prob: 1.0})
          print('global_step %s, task:%d_step %d, training accuracy %g'
                % (tf.train.global_step(mon_sess, global_step), FLAGS.task_index, i, train_accuracy))
        mon_sess.run(train_step, feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
        i = i + 1

if __name__ == "__main__":
  parser = argparse.ArgumentParser()
  parser.register("type", "bool", lambda v: v.lower() == "true")
  # Flags for defining the tf.train.ClusterSpec
  parser.add_argument(
      "--ps_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--worker_hosts",
      type=str,
      default="",
      help="Comma-separated list of hostname:port pairs"
  )
  parser.add_argument(
      "--job_name",
      type=str,
      default="",
      help="One of 'ps', 'worker'"
  )
  # Flags for defining the tf.train.Server
  parser.add_argument(
      "--task_index",
      type=int,
      default=0,
      help="Index of task within the job"
  )
  # Flags for specifying input/output directories
  parser.add_argument(
      "--data_dir",
      type=str,
      default="/tmp/mnist_data",
      help="Directory for storing input data")
  parser.add_argument(
      "--log_dir",
      type=str,
      default="/tmp/train_logs",
      help="Directory for train logs")
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

I have understood most of it, except for some concepts.

Firstly, about FLAGS. As far as I understand, the tasks and the workers are all defined in it, but I am confused about how.

Secondly, about the parsers. What are they, and why do we use them here? I have realized that calling parser.add_argument() gives you options when running the code from the terminal.

I guess parser and FLAGS are somehow connected, so knowing what they do would probably shoo away all the question marks in my head.

Firstly, about FLAGS. As far as I understand, the tasks and the workers are all defined in it, but I am confused about how.

Yes, this is the standard way to run TensorFlow in a distributed setting (your particular case is the between-graph replication strategy). Basically, the same script starts different nodes (workers, parameter servers, etc.), which perform the training together. This tutorial discusses the various distribution strategies in TensorFlow and explains well how each one translates to code.

Here's an example of how you can work with this script. Start 4 processes (2 ps servers and 2 workers):

# On ps0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
     --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
     --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
     --job_name=worker --task_index=1

Secondly, about the parsers. What are they, and why do we use them here?

It's the Python way to deal with command-line arguments: argparse. The various options let you specify the type and bounds of each argument (and thus define a validator), assign actions, and much more (check out the documentation for the available features). The parser then takes the command-line string and sets all the variables with just one call:

FLAGS, unparsed = parser.parse_known_args()
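
In the script above, the namespace returned by parse_known_args() is assigned to the module-level FLAGS variable, which is why main() can later read FLAGS.ps_hosts, FLAGS.job_name, and so on; anything the parser does not recognize ends up in unparsed and is forwarded to tf.app.run(). Here is a minimal, self-contained sketch of that mechanism (the flag values are made up for illustration):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--job_name", type=str, default="", help="One of 'ps', 'worker'")
parser.add_argument("--task_index", type=int, default=0, help="Index of task within the job")

# Pretend the command line was:  python trainer.py --job_name=worker --task_index=1 --foo=bar
FLAGS, unparsed = parser.parse_known_args(["--job_name=worker", "--task_index=1", "--foo=bar"])

print(FLAGS.job_name)    # 'worker'      -- each flag becomes an attribute on FLAGS
print(FLAGS.task_index)  # 1             -- converted to an int because of type=int
print(unparsed)          # ['--foo=bar'] -- unknown flags are passed through instead of raising an error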
