TensorFlow input pipeline for distributed training

Question

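I am trying to figure out how to set up an input pipeline for TensorFlow in distributed training. It is not clear whether the readers read from a single process and send the data to all workers, or whether each server starts its own input pipeline. How do we ensure that each worker gets different input?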

Answer 1

Here is an example of how this can be done:

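import tensorflow as tf

batch_size = 50
task_index = 2      # index of this worker
num_workers = 10    # total number of workers
num_epochs = 10     # assumption: used below but not defined in the original
input_pattern = "gs://backet/dir/part-00*"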

Get the names of all files in the bucket that match input_pattern:

files_names = tf.train.match_filenames_once(
                input_pattern, name = "myFiles")

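Select the file names for worker task_index. tf.strided_slice works like list slicing, a[task_index::num_workers]: starting at index task_index, worker task_index takes every num_workers-th file.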

to_process = tf.strided_slice(files_names, [task_index],
                 [999999999], strides=[num_workers])
filename_queue = tf.train.string_input_producer(to_process,
                     shuffle=True,  # shuffle the file order
                     num_epochs=num_epochs)
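
The strided slice above behaves like the plain Python slice files[task_index::num_workers]. Here is a minimal sketch in plain Python (no TensorFlow) that checks the shards are disjoint and together cover every file. The helper shard_files and the file names are hypothetical, used only for illustration.

```python
def shard_files(file_names, task_index, num_workers):
    """Return every num_workers-th file name, starting at task_index."""
    if not 0 <= task_index < num_workers:
        raise ValueError("task_index must be in [0, num_workers)")
    # Sort first so that every worker sees the same ordering (an assumption
    # of this sketch; the shards only stay disjoint if the order is shared).
    return sorted(file_names)[task_index::num_workers]


# Hypothetical file names, for illustration only.
all_files = ["part-%05d" % i for i in range(25)]
num_workers = 10

shards = [shard_files(all_files, i, num_workers) for i in range(num_workers)]
print(shards[2])  # ['part-00002', 'part-00012', 'part-00022']
```

Each worker only needs its own task_index and the shared num_workers, so no coordination between workers is required as long as they all see the file list in the same order.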

reader = tf.TextLineReader()
_ , value = reader.read(filename_queue)
col1,col2 = tf.decode_csv(value,
        record_defaults=[[1],[1]], field_delim="\t")
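
To make the decode step concrete, here is a rough plain-Python counterpart of what the call above does for one line: split on tabs, convert each field to an integer, and fall back to the default 1 for an empty field. The function parse_line is a hypothetical helper, and this sketch ignores CSV quoting rules.

```python
def parse_line(line, defaults=(1, 1), delim="\t"):
    """Split one line into integer columns, using a default for empty fields."""
    fields = line.rstrip("\n").split(delim)
    if len(fields) != len(defaults):
        raise ValueError(
            "expected %d fields, got %d" % (len(defaults), len(fields)))
    # An empty field takes the default of its column.
    return tuple(int(f) if f != "" else d for f, d in zip(fields, defaults))


print(parse_line("3\t7"))  # (3, 7)
print(parse_line("3\t"))   # (3, 1)
```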

train_inputs, train_labels = tf.train.shuffle_batch([col1,[col2]],
        batch_size=batch_size,
        capacity=50*batch_size,
        num_threads=10,
        min_after_dequeue = 10*batch_size,
        allow_smaller_final_batch = True)
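
As a rough mental model of the shuffling batcher above, the sketch below keeps a buffer partly full and draws records from it at random. It is a simplified plain-Python illustration, not the TensorFlow queue implementation: it has no threads and no capacity limit, and the name shuffle_batches is hypothetical.

```python
import random


def shuffle_batches(records, batch_size, min_after_dequeue, seed=None):
    """Yield batches drawn at random from a buffer that is kept partly full."""
    rng = random.Random(seed)
    buffer = []
    batch = []
    for record in records:
        buffer.append(record)
        # Only dequeue once the buffer holds more than min_after_dequeue items,
        # so that each draw has enough candidates to mix well.
        if len(buffer) > min_after_dequeue:
            batch.append(buffer.pop(rng.randrange(len(buffer))))
            if len(batch) == batch_size:
                yield batch
                batch = []
    # Input exhausted: drain whatever is left in the buffer.
    while buffer:
        batch.append(buffer.pop(rng.randrange(len(buffer))))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Counterpart of allow_smaller_final_batch=True.
        yield batch


batches = list(shuffle_batches(range(23), batch_size=5,
                               min_after_dequeue=10, seed=0))
print([len(b) for b in batches])  # [5, 5, 5, 5, 3]
```

A larger min_after_dequeue gives better mixing at the cost of more memory and a longer warm-up before the first batch.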

loss = f(...,train_inputs, train_labels)
optimizer = ...

with tf.train.MonitoredTrainingSession(...) as mon_sess:
    coord = tf.train.Coordinator()
    with coord.stop_on_exception():
        _ = tf.train.start_queue_runners(sess = mon_sess, coord=coord)
        while not coord.should_stop() and not mon_sess.should_stop():
            # Run the training op through the monitored session
            mon_sess.run(optimizer)

I am not sure this is the best way to implement the input pipeline in distributed TensorFlow, because every worker reads the names of all the files in the bucket.
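
One possible way around listing the whole bucket on every worker, assuming the shard count and naming scheme are known in advance, is to build each worker's file names directly. The sketch below assumes 100 shards with five-digit zero-padded suffixes. That naming scheme, the shard count, and the helper worker_files are assumptions, not something the example above guarantees.

```python
def worker_files(prefix, num_shards, task_index, num_workers):
    """Build this worker's file names without listing the bucket."""
    return ["%s%05d" % (prefix, i)
            for i in range(task_index, num_shards, num_workers)]


# Hypothetical values: 100 shards named part-00000 ... part-00099.
files = worker_files("gs://backet/dir/part-", num_shards=100,
                     task_index=2, num_workers=10)
print(files[:3])
```

The resulting list could then be passed to the filename queue in place of the sliced match result.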

A good lecture on input pipelines in TensorFlow: http://web.stanford.edu/class/cs20si/lectures/notes_09.pdf

Solution 1
1 2017-08-05 09:04:39
