Preprocessing input data slows down the input pipeline when reading TFRecords files with the Tensorflow Dataset API

Question

I am using the Tensorflow Dataset API to read TFRecords files, but GPU usage is still low (10%). I reckon the cause is that I preprocess the data before it is fed into sess.run(). Here is my code.
1. Create a dataset from 3 separate files.

tf.reset_default_graph()

# The content of TFRecords files is that each row is an array. Calculate total rows.
n_total_row = sum(1 for _ in tf.python_io.tf_record_iterator(epd))

def get_epd_dataset(filename):
    dataset = tf.data.TFRecordDataset(filename)
    def _parse_function(example_proto):
        keys_to_features = {'data':tf.VarLenFeature(tf.int64)}
        parsed_features = tf.parse_single_example(example_proto, keys_to_features)
        return tf.sparse_tensor_to_dense(parsed_features['data'])
    # Parse the record into tensors.
    dataset = dataset.map(_parse_function)
    return dataset

# There are 3 essential files comprising the input data. Read the 3 separate
# files "epd", "y_id", "x_feat" into 3 separate datasets, and
# use `Dataset.zip()` to combine them into 1 dataset.
epd_ds = get_epd_dataset(epd)
n_lexicon, id_ds = get_id_dataset(y_id)
feat_ds = get_feat_dataset(x_feat)
data_ds = tf.data.Dataset.zip((feat_ds, epd_ds, id_ds))

# Shuffle the dataset
data_ds = data_ds.shuffle(buffer_size=n_total_row, reshuffle_each_iteration=True)
# Repeat the input for `epoch` epochs
data_ds = data_ds.repeat(epoch)
# Generate batches
data_ds = data_ds.batch(1)
# Create a one-shot iterator
iterator = data_ds.make_one_shot_iterator()
data_iter = iterator.get_next()

2. Build a Tensorflow graph.

n_input = DIM*(LEFT+1+RIGHT)
n_classes = n_lexicon

mlp = MultiLayerPerceptron.MultiLayerPerceptron(DIM*(LEFT+1+RIGHT), n_lexicon)
# tf Graph input
X = tf.placeholder("float", [None, n_input])
Y = tf.placeholder("float", [None, n_classes])
logits = mlp.multilayer_perceptron(X, dropout_mode)
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y), name='loss_op')
optimizer = tf.train.AdamOptimizer(learning_rate=lr)
train_op = optimizer.minimize(loss_op, name='train_op')

3. Generate data from data_iter and run the TF session.

sess = tf.Session()
# Initialization
sess.run(tf.global_variables_initializer())
for e in range(1, epoch+1):
    while True:
        try:
            # Get data from dataset iterator 
            tmp = sess.run([data_iter])[0]
            # a, b, c are one row each from the 3 separate files.
            a = tmp[0].flatten()
            b = tmp[1].flatten()
            c = tmp[2].flatten()

            # I believe this step slows down my input pipeline.
            x_train, y_train = _data_generate(mlp, b, d, c)
            _, c = sess.run([train_op, loss_op], feed_dict={X: x_train,
                                                            Y: y_train})
        except tf.errors.OutOfRangeError:
            break
sess.close()

My code reaches about 10~15% GPU usage. I think the cause is that _data_generate() spends too much time processing numpy arrays, but I don't know how to improve my pipeline. Here are my questions.

According to the Tensorflow performance guide and Importing Data, I think using the Dataset API with TFRecords files is my best option for solving this low-GPU-usage problem. Or should I use Python multithreading to feed data into a buffer first and then feed that data to sess.run()? I didn't choose the latter solution because this website mentions that:

We found that using tf.FIFOQueue and tf.train.queue_runner could not saturate multiple current generation GPUs when using large inputs and processing with higher samples per second,

I think that putting _data_generate() in _parse_function() may solve this problem, because then Tensorflow rather than Python would handle the preprocessing. But I don't know how to do this, since _data_generate() needs 3 rows from 3 separate files. Does anyone know how to do this?
Are there other methods that could solve my low-GPU-usage problem?

Thank you.

Answer 1

Can you share the code of the _data_generate function? I can't see what it does.

As you pointed out, performance is likely lost because of RAM <-> GPU memory transfers and because Tensorflow ops are mixed with Python ones.

Instead of running the iterator data_iter yourself with sess.run(), doing numpy operations, and then running a training step, pass data_iter as the input to your neural network graph; it should replace the placeholders. (Just write a function that constructs the graph with data_iter as a parameter.)

"I think that putting _data_generate() in _parse_function() may solve this problem, because Tensorflow handles the preprocessing part but not Python. But I don't know how to do this since _data_generate() needs 3 rows from 3 separate files. Does anyone know how to do this?"

The proper way is to create 3 datasets from the files, decode them, zip them, and then pass the iterator over the zipped dataset as the input to the processing graph. You're almost doing that.
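One way to read this advice, as a minimal sketch rather than the asker's actual pipeline: zip the three datasets, apply a single map() function that receives one row from each, and build the model directly on the tensors returned by get_next(). The in-memory arrays, the sizes, and the body of preprocess() are hypothetical stand-ins (the real _data_generate() was never shown), and the import assumes TensorFlow 1.14+ or 2.x running in v1 compatibility mode.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF 1.14+ or 2.x in v1 mode

tf.disable_v2_behavior()

N_INPUT, N_CLASSES = 4, 3  # hypothetical sizes, not taken from the question

# In-memory stand-ins for the three parsed TFRecord datasets.
feat_ds = tf.data.Dataset.from_tensor_slices(
    np.arange(24, dtype=np.float32).reshape(6, N_INPUT))
epd_ds = tf.data.Dataset.from_tensor_slices(np.ones((6, 2), dtype=np.int64))
id_ds = tf.data.Dataset.from_tensor_slices(
    np.array([0, 1, 2, 0, 1, 2], dtype=np.int64))


def preprocess(feat, epd, label_id):
    """Hypothetical TF-op replacement for _data_generate().

    It receives one row from each of the three zipped datasets.
    """
    x = feat * tf.cast(tf.reduce_sum(epd), tf.float32)  # placeholder transform
    y = tf.one_hot(label_id, N_CLASSES)
    return x, y


data_ds = (tf.data.Dataset.zip((feat_ds, epd_ds, id_ds))
           .map(preprocess)
           .batch(2))
x_batch, y_batch = data_ds.make_one_shot_iterator().get_next()

# The iterator outputs are the model inputs: no placeholders, no feed_dict.
w = tf.get_variable("w", [N_INPUT, N_CLASSES],
                    initializer=tf.zeros_initializer())
logits = tf.matmul(x_batch, w)
loss_op = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_batch))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss_op)

losses = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    while True:
        try:
            # Each call pulls the next batch from the dataset by itself.
            _, loss = sess.run([train_op, loss_op])
            losses.append(loss)
        except tf.errors.OutOfRangeError:
            break
```

Because map() on a zipped dataset unpacks the tuple into separate arguments, preprocess() sees all three rows at once, which is what _data_generate() needs.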

Also, try to use multithreading wherever it is possible and needed. Here:

...
        return tf.sparse_tensor_to_dense(parsed_features['data'])
    # Parse the record into tensors.
    dataset = dataset.map(_parse_function)
    return dataset

You should use:

dataset.map(_parse_function, num_parallel_calls=<MORE THAN ONE>)

where <MORE THAN ONE> is an integer greater than one. (num_parallel_calls is the name of this argument in tf.data; the older tf.contrib.data API called it num_threads.) In your case I would start with 8 and see whether GPU usage reaches 100%.
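A small sketch of the parallel map, assuming TensorFlow 1.14+ or 2.x in v1 compatibility mode. To keep it self-contained, serialized tf.train.Example records are built in memory and used in place of tf.data.TFRecordDataset(filename); the helper make_serialized_example() and the sample rows are made up for illustration. The parse function is the one from the question.

```python
import tensorflow.compat.v1 as tf  # assumption: TF 1.14+ or 2.x in v1 mode

tf.disable_v2_behavior()


def make_serialized_example(values):
    """Hypothetical helper: serialize one int64 row under the key 'data'."""
    feature = {
        "data": tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


# Stand-in for tf.data.TFRecordDataset(filename): records held in memory.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
records = tf.data.Dataset.from_tensor_slices(
    [make_serialized_example(r) for r in rows])


def _parse_function(example_proto):
    keys_to_features = {"data": tf.VarLenFeature(tf.int64)}
    parsed_features = tf.parse_single_example(example_proto, keys_to_features)
    return tf.sparse_tensor_to_dense(parsed_features["data"])


# Parse up to 8 records concurrently; the element order is preserved.
dataset = records.map(_parse_function, num_parallel_calls=8)
next_row = dataset.make_one_shot_iterator().get_next()

parsed_rows = []
with tf.Session() as sess:
    try:
        while True:
            parsed_rows.append(sess.run(next_row).tolist())
    except tf.errors.OutOfRangeError:
        pass
```

Whether 8 is the right value depends on the number of CPU cores and on how expensive the map function is, so it is worth trying a few values while watching GPU usage.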

Try this out and tell me whether it helps.

Answer 2

I'm assuming your example uses a simplified version of your model; otherwise the GPU will almost always finish its work before the next batch is ready.

Each dataset and transformation pipeline has its own specificities, so it's difficult to give a definite answer, but here are some points worth investigating:

1. You are not supposed to compute the values of data_iter explicitly, and you should not use placeholders and feed_dict anymore. The values returned by .make_one_shot_iterator().get_next() are already the input nodes of your model, not placeholder variables. This glues the input pipeline to your model and avoids sending data back and forth between Tensorflow CPU, Python and Tensorflow GPU memory regions. In your session.run() call you need not specify any input value; the model will automatically feed itself from the dataset as needed.
2. I don't think TFRecord files are supposed to be fast on random access, because the element size is unknown. In your case, you also seem to use random access three times: for epd, feats and the metadata.
3. Did you instruct Tensorflow to use prefetching somewhere? Otherwise Tensorflow may wait for the current batch to be processed before loading and processing the next one.
4. You might want to check whether the multiple workers of the map function are waiting for some resource (disk IO, CPU time, a locked object?).
5. The tf.data API is tricky to debug and not very practical (seriously, how could it not know anything about tensors?), so you may want to have a look at the tensorpack library or the old Tensorflow Queue data loading API.
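For the prefetching point, a minimal sketch, again assuming TensorFlow 1.14+ or 2.x in v1 compatibility mode. It also shows one possible fallback that nobody in this thread proposed: if the NumPy code in _data_generate() cannot easily be rewritten with Tensorflow ops, it can be wrapped in tf.py_func inside map(). The function numpy_preprocess() below is a hypothetical stand-in, since the real code was not posted.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF 1.14+ or 2.x in v1 mode

tf.disable_v2_behavior()


def numpy_preprocess(row):
    """Hypothetical NumPy-only stand-in for _data_generate()."""
    return (row * 2).astype(np.float32)


def wrap(row):
    # Run the NumPy function as a graph op inside the input pipeline.
    out = tf.py_func(numpy_preprocess, [row], tf.float32)
    out.set_shape([2])  # py_func drops static shape information
    return out


rows = tf.data.Dataset.from_tensor_slices(
    np.arange(12, dtype=np.int64).reshape(6, 2))

dataset = (rows
           .map(wrap, num_parallel_calls=4)
           .batch(3)
           .prefetch(1))  # prepare the next batch while the current one is used
next_batch = dataset.make_one_shot_iterator().get_next()

batches = []
with tf.Session() as sess:
    try:
        while True:
            batches.append(sess.run(next_batch))
    except tf.errors.OutOfRangeError:
        pass
```

Putting prefetch() last lets the pipeline work on the next batch while the training step consumes the current one. Note that code wrapped in tf.py_func still runs in the Python interpreter and is subject to the GIL, so it will probably scale worse across parallel calls than native Tensorflow ops.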

2 answers

Answer 1
0 2017-12-27 16:52:08

Answer 2
0 2018-03-28 09:20:30
