TensorFlow input pipeline performance

Question

TL;DR: GPU underutilisation with tfrecords. Questions in bold.

I'm seeing 100% CPU usage and 14% GPU usage. I presume my input pipeline is the bottleneck. Hardware:

Intel i5-4460 @ 3.20GHz (4 cores)
NVIDIA GeForce GTX 1050 Ti

I built a single 6GB tfrecord file using custom software. However, I am not using the default tf.train.Example protocol buffers approach to encode the records of the tfrecord file.

Instead I do some bitcast magic myself, which looks like this:

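import tensorflow as tf

# num_features and num_labels are defined elsewhere in the script.
def parse_fn(record):
    # Each record is a flat little-endian byte string:
    # 8 bytes time (int64) | 12 bytes pair | float32 features | float32 labels
    record = tf.decode_raw(record, tf.uint8, little_endian=True)
    record = tf.reshape(record, (1, 8 + 12 + 4 * num_features + 4 * num_labels))
    time, pair, features, labels = tf.split(record, [8, 12, 4 * num_features, 4 * num_labels], axis=1)
    # Reinterpret groups of raw bytes as typed values.
    time = tf.bitcast(time, tf.int64, name="time")
    features = tf.bitcast(tf.reshape(features, (num_features, 4)), tf.float32, name="features")
    labels = tf.bitcast(tf.reshape(labels, (num_labels, 4)), tf.float32, name="labels")

    time = tf.reshape(time, ())       # scalar timestamp
    pair = tf.reshape(pair, (-1, 12))  # pair stays as 12 raw bytes

    return time, pair, features, labels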

This is the mapper function for the TFRecordDataset, which I create this way:

def create_dataset(filename):
    ds = tf.data.TFRecordDataset(filename)
    ds = ds.map(map_func=parse_fn, num_parallel_calls=2)
    ds = ds.prefetch(buffer_size=16 * 128)
    ds = ds.shuffle(buffer_size=8 * 128)
    ds = ds.batch(batch_size=128)
    return ds

I have two questions on this:

Is this decode_raw / bitcast / reshape-based mapper function a problem in terms of speed? Would the Example protocol buffer format be faster?
Is the sequence of the calls (map, prefetch, shuffle, batch) in create_dataset() optimal?
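For readers who want to check the custom record format outside TensorFlow, the byte layout that parse_fn assumes (8-byte int64 time, 12 raw pair bytes, then float32 features and float32 labels, all little-endian) can be sketched with the standard struct module. This is only an illustration of the layout: the helper names pack_record and unpack_record are hypothetical and are not part of the original software.

```python
import struct

def pack_record(time, pair, features, labels):
    """Build one record: int64 time | 12 raw bytes | float32 features | float32 labels."""
    if len(pair) != 12:
        raise ValueError("pair must be exactly 12 bytes")
    return (struct.pack("<q", time)
            + bytes(pair)
            + struct.pack("<%df" % len(features), *features)
            + struct.pack("<%df" % len(labels), *labels))

def unpack_record(record, num_features, num_labels):
    """Decode one record using the same offsets that parse_fn splits on."""
    expected = 8 + 12 + 4 * num_features + 4 * num_labels
    if len(record) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(record)))
    (time,) = struct.unpack_from("<q", record, 0)
    pair = record[8:20]
    features = list(struct.unpack_from("<%df" % num_features, record, 20))
    labels = list(struct.unpack_from("<%df" % num_labels, record, 20 + 4 * num_features))
    return time, pair, features, labels
```

Decoding a handful of records this way and comparing them with what the TensorFlow mapper returns is a cheap sanity check before looking at speed.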

And finally, I fear that, due to my mini-batch size of 128 and the fact that I run approximately 64000 mini-batches per training epoch, Python spends a lot of time in the training loop. Are there better alternatives for this, where the TensorFlow C++ backend runs the training loop? My current Python training loop looks like this:

with sess.as_default():
    for k in range(0, 400): #epoch loop
        sess.run(iterator.initializer, feed_dict={filenames: ["train.tfrecord"]})
        sum_tl = 0
        sum_ll = 0
        sum_tll = 0
        count = 0
        while True:
            try:
                lspeed = 5e-5
                _, _, r_tl, r_ll, r_tll, r_summary = sess.run([dataset_next, optimizer, target_loss, label_loss, tweaked_label_loss, merged_summary_op],
                                                        feed_dict={is_training: True, dropout: 0.15, feature_noise_stddev: 0.07, learning_speed: lspeed, l2reg_strength: 2e-5})
                sum_tl += r_tl
                sum_ll += r_ll
                sum_tll += r_tll
                count += 1
                if count % 100 == 0:
                    train_writer.add_summary(r_summary, super_k)
                if count % 5000 == 1:
                    train_writer.flush()
                    print("Epoch " + str(k) + " / mini-batch " + str(count-1) + " : " + str(sum_tl/count) + " / " + str(np.sqrt(sum_ll/count)) + " / " + str(np.sqrt(sum_tll/count)))
            except tf.errors.OutOfRangeError:
                break
            super_k += 1
        batch_rmse = tf.Summary(value=[
            tf.Summary.Value(tag="loss/target_batch",  simple_value=sum_tl/count), 
            tf.Summary.Value(tag="rmse/batch",         simple_value=np.mean(np.sqrt(sum_ll/count))), 
            tf.Summary.Value(tag="rmse/batch_0",       simple_value=np.sqrt(sum_ll[0]/count)), 
            tf.Summary.Value(tag="rmse/batch_1",       simple_value=np.sqrt(sum_ll[1]/count)), 
            tf.Summary.Value(tag="rmse/batch_2",       simple_value=np.sqrt(sum_ll[2]/count)), 
            tf.Summary.Value(tag="rmse/batch_3",       simple_value=np.sqrt(sum_ll[3]/count)), 
            tf.Summary.Value(tag="rmse/batch_4",       simple_value=np.sqrt(sum_ll[4]/count)), 
            tf.Summary.Value(tag="tweaked_rmse/batch", simple_value=np.mean(np.sqrt(sum_tll/count))), 
        ])
        train_writer.add_summary(batch_rmse, super_k)
        print("Epoch " + str(k) + " : " + str(sum_tl/count) + " / " + str(np.sqrt(sum_ll/count)) + " / " + str(np.sqrt(sum_tll/count)))
        save()
        predict_test(super_k)

Answer 1

Is this decode_raw/bitcast/reshape-based mapper function a problem in terms of speed? Would the Example protocol buffer format be faster?

Not from the looks of it. I don't know how expensive casts are inside TensorFlow; you could time the operations to find out for sure. But I would say you're fine on that front.
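One way to do that timing is to drain the input pipeline on its own, without the training op, and count how many elements per second it delivers. The sketch below is plain Python and assumes nothing about TensorFlow; the helper names iterate_until and measure_throughput are hypothetical. In the session-based code from the question, the fetch callable would be something like `lambda: sess.run(dataset_next)` and the stop exception `tf.errors.OutOfRangeError`.

```python
import itertools
import time

def iterate_until(fetch, stop_exc):
    """Yield fetch() results until fetch raises stop_exc."""
    while True:
        try:
            yield fetch()
        except stop_exc:
            return

def measure_throughput(items, max_items=1000):
    """Consume up to max_items elements; return (count, elapsed_seconds)."""
    count = 0
    start = time.perf_counter()
    for _ in itertools.islice(items, max_items):
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed
```

If the pipeline alone cannot deliver batches much faster than the full training step consumes them, the input side is the bottleneck; running the same measurement with and without the map step shows how much the parsing itself costs.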

What I would look at is this repo, which handles parsing a little more smoothly with less casting. I don't know the make-up of your .tfrecords file, but maybe you can adapt it.

Is the sequence of the calls (map, prefetch, shuffle, batch) in create_dataset() optimal?

For that one you should take a look at the Input Pipeline Performance Guide on the TensorFlow site. From what I read here on SO, calling prefetch last shows the best performance. The guide also recommends using all of your CPU cores (4 in your case) for the map method.
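Applied to the create_dataset() from the question, those two suggestions might look like the sketch below. It keeps the original map, shuffle, batch order and only moves prefetch to the end and raises num_parallel_calls to 4. The extra parameters (parse_fn, batch_size, num_cores) and the prefetch depth of one batch are assumptions made for illustration, not values taken from the guide.

```python
import tensorflow as tf

def create_dataset(filename, parse_fn, batch_size=128, num_cores=4):
    ds = tf.data.TFRecordDataset(filename)
    # Parse records in parallel, one call per CPU core.
    ds = ds.map(map_func=parse_fn, num_parallel_calls=num_cores)
    # Shuffle individual parsed examples before they are grouped.
    ds = ds.shuffle(buffer_size=8 * batch_size)
    ds = ds.batch(batch_size=batch_size)
    # prefetch comes last, so the buffer holds ready-made batches
    # (here one batch) while the previous batch is being trained on.
    ds = ds.prefetch(buffer_size=1)
    return ds
```

Because prefetch now follows batch, its buffer_size counts whole batches rather than single examples, which is why it is much smaller than the 16 * 128 used in the question.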

Personally, I would also shard your .tfrecords file into smaller chunks so the reading is less costly. I assume your CPU has its hands full reading from a 6GB file, so much so that it slows down when performing the operations you actually care about.
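A minimal sketch of such a split, assuming the file uses the standard TFRecord framing (an 8-byte little-endian length, a 4-byte length checksum, the payload, and a 4-byte payload checksum). It copies each framed record verbatim into one of several shard files in round-robin order and does not verify the checksums; the function name split_tfrecord and the shard file names are hypothetical.

```python
import struct

def split_tfrecord(src_path, shard_paths):
    """Copy framed records round-robin into shards; return records per shard."""
    counts = [0] * len(shard_paths)
    outs = [open(p, "wb") for p in shard_paths]
    try:
        with open(src_path, "rb") as src:
            index = 0
            while True:
                header = src.read(8)  # uint64 payload length
                if not header:
                    break  # clean end of file
                if len(header) != 8:
                    raise ValueError("truncated record header")
                (length,) = struct.unpack("<Q", header)
                # length CRC (4 bytes) + payload + payload CRC (4 bytes)
                rest = src.read(4 + length + 4)
                if len(rest) != length + 8:
                    raise ValueError("truncated record body")
                shard = index % len(outs)
                outs[shard].write(header + rest)
                counts[shard] += 1
                index += 1
    finally:
        for f in outs:
            f.close()
    return counts
```

For example, `split_tfrecord("train.tfrecord", ["train-%02d.tfrecord" % i for i in range(8)])` would produce eight shards, and the resulting list of file names can be passed to tf.data.TFRecordDataset in place of the single file.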

Solution 1
0 2018-09-08 17:52:56