TensorFlow input pipeline performance
TL;DR: GPU underutilisation with tfrecords. Questions in bold.
I'm seeing 100% CPU usage and 14% GPU usage. I presume my input pipeline is the bottleneck.
Hardware:
I built a single 6GB tfrecord file using custom software. However, I am not using the default tf.train.Example protocol buffers approach to encode the records of the tfrecord file. Instead I do some bitcast magic myself, which looks like this:
def parse_fn(record):
    # Each record is a flat little-endian byte string:
    # 8 bytes time | 12 bytes pair | float32 features | float32 labels
    record = tf.decode_raw(record, tf.uint8, little_endian=True)
    record = tf.reshape(record, (1, 8 + 12 + 4 * num_features + 4 * num_labels))
    time, pair, features, labels = tf.split(record, [8, 12, 4 * num_features, 4 * num_labels], axis=1)
    # Reinterpret groups of raw bytes as int64 / float32 values
    time = tf.bitcast(time, tf.int64, name="time")
    features = tf.bitcast(tf.reshape(features, (num_features, 4)), tf.float32, name="features")
    labels = tf.bitcast(tf.reshape(labels, (num_labels, 4)), tf.float32, name="labels")
    time = tf.reshape(time, ())
    pair = tf.reshape(pair, (-1, 12))
    return time, pair, features, labels
This is the mapper function for the TFRecordDataset, which I create this way:
def create_dataset(filename):
    ds = tf.data.TFRecordDataset(filename)
    ds = ds.map(map_func=parse_fn, num_parallel_calls=2)
    ds = ds.prefetch(buffer_size=16 * 128)
    ds = ds.shuffle(buffer_size=8 * 128)
    ds = ds.batch(batch_size=128)
    return ds
I have two questions on this:
1. Is this decode_raw / bitcast / reshape-based mapper function a problem in terms of speed? Would the example protocol buffer format be faster?
2. Is the sequence of the calls (map, prefetch, shuffle, batch) in create_dataset() optimal?
And finally, I fear that, due to my mini-batch size of 128 and the fact that I run ±64000 mini-batches per training epoch, Python spends a lot of time in the training loop. Are there better alternatives for this, where the TensorFlow C++ backend runs the training loop? My current Python training loop looks like this:
with sess.as_default():
    for k in range(0, 400):  # epoch loop
        sess.run(iterator.initializer, feed_dict={filenames: ["train.tfrecord"]})
        sum_tl = 0
        sum_ll = 0
        sum_tll = 0
        count = 0
        while True:
            try:
                lspeed = 5e-5
                _, _, r_tl, r_ll, r_tll, r_summary = sess.run(
                    [dataset_next, optimizer, target_loss, label_loss, tweaked_label_loss, merged_summary_op],
                    feed_dict={is_training: True, dropout: 0.15, feature_noise_stddev: 0.07, learning_speed: lspeed, l2reg_strength: 2e-5})
                sum_tl += r_tl
                sum_ll += r_ll
                sum_tll += r_tll
                count += 1
                if count % 100 == 0:
                    train_writer.add_summary(r_summary, super_k)
                if count % 5000 == 1:
                    train_writer.flush()
                    print("Epoch " + str(k) + " / mini-batch " + str(count-1) + " : " + str(sum_tl/count) + " / " + str(np.sqrt(sum_ll/count)) + " / " + str(np.sqrt(sum_tll/count)))
            except tf.errors.OutOfRangeError:
                break
            super_k += 1
        batch_rmse = tf.Summary(value=[
            tf.Summary.Value(tag="loss/target_batch", simple_value=sum_tl/count),
            tf.Summary.Value(tag="rmse/batch", simple_value=np.mean(np.sqrt(sum_ll/count))),
            tf.Summary.Value(tag="rmse/batch_0", simple_value=np.sqrt(sum_ll[0]/count)),
            tf.Summary.Value(tag="rmse/batch_1", simple_value=np.sqrt(sum_ll[1]/count)),
            tf.Summary.Value(tag="rmse/batch_2", simple_value=np.sqrt(sum_ll[2]/count)),
            tf.Summary.Value(tag="rmse/batch_3", simple_value=np.sqrt(sum_ll[3]/count)),
            tf.Summary.Value(tag="rmse/batch_4", simple_value=np.sqrt(sum_ll[4]/count)),
            tf.Summary.Value(tag="tweaked_rmse/batch", simple_value=np.mean(np.sqrt(sum_tll/count))),
        ])
        train_writer.add_summary(batch_rmse, super_k)
        print("Epoch " + str(k) + " : " + str(sum_tl/count) + " / " + str(np.sqrt(sum_ll/count)) + " / " + str(np.sqrt(sum_tll/count)))
        save()
        predict_test(super_k)
Is this decode_raw/bitcast/reshape-based mapper function a problem in terms of speed? Would the example protocol buffer format be faster?
Not from the looks of it. I don't quite know how expensive casts are for TensorFlow internally; you could time the operations to find out for sure. But I would say you're fine on that front.
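One way to do that timing, as a minimal sketch: a plain-Python helper that measures the mean wall-clock time of any callable. The helper name `time_calls` and its parameters are made up for illustration; the usage in the comments assumes the `sess`, `dataset_next` and `optimizer` names from the question.

```python
import time


def time_calls(step_fn, num_calls=100, warmup=10):
    """Return the mean wall-clock seconds per call of step_fn."""
    # Warm-up calls are excluded so one-off setup costs do not skew the mean.
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_calls):
        step_fn()
    return (time.perf_counter() - start) / num_calls


# Hypothetical usage with the names from the question (TF 1.x session API):
#   input_only = time_calls(lambda: sess.run(dataset_next))
#   full_step = time_calls(lambda: sess.run([dataset_next, optimizer], feed_dict=feeds))
# where `feeds` stands for the feed_dict used in the training loop.
```

If fetching a batch on its own takes nearly as long as a full training step, the input pipeline is the likely bottleneck rather than the model.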
What I would look at is this repo that handles parsing a little smoother with less casting. I don't know the make-up of your .tfrecords file, but maybe you can adapt it.
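For reference, the record layout implied by the question's parse_fn can be sketched with the standard `struct` module: an int64 time, 12 raw bytes for pair, then float32 features and float32 labels, all little-endian. This is only an illustration of that layout, not the asker's custom writer; the values of `NUM_FEATURES` and `NUM_LABELS` and the function names are assumptions.

```python
import struct

NUM_FEATURES = 3  # hypothetical; the question does not state num_features
NUM_LABELS = 2    # hypothetical; the question does not state num_labels

# "<" means little-endian with no padding:
# q = int64 time, 12s = 12 raw bytes for pair, f = float32 values.
RECORD_FORMAT = "<q12s{}f{}f".format(NUM_FEATURES, NUM_LABELS)
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 8 + 12 + 4*NUM_FEATURES + 4*NUM_LABELS


def encode_record(time, pair, features, labels):
    """Pack one record into the flat byte layout that parse_fn splits apart."""
    return struct.pack(RECORD_FORMAT, time, pair, *features, *labels)


def decode_record(payload):
    """Inverse of encode_record, mirroring parse_fn outside of TensorFlow."""
    fields = struct.unpack(RECORD_FORMAT, payload)
    time, pair = fields[0], fields[1]
    features = list(fields[2:2 + NUM_FEATURES])
    labels = list(fields[2 + NUM_FEATURES:])
    return time, pair, features, labels
```

A helper like this can generate small synthetic records for checking a parser or timing it in isolation.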
Is the sequence of the calls (map, prefetch, shuffle, batch) in create_dataset() optimal?
For that one you should take a look at the Input Pipeline Performance Guide on the TF wiki. From what I read here on SO, calling prefetch last shows the best performance. Using all (your 4) CPU cores for the map method is recommended there as well.
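Applied to the question's create_dataset(), that advice might look like the sketch below: map with four parallel calls, then shuffle, then batch, with prefetch as the final step. Passing parse_fn as an argument and prefetching a single batch are assumptions made here for illustration; the best buffer size would need to be measured.

```python
import tensorflow as tf


def create_dataset(filename, parse_fn, batch_size=128, num_cores=4):
    ds = tf.data.TFRecordDataset(filename)
    # Decode records on all available cores (4 in the asker's case).
    ds = ds.map(parse_fn, num_parallel_calls=num_cores)
    # Shuffle individual records before they are grouped into batches.
    ds = ds.shuffle(buffer_size=8 * batch_size)
    ds = ds.batch(batch_size)
    # Prefetch last, so a ready batch is buffered while the previous one trains.
    # A buffer of 1 batch is an assumption, not a tuned value.
    ds = ds.prefetch(buffer_size=1)
    return ds
```

Because prefetch now comes after batch, its buffer counts whole batches rather than single records.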
Personally, I would also shard your .tfrecords file into smaller chunks so the reading is less costly. I assume your CPU has its hands full reading from a 6GB file, so much that it's slowed down when performing the operations you actually care about.
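A minimal sketch of such sharding in plain Python, without TensorFlow. It relies on the TFRecord framing (a uint64 length, a uint32 CRC of the length, the payload, then a uint32 CRC of the payload) and copies each frame unchanged, so the CRCs are neither checked nor recomputed. The function name `shard_tfrecord` and the shard file names are made up for illustration.

```python
import struct


def shard_tfrecord(src_path, shard_paths):
    """Copy records from src_path round-robin into shard_paths; return the record count."""
    outputs = [open(path, "wb") for path in shard_paths]
    count = 0
    try:
        with open(src_path, "rb") as src:
            while True:
                header = src.read(12)  # uint64 length + uint32 CRC of the length
                if not header:
                    break  # clean end of file
                if len(header) < 12:
                    raise ValueError("truncated header at record %d" % count)
                (length,) = struct.unpack("<Q", header[:8])
                body = src.read(length + 4)  # payload + uint32 CRC of the payload
                if len(body) < length + 4:
                    raise ValueError("truncated record at index %d" % count)
                # Write the frame byte-for-byte to the next shard in turn.
                out = outputs[count % len(outputs)]
                out.write(header)
                out.write(body)
                count += 1
    finally:
        for out in outputs:
            out.close()
    return count


# Hypothetical usage:
#   shard_tfrecord("train.tfrecord", ["train-%02d.tfrecord" % i for i in range(16)])
```

The resulting list of shard files can be passed to tf.data.TFRecordDataset in place of the single filename. Round-robin assignment keeps the shards similar in size but changes the record order, which should not matter here since the pipeline shuffles anyway.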