TL;DR: GPU underutilisation with tfrecords. Questions in bold.
I'm seeing 100% CPU usage and 14% GPU usage. I presume my input pipeline is the bottleneck. Hardware:
I built a single 6 GB tfrecord file using custom software, but I am not using the default tf.train.Example protocol-buffer approach to encode the records of the tfrecord file. Instead I do some bitcast magic myself, which looks like this:
def parse_fn(record):
    # Raw record layout: 8-byte time | 12-byte pair | float32 features | float32 labels
    record = tf.decode_raw(record, tf.uint8, little_endian=True)
    record = tf.reshape(record, (1, 8 + 12 + 4 * num_features + 4 * num_labels))
    time, pair, features, labels = tf.split(
        record, [8, 12, 4 * num_features, 4 * num_labels], axis=1)
    time = tf.bitcast(time, tf.int64, name="time")
    features = tf.bitcast(tf.reshape(features, (num_features, 4)), tf.float32, name="features")
    labels = tf.bitcast(tf.reshape(labels, (num_labels, 4)), tf.float32, name="labels")
    time = tf.reshape(time, ())
    pair = tf.reshape(pair, (-1, 12))
    return time, pair, features, labels
This is the mapper function for the TFRecordDataset, which I create this way:
def create_dataset(filename):
    ds = tf.data.TFRecordDataset(filename)
    ds = ds.map(map_func=parse_fn, num_parallel_calls=2)
    ds = ds.prefetch(buffer_size=16 * 128)
    ds = ds.shuffle(buffer_size=8 * 128)
    ds = ds.batch(batch_size=128)
    return ds
I have two questions on this:

1. Is this decode_raw/bitcast/reshape-based mapper function a problem in terms of speed? Would the tf.train.Example protocol-buffer format be faster?
2. Is the sequence of the calls (map, prefetch, shuffle, batch) in create_dataset() optimal?

And finally, I fear that, due to my mini-batch size of 128 and the fact that I run ±64,000 mini-batches per training epoch, Python spends a lot of time in the training loop. Are there better alternatives, where the TensorFlow C++ backend runs the training loop? My current Python training loop looks like this:
with sess.as_default():
    for k in range(0, 400):  # epoch loop
        sess.run(iterator.initializer, feed_dict={filenames: ["train.tfrecord"]})
        sum_tl = 0
        sum_ll = 0
        sum_tll = 0
        count = 0
        while True:
            try:
                lspeed = 5e-5
                _, _, r_tl, r_ll, r_tll, r_summary = sess.run(
                    [dataset_next, optimizer, target_loss, label_loss,
                     tweaked_label_loss, merged_summary_op],
                    feed_dict={is_training: True, dropout: 0.15,
                               feature_noise_stddev: 0.07,
                               learning_speed: lspeed, l2reg_strength: 2e-5})
                sum_tl += r_tl
                sum_ll += r_ll
                sum_tll += r_tll
                count += 1
                if count % 100 == 0:
                    train_writer.add_summary(r_summary, super_k)
                if count % 5000 == 1:
                    train_writer.flush()
                    print("Epoch " + str(k) + " / mini-batch " + str(count - 1) + " : "
                          + str(sum_tl / count) + " / " + str(np.sqrt(sum_ll / count))
                          + " / " + str(np.sqrt(sum_tll / count)))
            except tf.errors.OutOfRangeError:
                break
        super_k += 1
        batch_rmse = tf.Summary(value=[
            tf.Summary.Value(tag="loss/target_batch", simple_value=sum_tl / count),
            tf.Summary.Value(tag="rmse/batch", simple_value=np.mean(np.sqrt(sum_ll / count))),
            tf.Summary.Value(tag="rmse/batch_0", simple_value=np.sqrt(sum_ll[0] / count)),
            tf.Summary.Value(tag="rmse/batch_1", simple_value=np.sqrt(sum_ll[1] / count)),
            tf.Summary.Value(tag="rmse/batch_2", simple_value=np.sqrt(sum_ll[2] / count)),
            tf.Summary.Value(tag="rmse/batch_3", simple_value=np.sqrt(sum_ll[3] / count)),
            tf.Summary.Value(tag="rmse/batch_4", simple_value=np.sqrt(sum_ll[4] / count)),
            tf.Summary.Value(tag="tweaked_rmse/batch", simple_value=np.mean(np.sqrt(sum_tll / count))),
        ])
        train_writer.add_summary(batch_rmse, super_k)
        print("Epoch " + str(k) + " : " + str(sum_tl / count) + " / "
              + str(np.sqrt(sum_ll / count)) + " / " + str(np.sqrt(sum_tll / count)))
        save()
        predict_test(super_k)
Is this decode_raw/bitcast/reshape-based mapper function a problem in terms of speed? Would the example protocol buffer format be faster?
Not from the looks of it. I don't know exactly how expensive those casts are for TensorFlow internally; you could time the operations to find out for sure, but I would say you're fine on that front.
What I would look at is this repo, which handles parsing a little more smoothly, with less casting. I don't know the make-up of your .tfrecords file, but maybe you can adapt it.
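One way to time the parsing itself without the rest of the graph in the way: reproduce the byte layout that parse_fn assumes in NumPy and benchmark it with timeit. This is only a sketch; num_features, num_labels, and the pair contents below are made-up stand-ins, not the asker's real values.

```python
import struct
import timeit

import numpy as np

# Hypothetical sizes; the real num_features/num_labels are not shown in the question.
num_features, num_labels = 3, 2

# Build one raw record with the layout parse_fn assumes:
# 8-byte little-endian int64 time | 12-byte pair | float32 features | float32 labels
features = np.array([0.5, -1.25, 3.0], dtype="<f4")
labels = np.array([1.0, 0.0], dtype="<f4")
record = (struct.pack("<q", 1234567890) + b"EURUSD-fake "
          + features.tobytes() + labels.tobytes())

def parse_np(record):
    # Same splits as the tf.split / tf.bitcast chain, expressed as zero-copy views
    time = np.frombuffer(record, dtype="<i8", count=1, offset=0)[0]
    pair = record[8:20]
    f = np.frombuffer(record, dtype="<f4", count=num_features, offset=20)
    l = np.frombuffer(record, dtype="<f4", count=num_labels,
                      offset=20 + 4 * num_features)
    return time, pair, f, l

print(timeit.timeit(lambda: parse_np(record), number=100_000))
```

If reinterpreting the bytes is microseconds per record here, the TF bitcast chain is unlikely to be the bottleneck either.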
Is the sequence of the calls (map, prefetch, shuffle, batch) in create_dataset() optimal?
For that one you should take a look at the Input Pipeline Performance Guide in the TF docs. From what I've read here on SO, calling prefetch last shows the best performance. Using all (your 4) CPU cores for the map call is recommended there as well.
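Applied to the create_dataset() from the question, that advice would look roughly like this. It is a sketch, not a measured result: parse_fn is passed in as a parameter because its body belongs to the question, and the buffer sizes are kept as the asker had them.

```python
import tensorflow as tf

def create_dataset(filename, map_fn, num_cores=4):
    # Same four calls as in the question, reordered per the performance guide:
    # map (parallelised) -> shuffle -> batch -> prefetch last
    ds = tf.data.TFRecordDataset(filename)
    ds = ds.map(map_func=map_fn, num_parallel_calls=num_cores)  # use all CPU cores
    ds = ds.shuffle(buffer_size=8 * 128)  # shuffle individual records before batching
    ds = ds.batch(batch_size=128)
    ds = ds.prefetch(buffer_size=1)       # keep one batch ready while the GPU trains
    return ds
```

Prefetching after batch means the pipeline prepares whole batches in the background, which is what hides the input latency from the training step.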
Personally, I would also shard your .tfrecords file into smaller chunks so the reading is less costly. I assume your CPU has its hands full reading from a single 6 GB file, so much so that it slows down the operations you actually care about.
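A sketch of what reading such shards could look like, assuming the 6 GB file were split into files matching a pattern like train-*.tfrecord (the naming is an assumption). Dataset.interleave pulls from several shard files concurrently instead of streaming one giant file:

```python
import tensorflow as tf

def sharded_records(pattern="train-*.tfrecord", cycle_length=4):
    # Shuffle the shard order each epoch, then read several shards concurrently
    files = tf.data.Dataset.list_files(pattern, shuffle=True)
    return files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length,       # this many shard files open at once
        num_parallel_calls=cycle_length)

# afterwards: .map(parse_fn, ...), shuffle, batch, prefetch as before
```

Shuffling the file list also gives you some extra shuffling for free, on top of the record-level shuffle buffer.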