Very low GPU usage during training in Tensorflow

I am trying to train a simple multi-layer perceptron for a 10-class image classification task, which is a part of the assignment for the Udacity Deep-Learning course. To be more precise, the task is to classify letters rendered from various fonts (the dataset is called notMNIST).

The code I ended up with looks fairly simple, but no matter what I do, I always get very low GPU usage during training. I measure the load with GPU-Z and it shows just 25-30%.

Here is my current code:

import tensorflow as tf
from tensorflow.contrib.data import Dataset

graph = tf.Graph()
with graph.as_default():
    tf.set_random_seed(52)

    # dataset definition
    dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.batch(128)
    iterator = dataset.make_initializable_iterator()
    sample = iterator.get_next()
    x = sample['x']
    y = sample['y']

    # actual computation graph
    keep_prob = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool, name='is_training')

    fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
    fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
    fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
    logits = dense(fc3, NUM_CLASSES, 'logits')

    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(
            tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
        )
        accuracy_percent = 100 * accuracy

    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        # ensures that we execute the update_ops before performing the train_op
        # needed for batch normalization (apparently)
        train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    step = 0
    epoch = 0
    while True:
        sess.run(iterator.initializer, feed_dict={})
        while True:
            step += 1
            try:
                sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
            except tf.errors.OutOfRangeError:
                logger.info('End of epoch #%d', epoch)
                break

        # end of epoch
        train_l, train_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
        )
        test_l, test_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
        )
        logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
        logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)

        epoch += 1

Here's what I tried so far:

  1. I changed the input pipeline from a simple feed_dict to tensorflow.contrib.data.Dataset. As far as I understood, it is supposed to take care of the efficiency of the input, e.g. load data in a separate thread, so there should not be any bottleneck associated with the input (see the pipeline sketch after this list).

  2. I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659 However, these traces didn't really show anything interesting. More than 90% of the training step is matmul operations.

  3. I changed the batch size. When I change it from 128 to 512 the load increases from ~30% to ~38%; when I increase it further to 2048, the load goes to ~45%. I have 6 GB of GPU memory and the dataset consists of single-channel 28x28 images. Am I really supposed to use such a big batch size? Should I increase it further?
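Regarding item 1: one thing worth checking is whether the pipeline actually overlaps input preparation with the GPU work. Here is a minimal sketch of the same pipeline with an explicit prefetch added, assuming a TensorFlow version whose Dataset API provides prefetch (the buffer of one batch is an arbitrary choice here):

dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
dataset = dataset.shuffle(buffer_size=20000)
dataset = dataset.batch(128)
# keep one batch prepared in the background while the GPU trains on the previous one
dataset = dataset.prefetch(1)
iterator = dataset.make_initializable_iterator()

Since train_data already lives in memory as tensors, the input pipeline is unlikely to be the main limiter here, but the prefetch at least makes the overlap explicit.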

Generally, should I worry about the low load? Is it really a sign that I am training inefficiently?

Here is a GPU-Z screenshot with 128 images in the batch. You can see low load with occasional spikes to 100% when I measure accuracy on the entire dataset after each epoch.

[Screenshot: GPU load]

MNIST-size networks are tiny and it's hard to achieve high GPU (or CPU) efficiency for them; I think 30% is not unusual for your application. You will get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you will also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models like yours, the statistical efficiency drops off very quickly after a batch size of about 100, so it's probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.
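To put numbers on the computational-efficiency side of that trade-off, it can help to time raw throughput (examples per second) at each batch size. Below is a minimal sketch, assuming the sess, train_op, iterator and feed_dict values from the question's training loop; measure_throughput is a hypothetical helper, not part of either code base:

import time

def measure_throughput(sess, train_op, iterator, feed, batch_size, warmup=10, steps=50):
    # Rough images/sec of the training step; assumes the dataset yields at
    # least warmup + steps batches (add .repeat() to the dataset if it doesn't).
    sess.run(iterator.initializer)
    for _ in range(warmup):   # let startup costs and autotuning settle
        sess.run(train_op, feed_dict=feed)
    start = time.time()
    for _ in range(steps):
        sess.run(train_op, feed_dict=feed)
    return steps * batch_size / (time.time() - start)

Comparing images/sec at 128, 512 and 2048 against how many epochs each needs to reach the same validation accuracy makes the computational-vs-statistical trade-off concrete.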

On my nVidia GTX 1080, if I use a convolutional neural network on the MNIST database, the GPU load is ~68%.

If I switch to a simple, non-convolutional network, then the GPU load is ~20%.

You can replicate these results by building successively more advanced models in the tutorial Building Autoencoders in Keras by François Chollet.
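For reference, here is a rough sketch of the two kinds of model being compared, written with tf.keras rather than the standalone Keras used in that tutorial; the layer sizes are arbitrary and only meant to show the difference in per-batch work:

import tensorflow as tf

def dense_model():
    # Small fully connected classifier: very little arithmetic per batch, so low GPU load.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

def conv_model():
    # Convolutional classifier: far more arithmetic per batch, so noticeably higher GPU load.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0

for build in (dense_model, conv_model):
    model = build()
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # Watch the GPU load (e.g. in GPU-Z) while each of these trains for one epoch.
    model.fit(x_train, y_train, batch_size=128, epochs=1)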
