Very low GPU usage during training in TensorFlow

Question

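I am trying to train a simple multi-layer perceptron for a 10-class image classification task, which is part of the Udacity Deep Learning course. More precisely, the task is to classify letters rendered in various fonts (the dataset is called notMNIST).

The code I ended up with looks fairly simple, but no matter what I do, GPU usage during training is always very low. I measure the load with GPU-Z, and it shows only 25-30%.

Here is my current code: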

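import logging

import tensorflow as tf
from tensorflow.contrib.data import Dataset

logger = logging.getLogger(__name__)

# train_data, train_labels, test_data, test_labels, NUM_CLASSES and the helpers
# dense_batch_relu_dropout() and dense() are defined elsewhere (not shown in the question).

graph = tf.Graph()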
with graph.as_default():
    tf.set_random_seed(52)

    # dataset definition
    dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.batch(128)
    iterator = dataset.make_initializable_iterator()
    sample = iterator.get_next()
    x = sample['x']
    y = sample['y']

    # actual computation graph
    keep_prob = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool, name='is_training')

    fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
    fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
    fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
    logits = dense(fc3, NUM_CLASSES, 'logits')

    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(
            tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
        )
        accuracy_percent = 100 * accuracy

    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        # ensures that we execute the update_ops before performing the train_op
        # needed for batch normalization (apparently)
        train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    step = 0
    epoch = 0
    while True:
        sess.run(iterator.initializer, feed_dict={})
        while True:
            step += 1
            try:
                sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
            except tf.errors.OutOfRangeError:
                logger.info('End of epoch #%d', epoch)
                break

        # end of epoch
        train_l, train_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
        )
        test_l, test_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
        )
        logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
        logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)

        epoch += 1

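Here is what I have tried so far:

I changed the input pipeline from a simple feed_dict to tensorflow.contrib.data.Dataset. As far as I understand, it is supposed to take care of input efficiency, for example by loading data in a separate thread. So there should not be any input-related bottleneck.
I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659 However, the traces did not show anything interesting: more than 90% of the train step is matmul operations.
I changed the batch size. When I change it from 128 to 512, the load increases from ~30% to ~38%, and when I increase it further to 2048, the load goes to ~45%. I have 6 GB of GPU memory, and the dataset consists of single-channel 28x28 images. Am I really supposed to use such a large batch size? Should I increase it further?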
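To put the batch sizes above in perspective, the following back-of-the-envelope sketch estimates how much memory one batch of input images occupies compared with the 6 GB card. It assumes float32 inputs and counts only the input tensor, not weights, activations, or optimizer state; the helper name input_batch_bytes is made up for illustration.

```python
# Assumptions (not stated in the question): inputs are stored as float32,
# and 1 GB is counted as 1024**3 bytes. Only the input batch is counted.
BYTES_PER_FLOAT32 = 4
IMAGE_HEIGHT, IMAGE_WIDTH, CHANNELS = 28, 28, 1
GPU_MEMORY_BYTES = 6 * 1024**3


def input_batch_bytes(batch_size):
    """Bytes needed to hold one batch of input images (hypothetical helper)."""
    values_per_image = IMAGE_HEIGHT * IMAGE_WIDTH * CHANNELS
    return batch_size * values_per_image * BYTES_PER_FLOAT32


for batch_size in (128, 512, 2048):
    size = input_batch_bytes(batch_size)
    share = 100.0 * size / GPU_MEMORY_BYTES
    print('batch %5d: %8.2f MiB (%.4f%% of GPU memory)'
          % (batch_size, size / 1024**2, share))
```

Under these assumptions even the largest batch tried here takes only a few MiB of input data, so GPU memory is unlikely to be what limits the batch size for this dataset.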

In general, should I worry about the low load, and does it really indicate that I am training inefficiently?
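One way to approach this question empirically is to measure training throughput (examples per second) directly rather than relying on the GPU load figure. The sketch below is a generic timing harness; measure_throughput and fake_step are hypothetical names, and fake_step is only a stand-in for the real call such as sess.run(train_op, ...).

```python
import time


def measure_throughput(step_fn, batch_size, num_steps=50, warmup_steps=5):
    """Return examples processed per second by repeatedly calling step_fn."""
    # Warm-up calls are excluded from timing (first steps are often slower).
    for _ in range(warmup_steps):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * num_steps / elapsed


def fake_step():
    # Stand-in for one training step; replace with the real sess.run(...) call.
    sum(i * i for i in range(10000))


print('%.0f examples/sec' % measure_throughput(fake_step, batch_size=128))
```

Running the same harness at several batch sizes shows whether a larger batch actually processes more examples per second, which is a more direct signal than the utilization percentage.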

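Here is a GPU-Z screenshot taken with a batch size of 128 images. You can see low load with occasional spikes to 100%, which occur when I measure accuracy on the entire dataset after each epoch.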

Answer 1

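MNIST-size networks are tiny, and it is hard to achieve high GPU (or CPU) efficiency for them; I think 30% is not unusual for your application. With a larger batch size you get higher computational efficiency, meaning you can process more examples per second, but you also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it is a trade-off. For tiny character models like yours, statistical efficiency drops off very quickly beyond a batch size of about 100, so it is probably not worth trying to grow the batch size for training. For inference, you should use the largest batch size you can.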
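To illustrate how small this network is, here is a rough estimate of the matrix-multiply work in one training step, using the layer widths from the question's code (1024, 300, 50, then 10 classes). It assumes each 28x28 image is flattened to 784 inputs, uses the common rule of thumb that the backward pass costs about twice the forward pass, and ignores biases, batch normalization, dropout, and the optimizer update. The GPU throughput figure is a hypothetical placeholder, not a measured value.

```python
# Layer widths from the question's code; 784 assumes flattened 28x28 inputs.
LAYER_SIZES = [28 * 28, 1024, 300, 50, 10]

# Hypothetical sustained GPU throughput in FLOP/s (an assumption, not measured).
ASSUMED_GPU_FLOPS = 5e12


def matmul_weight_count(layer_sizes):
    """Number of weights in the dense layers, excluding biases."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def training_step_flops(layer_sizes, batch_size):
    """Rough FLOP count: forward is 2 FLOPs per weight per example,
    and forward plus backward is taken as 3x the forward cost."""
    forward = 2 * matmul_weight_count(layer_sizes) * batch_size
    return 3 * forward


for batch_size in (128, 512, 2048):
    flops = training_step_flops(LAYER_SIZES, batch_size)
    ideal_ms = 1000.0 * flops / ASSUMED_GPU_FLOPS
    print('batch %5d: %.2e FLOPs, ideal time %.3f ms' % (batch_size, flops, ideal_ms))
```

Under these assumptions a batch of 128 needs well under a millisecond of pure arithmetic, so fixed per-step overhead (launching kernels, the Python-side run call, fetching the next batch) can plausibly take a comparable share of each step, which is consistent with the load rising as the batch size grows.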

Answer 2

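On my NVIDIA GTX 1080, if I use a convolutional neural network on the MNIST database, the GPU load is about 68%.

If I switch to a simple, non-convolutional network, the GPU load is about 20%.

You can replicate these results by building successively more advanced models from François Chollet's tutorial "Building Autoencoders in Keras".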

Answer 1: 11, accepted, 2017-09-11 00:38:30

Answer 2: 2, 2018-01-20 15:42:43