Low GPU utilisation when running Tensorflow

I've been doing deep reinforcement learning using Tensorflow and OpenAI gym. My problem is low GPU utilisation. Googling this issue, I understood that it's wrong to expect much GPU utilisation when training small networks (e.g. for training MNIST). But my neural network is not so small, I think. The architecture is similar (more or less) to the one given in the original DeepMind paper. The architecture of my network is summarized below:

  1. Convolution layer 1 (filters=32, kernel_size=8x8, strides=4)

  2. Convolution layer 2 (filters=64, kernel_size=4x4, strides=2)

  3. Convolution layer 3 (filters=64, kernel_size=3x3, strides=1)

  4. Dense layer (units=512)

  5. Output layer (units=9)

I am training on a Tesla P100 16GB GPU. My learning algorithm is simple DQN (again, from the DeepMind paper). Hyper-parameters are all as given in the paper. Still, GPU utilisation is well below 10% (as shown by nvidia-smi). What could be the possible issue?

import tensorflow as tf
import numpy as np
import os, sys
import gym
from collections import deque
from time import sleep


os.environ['CUDA_VISIBLE_DEVICES'] = '1'


def reset_graph(seed=142):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)


def preprocess_observation(obs):
    img = obs[34:210:2, ::2] # crop and downsize
    return np.mean(img, axis=2).reshape(88, 80) / 255.0


def combine_observations_multichannel(preprocessed_observations):
    return np.array(preprocessed_observations).transpose([1, 2, 0])


n_observations_per_state = 3
preprocessed_observations = deque([], maxlen=n_observations_per_state)
env = gym.make("Breakout-v0")
obs = env.reset()


input_height = 88
input_width = 80
input_channels = 3
conv_n_maps = [32, 64, 64]
conv_kernel_sizes = [(8,8), (4,4), (3,3)]
conv_strides = [4, 2, 1]
conv_paddings = ["SAME"] * 3 
conv_activation = [tf.nn.relu] * 3
n_hidden_in = 64 * 11 * 10  # conv3 has 64 maps of 11x10 each (88x80 input, SAME padding)
n_hidden = 512
hidden_activation = tf.nn.relu
n_outputs = env.action_space.n  # number of discrete actions available
initializer = tf.variance_scaling_initializer()


def q_network(X_state, name):
    prev_layer = X_state
    with tf.variable_scope(name) as scope:
        for n_maps, kernel_size, strides, padding, activation in zip(
                conv_n_maps, conv_kernel_sizes, conv_strides,
                conv_paddings, conv_activation):
            prev_layer = tf.layers.conv2d(
                prev_layer, filters=n_maps, kernel_size=kernel_size,
                strides=strides, padding=padding, activation=activation,
                kernel_initializer=initializer)
        last_conv_layer_flat = tf.reshape(prev_layer, shape=[-1, n_hidden_in])
        hidden = tf.layers.dense(last_conv_layer_flat, n_hidden,
                                 activation=hidden_activation,
                                 kernel_initializer=initializer)
        outputs = tf.layers.dense(hidden, n_outputs,
                                  kernel_initializer=initializer)
    trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                       scope=scope.name)
    trainable_vars_by_name = {var.name[len(scope.name):]: var
                              for var in trainable_vars}
    return outputs, trainable_vars_by_name


X_state = tf.placeholder(tf.float32, shape=[None, input_height, input_width,
                                            input_channels])
online_q_values, online_vars = q_network(X_state, name="q_networks/online")
target_q_values, target_vars = q_network(X_state, name="q_networks/target")

copy_ops = [target_var.assign(online_vars[var_name])
            for var_name, target_var in target_vars.items()]
copy_online_to_target = tf.group(*copy_ops)


learning_rate = 0.001
momentum = 0.95

with tf.variable_scope("train"):
    X_action = tf.placeholder(tf.int32, shape=[None])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    q_value = tf.reduce_sum(online_q_values * tf.one_hot(X_action, n_outputs),
                            axis=1, keep_dims=True)
    loss = tf.reduce_mean((y - q_value) ** 2) 

    global_step = tf.Variable(0, trainable=False, name='global_step')
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum, use_nesterov=True)
    training_op = optimizer.minimize(loss, global_step=global_step)


replay_memory_size = 500000
replay_memory = deque([], maxlen=replay_memory_size)

def sample_memories(batch_size):
    indices = np.random.permutation(len(replay_memory))[:batch_size]
    cols = [[], [], [], [], []] # state, action, reward, next_state, continue
    for idx in indices:
        memory = replay_memory[idx]
        for col, value in zip(cols, memory):
            col.append(value)
    cols = [np.array(col) for col in cols]
    return cols[0], cols[1], cols[2].reshape(-1, 1), cols[3], cols[4].reshape(-1, 1)


eps_min = 0.1
eps_max = 1.0
eps_decay_steps = 2000000


def epsilon_greedy(q_values, step):
    epsilon = max(eps_min, eps_max - (eps_max-eps_min) * step/eps_decay_steps)
    if np.random.rand() < epsilon:
        return np.random.randint(n_outputs) # random action
    else:
        return np.argmax(q_values) # optimal action

n_steps = 4000000  # total number of training steps
training_start = 10000  # start training after 10,000 game iterations
training_interval = 4  # run a training step every 4 game iterations
save_steps = 1000  # save the model every 1,000 training steps
copy_steps = 10000  # copy online DQN to target DQN every 10,000 training steps
discount_rate = 0.99
skip_start = 5  # Skip the start of every game (it's just waiting time).
batch_size = 64
iteration = 0  # game iterations
checkpoint_dir = './saved_networks'
checkpoint_path = "./saved_networks/dqn_breakout.cpkt"
summary_path = "./summary/"
done = True # env needs to be reset

# Summary variables
svar_reward = tf.Variable(tf.zeros([1], dtype=tf.int32)) # Episode reward
svar_mmq = tf.Variable(tf.zeros([1], dtype=tf.float32)) # Episode Mean-Max-Q
svar_loss = tf.Variable(tf.zeros([1], dtype=tf.float64))
all_svars = [svar_reward, svar_mmq, svar_loss]
tf.summary.scalar("Episode Reward", tf.squeeze(svar_reward))
tf.summary.scalar("Episode Mean-Max-Q", tf.squeeze(svar_mmq))
tf.summary.scalar("Episode MSE", tf.squeeze(svar_loss))
# Placeholders
svar_reward_p, svar_mmq_p =  tf.placeholder(tf.int32, [1]), tf.placeholder(tf.float32, [1])
svar_loss_p = tf.placeholder(tf.float64, [1])
svars_placeholders = [svar_reward_p,  svar_mmq_p, svar_loss_p]

# Assign operation
summary_assign_op = [all_svars[i].assign(svars_placeholders[i]) for i in range(len(svars_placeholders))]
writer = tf.summary.FileWriter(summary_path)
summary_op = tf.summary.merge_all()
# For keeping track of no. of episodes played.
episode_step = tf.Variable(tf.zeros([1], dtype=tf.int64), trainable=False)
inc_episode_count = episode_step.assign_add([1])


init = tf.global_variables_initializer()
saver = tf.train.Saver()


loss_val = np.infty
game_length = 0
total_max_q = 0
mean_max_q = 0.0
ep_reward = 0
ep_loss = 0.

with tf.Session() as sess:
    if os.path.isfile(checkpoint_path + ".index"):
        saver.restore(sess, checkpoint_path)
        print("<--------------------- Graph restored! -------------------------->")
    else:
        print("<--------- No checkpoints found! Starting over.. ---------------->")
        init.run()
        copy_online_to_target.run()
    while True:
        step = global_step.eval()
        if step >= n_steps:
            break
        iteration += 1
        print("\rIteration {}\tTraining step {}/{} ({:.1f})%\tLoss {:5f}\tMean Max-Q {:5f}   ".format(
            iteration, step, n_steps, step * 100 / n_steps, loss_val, mean_max_q), end="")
        if done: # game over, start again
            obs = env.reset()
            # Clear observations from the past episode
            preprocessed_observations.clear()
            for skip in range(skip_start): # skip the start of each game
                obs, reward, done, info = env.step(0) # Do nothing
                preprocessed_observations.append(preprocess_observation(obs))
            state = combine_observations_multichannel(preprocessed_observations)
        # Online DQN evaluates what to do
        q_values = online_q_values.eval(feed_dict={X_state: [state]})
        action = epsilon_greedy(q_values, step)

        # Online DQN plays
        obs, reward, done, info = env.step(action)
        ep_reward += reward
        preprocessed_observations.append(preprocess_observation(obs))
        next_state = combine_observations_multichannel(preprocessed_observations)

        # Let's memorize what happened
        replay_memory.append((state, action, reward, next_state, 1.0 - done))
        state = next_state

        # Compute statistics for tracking progress
        total_max_q += q_values.max()
        game_length += 1
        if done:
            mean_max_q = total_max_q / game_length
            # Write summary -- start
            if iteration >= training_start:
                sess.run(summary_assign_op, feed_dict={
                    svar_reward_p: [ep_reward],
                    svar_mmq_p: [mean_max_q],
                    svar_loss_p: [ep_loss],
                })
                summaries_str = sess.run(summary_op)
                writer.add_summary(summaries_str, sess.run(episode_step))
                sess.run(inc_episode_count)
            # Write summary -- end
            total_max_q = 0.0
            game_length = ep_reward = ep_loss = 0

        if iteration < training_start or iteration % training_interval != 0:
            continue # only train after warmup period and at regular intervals

        # Sample memories and use the target DQN to produce the target Q-Value
        X_state_val, X_action_val, rewards, X_next_state_val, continues = (
            sample_memories(batch_size))
        next_q_values = target_q_values.eval(
            feed_dict={X_state: X_next_state_val})
        max_next_q_values = np.max(next_q_values, axis=1, keepdims=True)
        y_val = rewards + continues * discount_rate * max_next_q_values

        # Train the online DQN
        _, loss_val = sess.run([training_op, loss], feed_dict={
            X_state: X_state_val, X_action: X_action_val, y: y_val})
        ep_loss += loss_val
        # Regularly copy the online DQN to the target DQN
        if step % copy_steps == 0:
            copy_online_to_target.run()

        # And save regularly
        if step % save_steps == 0:
            saver.save(sess, checkpoint_path)

It's difficult to tell for sure without more details (such as seeing your code, knowing which gym environments you're training on, CPU utilisation, hyperparameter values, ...). Some possible reasons:

  • Low batch size
  • The environment's step() function still runs on your CPU; if that part takes a lot of time, your GPU will sit idle for a while on every iteration (see the timing sketch after this list)
  • The same applies to all the other code in every iteration of your training loop (such as tracking results, storing transitions in the replay buffer, and fetching samples from it)
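
A quick way to check the last two points is to time the CPU-side environment step against the GPU-side forward pass. This is a rough sketch of my own (not part of the original answer), reusing the variables defined in the question's code:

import time

t0 = time.time()
obs, reward, done, info = env.step(action)                     # runs on the CPU
env_time = time.time() - t0

t0 = time.time()
q_values = online_q_values.eval(feed_dict={X_state: [state]})  # runs on the GPU
gpu_time = time.time() - t0

print("env.step(): %.4f s   forward pass: %.4f s" % (env_time, gpu_time))

If env.step() plus the Python bookkeeping dominates, the GPU will inevitably sit idle for most of each iteration, no matter how large the network is.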

EDIT after code was added to the question:

After briefly inspecting the code, I suspect the easiest way to increase GPU utilisation will be to reduce the training_interval parameter's value from 4 to, for example, 1. Basically, all the TensorFlow-based code will be running on the GPU (or at least it should be), and all the other code runs on the CPU. In iterations where you do not train, only the forward pass through the network to compute the Q-values runs on the GPU, and all other code runs on the CPU. In iterations where you do train, much more code runs on the GPU: the additional forward pass over samples from the replay buffer, and the matching backward pass that updates the network's parameters. So, if you want to increase GPU utilisation, you'll want to increase how often you run the code that actually executes on the GPU.
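
A minimal sketch of that change (the values below are examples of my own choosing; the original code uses training_interval = 4 and batch_size = 64):

training_interval = 1   # run a training step on every game iteration instead of every 4th
batch_size = 256        # optional: larger minibatches also keep the GPU busier per update

Keep in mind that both values also affect the learning dynamics, so they are tuning knobs rather than a free speed-up.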

Apart from that, I think it's also possible to move some of the computations you currently do outside of TensorFlow into TensorFlow (and therefore move them from CPU to GPU). For example, you do the epsilon-greedy action selection outside of TensorFlow, whereas the OpenAI Baselines DQN implementation does it within TensorFlow.
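
A minimal sketch of that idea, assuming TF 1.x and reusing online_q_values, n_outputs and X_state from the question's code (this is my own illustration, not the Baselines implementation):

eps_ph = tf.placeholder(tf.float32, shape=[], name="epsilon")
batch = tf.shape(online_q_values)[0]
greedy_action = tf.cast(tf.argmax(online_q_values, axis=1), tf.int32)
random_action = tf.random_uniform([batch], minval=0, maxval=n_outputs, dtype=tf.int32)
explore = tf.random_uniform([batch]) < eps_ph
eps_greedy_action = tf.where(explore, random_action, greedy_action)

# In the main loop, replacing the NumPy epsilon_greedy() call:
# epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
# action = sess.run(eps_greedy_action,
#                   feed_dict={X_state: [state], eps_ph: epsilon})[0]

This avoids copying the Q-values back to the host just to pick an action, although for a single state per step the saving is modest.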

Both the original DQN network and the network you use are very small for a Tesla P100 GPU. If you want to utilise it more, you can run multiple experiments on the same GPU.
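
For example (a minimal sketch, assuming TF 1.x), each process can be told to allocate GPU memory on demand, or to reserve only a fixed fraction of the 16 GB, so that several independent runs fit on the one card:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True            # allocate GPU memory on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.25  # or cap each run at ~25%

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop as in the question ...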
