
Reinforcement learning: why did the performance collapse?

I am trying to train an agent on the ViZDoom platform's deadly_corridor scenario, using the A3C algorithm and TensorFlow on a TITAN X GPU server. However, the performance collapsed after about two days of training, as you can see in the following picture.

[image: training performance plot showing the collapse]

There are 6 demons in the corridor, and the agent must kill at least 5 of them to reach the destination and get the vest.

Here is the code of the network:

with tf.variable_scope(scope):
    self.inputs = tf.placeholder(shape=[None, *shape, 1], dtype=tf.float32)  # `shape` is presumably the (height, width) of the preprocessed frame
    self.conv_1 = slim.conv2d(activation_fn=tf.nn.relu, inputs=self.inputs, num_outputs=32,
                              kernel_size=[8, 8], stride=4, padding='SAME')
    self.conv_2 = slim.conv2d(activation_fn=tf.nn.relu, inputs=self.conv_1, num_outputs=64,
                              kernel_size=[4, 4], stride=2, padding='SAME')
    self.conv_3 = slim.conv2d(activation_fn=tf.nn.relu, inputs=self.conv_2, num_outputs=64,
                              kernel_size=[3, 3], stride=1, padding='SAME')
    self.fc = slim.fully_connected(slim.flatten(self.conv_3), 512, activation_fn=tf.nn.elu)

    # LSTM
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(cfg.RNN_DIM, state_is_tuple=True)
    c_init = np.zeros((1, lstm_cell.state_size.c), np.float32)
    h_init = np.zeros((1, lstm_cell.state_size.h), np.float32)
    self.state_init = [c_init, h_init]
    c_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.c])
    h_in = tf.placeholder(tf.float32, [1, lstm_cell.state_size.h])
    self.state_in = (c_in, h_in)
    rnn_in = tf.expand_dims(self.fc, [0])
    step_size = tf.shape(self.inputs)[:1]
    state_in = tf.contrib.rnn.LSTMStateTuple(c_in, h_in)
    lstm_outputs, lstm_state = tf.nn.dynamic_rnn(lstm_cell,
                                                 rnn_in,
                                                 initial_state=state_in,
                                                 sequence_length=step_size,
                                                 time_major=False)
    lstm_c, lstm_h = lstm_state
    self.state_out = (lstm_c[:1, :], lstm_h[:1, :])
    rnn_out = tf.reshape(lstm_outputs, [-1, 256])  # note: 256 must equal cfg.RNN_DIM (the LSTM size) for this reshape to be valid

    # Output layers for policy and value estimations
    self.policy = slim.fully_connected(rnn_out,
                                       cfg.ACTION_DIM,
                                       activation_fn=tf.nn.softmax,
                                       biases_initializer=None)
    self.value = slim.fully_connected(rnn_out,
                                      1,
                                      activation_fn=None,
                                      biases_initializer=None)
    if scope != 'global' and not play:
        self.actions = tf.placeholder(shape=[None], dtype=tf.int32)
        self.actions_onehot = tf.one_hot(self.actions, cfg.ACTION_DIM, dtype=tf.float32)
        self.target_v = tf.placeholder(shape=[None], dtype=tf.float32)
        self.advantages = tf.placeholder(shape=[None], dtype=tf.float32)

        self.responsible_outputs = tf.reduce_sum(self.policy * self.actions_onehot, axis=1)

        # Loss functions
        self.policy_loss = -tf.reduce_sum(self.advantages * tf.log(self.responsible_outputs+1e-10))
        self.value_loss = tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value, [-1])))
        self.entropy = -tf.reduce_sum(self.policy * tf.log(self.policy+1e-10))

        # Get gradients from local network using local losses
        local_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
        # The last two trainable variables in the scope are the policy and value
        # output weights (created in that order); each head's variable list shares
        # the common trunk and adds its own output layer.
        value_var, policy_var = local_vars[:-2] + [local_vars[-1]], local_vars[:-2] + [local_vars[-2]]
        self.var_norms = tf.global_norm(local_vars)

        self.value_gradients = tf.gradients(self.value_loss, value_var)
        value_grads, self.grad_norms_value = tf.clip_by_global_norm(self.value_gradients, 40.0)

        self.policy_gradients = tf.gradients(self.policy_loss, policy_var)
        policy_grads, self.grad_norms_policy = tf.clip_by_global_norm(self.policy_gradients, 40.0)

        # Apply local gradients to global network
        global_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'global')
        global_vars_value, global_vars_policy = \
            global_vars[:-2] + [global_vars[-1]], global_vars[:-2] + [global_vars[-2]]

        self.apply_grads_value = optimizer.apply_gradients(zip(value_grads, global_vars_value))
        self.apply_grads_policy = optimizer.apply_gradients(zip(policy_grads, global_vars_policy))
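
Not shown in the snippet, but for context: an A3C worker typically also copies the global parameters back into its local scope before collecting the next rollout. A minimal sketch of such a sync op, assuming the same scope naming as above ('global' for the shared network, the worker's `scope` for the local one):

def update_target_graph(from_scope, to_scope):
    # Build assign ops that copy every trainable variable in `from_scope` into
    # the matching variable in `to_scope` (relies on identical creation order).
    from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, from_scope)
    to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, to_scope)
    return [to_var.assign(from_var) for from_var, to_var in zip(from_vars, to_vars)]

# e.g. inside a worker: sess.run(update_target_graph('global', scope))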

And the optimizer is:

optimizer = tf.train.RMSPropOptimizer(learning_rate=1e-5)
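
For reference only (an assumption about typical settings, not a claim about what is correct here): public A3C/A2C implementations commonly run RMSProp with a decay of 0.99 and a larger epsilon than TensorFlow's default, along the lines of the sketch below. The learning rate shown is illustrative.

# Illustrative RMSProp configuration often seen in A3C/A2C code; not the questioner's setting.
optimizer = tf.train.RMSPropOptimizer(learning_rate=7e-4, decay=0.99, epsilon=1e-5)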

And here are some summaries of the gradients and norms:

[image: TensorBoard summaries of the gradients and norms]

I hope someone can help me tackle this problem.

Now, personally, I think the reason the agent's performance collapsed may be over-optimization of the value estimates. I read a paper on Double DQN about this; see "Deep Reinforcement Learning with Double Q-learning".
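
To make that intuition concrete, here is a minimal NumPy sketch (not the question's code) contrasting the standard Q-learning target with the Double Q-learning target from that paper. q_online_next and q_target_next are hypothetical action-value vectors for the next state:

import numpy as np

def dqn_target(q_target_next, reward, gamma):
    # Standard target: the same (target) network both selects and evaluates
    # the next action, which tends to overestimate values.
    return reward + gamma * np.max(q_target_next)

def double_dqn_target(q_online_next, q_target_next, reward, gamma):
    # Double DQN: the online network selects the action, the target network
    # evaluates it, which reduces the overestimation bias.
    best_action = np.argmax(q_online_next)
    return reward + gamma * q_target_next[best_action]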
