DeepRL: understanding batch loss value for DQN

I am trying to understand how batch loss is computed. I have modelled my DQN as follows:

import tensorflow as tf

class DQN:

    def __init__(self, session, state_dim, action_dim, lr, nodes):
        self.sess = session
        self.s_dim = state_dim
        self.a_dim = action_dim
        self.learning_rate = lr
        self.nodes = nodes

        self.state = tf.placeholder("float", [None, self.s_dim], name="state_batch")
        # one-hot encoded action taken for each sample
        self.action = tf.placeholder("float", [None, self.a_dim], name="action_batch")
        # target Q-value for the chosen action, one scalar per sample
        self.predicted_q_value = tf.placeholder("float", [None, 1], name="prediction_batch")

        self.q_out = self.create_network()
        self.loss = tf.reduce_mean(tf.square(self.predicted_q_value - tf.reduce_sum(self.q_out * self.action)))
        self.optimize = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)

    def create_network(self):
        h0 = tf.layers.dense(inputs=self.state, units=self.nodes, activation=tf.nn.relu)
        h1 = tf.layers.dense(inputs=h0, units=self.nodes, activation=tf.nn.relu)
        out = tf.layers.dense(inputs=h1, units=self.a_dim, activation=None)
        return out

    def train(self, state, action, predicted_q_value):
        return self.sess.run([self.loss, self.optimize], feed_dict={
            self.state: state,
            self.action: action,
            self.predicted_q_value: predicted_q_value
        })

    def predict(self, state):
        return self.sess.run(self.q_out, feed_dict={
            self.state: state
        })

As per my understanding, the loss should be the mean of the per-sample losses across the batch. But I see that the total loss value is being multiplied by the square of the batch size.

sess = tf.Session()
nw = DQN(sess, 3, 3, 0.0001, 64)
sess.run(tf.global_variables_initializer())    

# batch size is 1
state_ip = [[1, 1, 1]]
action_ip = [[0, 1, 0]]
pred_val = [[0]]
print(nw.predict(state_ip))
loss, _ = nw.train(state_ip, action_ip, pred_val)
print(loss)

[[ 0.11640665  0.10434964 -0.31503427]]
0.010888848     # loss is as expected = (0 - 0.10434964)^2

If I pass data for batch size 2 with exactly the same values:

state_ip = [[1, 1, 1], [1, 1, 1]]
action_ip = [[0, 1, 0], [0, 1, 0]]
pred_val = [[0], [0]]
print(nw.predict(state_ip))
loss, _ = nw.train(state_ip, action_ip, pred_val)
print(loss)

[[-0.28207895 -0.15026638 -0.0181574 ]
 [-0.28207895 -0.15026638 -0.0181574 ]]
0.09031994  # loss = (0 - -0.15026638)^2 * 2^2

As I have used tf.reduce_mean for the loss, I was expecting the loss to be the mean of the per-sample losses across the batch. Why is it being multiplied by the square of the batch size? Am I missing something basic here?

Your mistake is in how you compute the loss: specifically, tf.reduce_sum(self.q_out * self.action) computes a global sum across the whole tensor. Step by step (a NumPy sketch reproducing these numbers follows the list):

  1. self.q_out * self.action gives you [[0, -0.15026638, 0], [0, -0.15026638, 0]]
  2. tf.reduce_sum of (1) gives 2 * -0.15026638 = -0.30053276
  3. Now you subtract from 0 and square, resulting in 0.30053276**2 = 0.09031994
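
A minimal NumPy sketch of these three steps, plugging in the q_out values from the batch-size-2 run above:

import numpy as np

q_out = np.array([[-0.28207895, -0.15026638, -0.0181574],
                  [-0.28207895, -0.15026638, -0.0181574]])
action = np.array([[0., 1., 0.],
                   [0., 1., 0.]])
pred = np.array([[0.], [0.]])

masked = q_out * action                       # step 1: zeros everywhere except the taken action
global_sum = np.sum(masked)                   # step 2: -0.30053276, summed over the WHOLE batch
loss = np.mean(np.square(pred - global_sum))  # step 3
print(loss)                                   # 0.09031994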

The mistake is, as you probably realized, in step 2, because you want to get [-0.15026638, -0.15026638] as output, and this can be achieved with the axis argument. The correct way of computing the loss is, therefore:

self.loss = tf.reduce_mean(tf.square(
    self.predicted_q_value - tf.reduce_sum(self.q_out * self.action, axis=1, keepdims=True)
))
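
Note the keepdims=True: tf.reduce_sum(..., axis=1) alone returns shape [None], while self.predicted_q_value has shape [None, 1], so without it the subtraction would broadcast into a [None, None] matrix; keepdims=True keeps the per-sample sums at [None, 1] so the error stays element-wise. With this loss, the batch-size-2 example above gives (0 - (-0.15026638))^2 ≈ 0.02258 for each sample, and tf.reduce_mean returns that same value regardless of batch size.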
