DQN understanding input and output (layer)

I have a question about the input and output (layer) of a DQN.

For example:

Two points: P1(x1, y1) and P2(x2, y2)

P1 has to walk towards P2.

I have the following information:

  • Current position of P1 (x/y)
  • Current position of P2 (x/y)
  • Distance from P1 to P2 (x/y)
  • Direction from P1 to P2 (x/y)

P1 has 4 possible actions:

  • Up
  • Down
  • Left
  • Right

How do I have to set up the input and output layers?

  • 4 input nodes
  • 4 output nodes

Is that correct? What do I have to do with the output? I get 4 arrays with 4 values each as output. Is doing argmax on the output correct?

Edit:

Input / State:

import math
import numpy as np

# Current position P1
state_pos = [x_POS, y_POS]
state_pos = np.asarray(state_pos, dtype=np.float32)
# Current position P2
state_wp = [wp_x, wp_y]
state_wp = np.asarray(state_wp, dtype=np.float32)
# Distance P1 - P2
state_dist_wp = [wp_x - x_POS, wp_y - y_POS]
state_dist_wp = np.asarray(state_dist_wp, dtype=np.float32)
# Direction P1 - P2 (unit vector)
distance = [wp_x - x_POS, wp_y - y_POS]
norm = math.sqrt(distance[0] ** 2 + distance[1] ** 2)
state_direction_wp = [distance[0] / norm, distance[1] / norm]
state_direction_wp = np.asarray(state_direction_wp, dtype=np.float32)
state = [state_pos, state_wp, state_dist_wp, state_direction_wp]
state = np.array(state)  # shape (4, 2)

Network:

def __init__(self):
    self.q_net = self._build_dqn_model()
    self.epsilon = 1 

def _build_dqn_model(self):
    q_net = Sequential()
    q_net.add(Dense(4, input_shape=(4,2), activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
    rms = tf.optimizers.RMSprop(learning_rate=1e-4)
    q_net.compile(optimizer=rms, loss='mse')
    return q_net

def random_policy(self, state):
    return np.random.randint(0, 4)

def collect_policy(self, state):
    if np.random.random() < self.epsilon:
        return self.random_policy(state)
    return self.policy(state)

def policy(self, state):
    # Here I get 4 arrays with 4 values each as output
    action_q = self.q_net(state)

Adding input_shape=(4,2) in the first Dense layer is causing the output shape to be (None, 4, 4). Defining q_net the following way solves it:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Reshape

def _build_dqn_model(self):
    q_net = Sequential()
    # Flatten the (4, 2) state into a single 8-value input vector
    q_net.add(Reshape(target_shape=(8,), input_shape=(4,2)))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
    rms = tf.optimizers.RMSprop(learning_rate=1e-4)
    q_net.compile(optimizer=rms, loss='mse')
    return q_net

Here, q_net.add(Reshape(target_shape=(8,), input_shape=(4,2))) reshapes the (None, 4, 2) input to (None, 8) [here, None represents the batch shape].

To verify, print q_net.output_shape and it should be (None, 4) [whereas in the previous case it was (None, 4, 4)].
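
As a quick illustration of why this happens (a minimal sketch, not part of the original answer), you can compare the two definitions side by side; a Dense layer only transforms the last axis of a (None, 4, 2) input, which is why the extra dimension survives:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Reshape

# Original definition: Dense acts on the last axis only, so the 4 rows survive
old_net = Sequential([Dense(4, input_shape=(4, 2), activation='relu'),
                      Dense(128, activation='relu'),
                      Dense(4, activation='linear')])
print(old_net.output_shape)   # (None, 4, 4)

# Corrected definition: flatten the (4, 2) state to 8 features first
new_net = Sequential([Reshape(target_shape=(8,), input_shape=(4, 2)),
                      Dense(128, activation='relu'),
                      Dense(4, activation='linear')])
print(new_net.output_shape)   # (None, 4)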

You also need to do one more thing. Recall that input_shape does not take the batch shape into account. What I mean is, input_shape=(4,2) expects inputs of shape (batch_shape, 4, 2). Verify it by printing q_net.input_shape and it should output (None, 4, 2). Now, what you have to do is add a batch dimension to your input. Simply do the following:

state_with_batch_dim = np.expand_dims(state,0)

And pass state_with_batch_dim to q_net as input. For example, you can call the policy method you wrote like policy(np.expand_dims(state,0)) and get an output of dimension (batch_shape, 4) [in this case (1, 4)].
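
As a rough sketch of how the rest of the policy method could then look (the greedy-action step is an assumption on top of the original code, not something stated in the question):

def policy(self, state_with_batch_dim):
    # Forward pass: q-values of shape (1, 4)
    action_q = self.q_net(state_with_batch_dim)
    # Greedy action: index of the largest q-value
    return int(np.argmax(action_q.numpy()[0]))

# Usage, where 'agent' is the (hypothetical) agent instance:
action = agent.policy(np.expand_dims(state, 0))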

And here are the answers to your initial questions:

  1. Your output layer should have 4 nodes (units).
  2. Your first dense layer does not necessarily have to have 4 nodes (units). If you consider the Reshape layer, the notion of nodes or units does not fit there. You can think of the Reshape layer as a placeholder that takes a tensor of shape (None, 4, 2) and outputs a reshaped tensor of shape (None, 8).
  3. Now, you should get outputs of shape (None, 4) - there, the 4 values represent the q-values of the 4 corresponding actions. No need to do argmax here to find the q-values.

It could make sense to feed the DQN some information on the direction it's currently facing too. You could set it up as (Current Pos X, Current Pos Y, X From Goal, Y From Goal, Direction).
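
A minimal sketch of that flat state vector (the heading value and the helper name are assumptions for illustration, e.g. an angle in radians):

import numpy as np

def build_state(x_POS, y_POS, wp_x, wp_y, heading):
    # (Current Pos X, Current Pos Y, X From Goal, Y From Goal, Direction)
    return np.asarray([x_POS, y_POS, wp_x - x_POS, wp_y - y_POS, heading],
                      dtype=np.float32)

# With a flat 5-dimensional state, the first layer would take input_shape=(5,) and no Reshape is needed.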

The output layer should just be (Up, Left, Down, Right) in an order you determine. An argmax on the output is suitable for the problem. The exact code depends on whether you're using TF or PyTorch.
