What is the purpose of [np.arange(0, self.batch_size), action] after the neural network?

Question

I followed a PyTorch tutorial to learn reinforcement learning (TRAIN A MARIO-PLAYING RL AGENT), but I am confused by the following code:

current_Q = self.net(state, model="online")[np.arange(0, self.batch_size), action] # Q_online(s,a)

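What is the purpose of [np.arange(0, self.batch_size), action] after the neural network call? (I know that TD_estimate takes state and action; I am only confused about this on the programming side.) What is this usage, where a list is placed right after self.net(...)?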

More related code, quoted from the tutorial:

import copy

from torch import nn


class MarioNet(nn.Module):

    def __init__(self, input_dim, output_dim):
        super().__init__()
        c, h, w = input_dim

        if h != 84:
            raise ValueError(f"Expecting input height: 84, got: {h}")
        if w != 84:
            raise ValueError(f"Expecting input width: 84, got: {w}")

        self.online = nn.Sequential(
            nn.Conv2d(in_channels=c, out_channels=32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim),
        )

        self.target = copy.deepcopy(self.online)

        # Q_target parameters are frozen.
        for p in self.target.parameters():
            p.requires_grad = False

    def forward(self, input, model):
        if model == "online":
            return self.online(input)
        elif model == "target":
            return self.target(input)

self.net:

self.net = MarioNet(self.state_dim, self.action_dim).float()

Thanks for your help!

Answer 1

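Essentially, what happens here is that the network's output is sliced to get the desired part of the Q table.

The (somewhat confusing) index [np.arange(0, self.batch_size), action] supplies one index per axis. For the axis with index 1, we pick the item indicated by action. For the axis with index 0, we pick all items from 0 up to, but not including, self.batch_size. When action holds one index per row, the two index arrays are paired element by element, so row i contributes the entry in column action[i].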
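The pairing described above can be seen in a minimal sketch. It uses a plain NumPy array as a hypothetical stand-in for the network output (the names q_values, rows and current_q are illustrative and not from the tutorial); PyTorch tensors follow the same integer-array indexing rules.

```python
import numpy as np

# Hypothetical stand-in for self.net(state, model="online"):
# 3 states in the batch, 4 possible actions per state.
q_values = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [1.1, 1.2, 1.3, 1.4],
    [2.1, 2.2, 2.3, 2.4],
])
batch_size = q_values.shape[0]
action = np.array([2, 0, 3])  # one chosen action per state

rows = np.arange(0, batch_size)  # array([0, 1, 2])

# The two index arrays are paired element-wise, so this reads
# q_values[0, 2], q_values[1, 0] and q_values[2, 3].
current_q = q_values[rows, action]
print(current_q)  # [0.3 1.1 2.4]
```

The result has one Q-value per sample, which is the shape a per-sample TD estimate needs.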

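If self.batch_size equals the length of dimension 0 of this array and action is a single integer, this slice can be simplified to [:, action], which is probably more familiar to most users. If action instead holds one index per sample, [:, action] selects those columns for every row and returns a batch_size by batch_size result, so the paired form with np.arange is the one that yields a single value per row.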
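A short sketch comparing the two forms, again with a hypothetical NumPy array in place of the network output (all variable names here are illustrative). It suggests the [:, action] shorthand matches the paired form when action is a single integer, but not when action carries one index per row.

```python
import numpy as np

# Hypothetical (batch=3, actions=4) output: [[0..3], [4..7], [8..11]].
q_values = np.arange(12, dtype=float).reshape(3, 4)
rows = np.arange(q_values.shape[0])

# Case 1: a single integer action shared by every row.
single_action = 1
paired = q_values[rows, single_action]
sliced = q_values[:, single_action]
same_for_scalar = np.array_equal(paired, sliced)  # both select column 1

# Case 2: one action index per row.
per_row_action = np.array([2, 0, 3])
picked = q_values[rows, per_row_action]   # shape (3,): one value per row
columns = q_values[:, per_row_action]     # shape (3, 3): three columns for every row

# The per-row values are only the diagonal of the column selection.
diag_matches = np.array_equal(np.diag(columns), picked)

print(same_for_scalar, picked.shape, columns.shape, diag_matches)
```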

Solution 1
1 2021-12-23 11:07:56
