
DQN not converging

I am trying to implement DQN in openai-gym's "lunar lander" environment.

It shows no sign of converging after 3000 episodes of training (for comparison, a very simple policy gradient method converges after 2000 episodes).

I went through my code several times but can't find what's wrong. I hope someone here can point out where the problem is. Below is my code:

I use a simple fully-connected network:

class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # 8-dimensional observation in, one Q-value per action (4 actions) out
        self.main = nn.Sequential(
            nn.Linear(8, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 4)
        )

    def forward(self, state):
        return self.main(state)

I use epsilon-greedy when choosing actions, and epsilon (starting from 0.5) decays exponentially over time:

    def sample_action(self, state):
        # decay epsilon multiplicatively on every action selection
        self.epsilon = self.epsilon * 0.99
        action_probs = self.network_train(state)
        random_number = random.random()
        if random_number < (1 - self.epsilon):
            # exploit: pick the action with the largest predicted Q-value
            action = torch.argmax(action_probs, dim=-1).item()
        else:
            # explore: pick a random action
            action = random.choice([0, 1, 2, 3])
        return action
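With this schedule, epsilon after t action selections is 0.5 * 0.99^t, so it drops below 0.01 after roughly 400 selections.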

When training, I use a replay buffer, a batch size of 64, and gradient clipping:

    def learn(self):
        if len(self.buffer) >= BATCH_SIZE:
            self.learn_counter += 1
            transitions = self.buffer.sample(BATCH_SIZE)
            batch = Transition(*zip(*transitions))
            state = torch.from_numpy(np.concatenate(batch.state)).reshape(-1, 8)
            action = torch.tensor(batch.action).reshape(-1, 1)
            reward = torch.tensor(batch.reward).reshape(-1, 1)
            # Q(s, a) for the actions actually taken, from the online network
            state_value = self.network_train(state).gather(1, action)
            next_state = torch.from_numpy(np.concatenate(batch.next_state)).reshape(-1, 8)
            # max_a' Q_target(s', a'), detached so no gradient flows into the target network
            next_state_value = self.network_target(next_state).max(1)[0].reshape(-1, 1).detach()
            # TD target: r + gamma * max_a' Q_target(s', a')
            loss = F.mse_loss(state_value.float(), (self.DISCOUNT_FACTOR * next_state_value + reward).float())
            self.optim.zero_grad()
            loss.backward()
            # clip each gradient element to [-1, 1] before the optimizer step
            for param in self.network_train.parameters():
                param.grad.data.clamp_(-1, 1)
            self.optim.step()
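The Transition tuple and self.buffer are not shown in the post; a minimal replay buffer consistent with how learn() uses them could look like the sketch below (the ReplayBuffer class name, push method, and capacity argument are assumptions, while the Transition fields match what learn() accesses):

import random
from collections import deque, namedtuple

# field names chosen to match how learn() unpacks the batch
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state'))

class ReplayBuffer:
    def __init__(self, capacity):
        # old transitions are discarded automatically once capacity is reached
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)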

I also use a target network; its parameters are updated every 100 timesteps:

    def update_network_target(self):
        # hard update: copy the online network's weights into the target network
        if (self.learn_counter % 100) == 0:
            self.network_target.load_state_dict(self.network_train.state_dict())

By the way, I use an Adam optimizer with a learning rate of 1e-3.
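For reference, the two networks and the optimizer would be wired together roughly like this (a minimal sketch reusing the attribute names from the snippets above; the discount factor value and buffer capacity are assumptions, since they are not stated in the post):

class Agent:
    def __init__(self):
        self.network_train = Net()    # online network, updated every learn() call
        self.network_target = Net()   # target network, updated by hard copies
        self.network_target.load_state_dict(self.network_train.state_dict())
        self.optim = torch.optim.Adam(self.network_train.parameters(), lr=1e-3)
        self.epsilon = 0.5
        self.learn_counter = 0
        self.DISCOUNT_FACTOR = 0.99           # assumed value, not given in the post
        self.buffer = ReplayBuffer(100000)    # assumed capacity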

Solved. Apparently the frequency of updating the target network was too high. I set it to update every 10 episodes and that fixed the problem.
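In code terms, the fix amounts to keying the hard update off the episode counter rather than the learn counter, roughly like this sketch (the episode argument is whatever episode index the training loop tracks):

    def update_network_target(self, episode):
        # hard update once every 10 episodes instead of every 100 learn steps
        if (episode % 10) == 0:
            self.network_target.load_state_dict(self.network_train.state_dict())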
