DQN doesn't learn

I'm trying to implement a DQN in the CartPole environment using PyTorch. I don't know why, but no matter how long I train the agent, even though the scores generally increase, they just fluctuate without maintaining high scores. The code was adapted from a DQN tutorial written for TensorFlow, which runs normally, but when I try to convert it to PyTorch, it doesn't learn. Here's the model:

class Net(nn.Module):
    def __init__(self, state_size, action_size):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, inputs):
        x = torch.from_numpy(inputs)
        x = F.relu(self.fc1(x.float()))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class DQNAgent(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQNAgent, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.model, self.criterion, self.optimizer = self.build_model()

        self.epsilon = 1.0
        self.epsilon_decay = 0.995
        self.epsilon_min = 0.01
        self.gamma = 0.95

    def build_model(self):
        model = Net(state_size, action_size)
        model = model.float()

        criterion = nn.MSELoss()
        optimizer = optim.Adam(model.parameters(), lr=0.001) # might need to return criterion and optimizer

        return model, criterion, optimizer

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)
        act_values = self.model(state)
        return np.argmax(act_values.detach().numpy())

    def replay(self, batch_size):
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            if done:
                target = reward
            elif not done:
                target = reward + self.gamma*torch.max(self.model(next_state)) # --> a tensor

            target_f = self.model(state)
            target_f[0][action] = target

            # self.model.fit(state, target_f, epochs=1, verbose=0)
            loss = self.criterion(self.model(state), target_f)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

        if self.epsilon > self.epsilon_min:
            self.epsilon = self.epsilon*self.epsilon_decay

    def load(self, name):
        pass

    def save(self, name):
        pass

... and train:

for e in range(n_episodes):
    state = env.reset()
    state = np.reshape(state, [1, state_size])

    for time in range(5000):
        # env.render()
        action = agent.act(state)

        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10

        next_state = np.reshape(next_state, [1, state_size])

        agent.remember(state, action, reward, next_state, done)

        state = next_state

        if done:
            print("Episode {}/{}, score: {}, e: {:.2}".format(e, n_episodes, time, agent.epsilon))
            break

    if len(agent.memory) > batch_size:
        agent.replay(batch_size)
        memory = agent.memory
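
For completeness: the training loop above relies on a few names that are never defined in the snippets (`env`, `agent`, `state_size`, `action_size`, `n_episodes`, `batch_size`). A plausible setup, roughly matching the usual CartPole DQN tutorials, would be something like the following; the concrete numbers (episodes, batch size) are only illustrative and are not taken from the original post.

import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Assumed setup -- older gym API, matching env.step() returning four values above.
env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]   # 4 observations for CartPole
action_size = env.action_space.n              # 2 discrete actions

agent = DQNAgent(state_size, action_size)
n_episodes = 1000   # illustrative value
batch_size = 32     # illustrative value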

If anyone could give me any suggestions/advice, it would be super appreciated! Very confused right now. Thank you!

To deal with the model's instability, you'd better convert the variables `state` and `next_state` into tensors before feeding them into the network. This is your code:

elif not done:
     target = reward + self.gamma*torch.max(self.model(next_state))

Before feeding the state data into the network, you should add this code:

     next_state = torch.tensor(next_state)

For the Q value evaluated from the state, you can change your code to:

target_f = self.model(torch.tensor(state, requires_grad=True))
loss = self.criterion(self.model(torch.tensor(state, requires_grad=True)), target_f)
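
Putting the suggestion together, below is a minimal sketch of how `replay()` might look with the tensor conversion applied. Two assumptions go beyond the original answer: `Net.forward` is assumed to accept a tensor directly (i.e. the `torch.from_numpy` call would be dropped), and the target is detached so that gradients only flow through the prediction.

def replay(self, batch_size):
    minibatch = random.sample(self.memory, batch_size)
    for state, action, reward, next_state, done in minibatch:
        # Convert the stored numpy arrays to tensors before they reach the network
        # (assumes forward() no longer calls torch.from_numpy).
        state_t = torch.tensor(state, dtype=torch.float32)
        next_state_t = torch.tensor(next_state, dtype=torch.float32)

        if done:
            target = reward
        else:
            # .item() turns the max Q-value into a plain float, so no gradient
            # flows through the bootstrap target.
            target = reward + self.gamma * torch.max(self.model(next_state_t)).item()

        # Copy the current prediction, detach it, and overwrite only the Q-value
        # of the action that was actually taken.
        target_f = self.model(state_t).detach().clone()
        target_f[0][action] = target

        loss = self.criterion(self.model(state_t), target_f)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay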
