
PyTorch modified DQN algorithm error "the derivative for ' ' is not implemented"

I'm trying to construct a new observation and apply it to a DQN. I'm using the PyTorch DQN algorithm with my observation code. It's not finished yet, so the data is not clear.

I changed some lines in the whole code, as follows.

import sys, math
import random as rd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

######################################
from collections import namedtuple
from itertools import count
from PIL import Image

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
import torchvision.transforms as T


Transition = namedtuple('Transition',
                    ('state', 'action', 'next_state', 'reward'))

class ReplayMemory(object):

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return rd.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


class DQN(nn.Module):

    def __init__(self):
        super(DQN, self).__init__()
        self.l1 = nn.Linear(5, 16)
        self.l2 = nn.Linear(16, 12)
        self.l3 = nn.Linear(12, 20)
        self.head = nn.Linear(20, 1)

    def forward(self, x):
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = F.relu(self.l3(x))
        return self.head(x.view(x.size(0), -1))

BATCH_SIZE = 5
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 5
policy_net = DQN()
target_net = DQN()
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.RMSprop(policy_net.parameters())
memory = ReplayMemory(10000)

def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see http://stackoverflow.com/a/19343/3343043 for
    # detailed explanation).
    batch = Transition(*zip(*transitions))
    print("batch = ", batch, "\n")
    # Compute a mask of non-final states and concatenate the batch elements

    state_batch = Variable(torch.cat(batch.state))
    print("state_batch = ", state_batch)
    action_batch = Variable(torch.cat(batch.action))
    print("action_batch = ", action_batch)
    reward_batch = Variable(torch.cat(batch.reward), requires_grad = False)
    print("reward_batch = ", reward_batch)
    next_state_batch = Variable(torch.cat(batch.next_state), requires_grad = False)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken
    state_action_values = policy_net(state_batch)
    print("state_action_values = ", state_action_values)

    # Compute V(s_{t+1}) for all next states.
    next_state_values = target_net(next_state_batch)
    print("next_state_values = ", next_state_values)

    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch
    print("expected next state values = ", expected_state_action_values)

    # Compute Huber loss
    loss = F.smooth_l1_loss(state_action_values, expected_state_action_values, reduce = False)

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()



num_episodes = 5
for i_episode in range(num_episodes):
    # Initialize the environment and state
    drive = AutoDrive(20, 20, 0, 16, 0) # x/y/yaw/velocity/heading
    drive._make_observation(0, -1, -1, -1, -1, -1)  # random other vehicle location, parameters

    stand = 3

    # exploit 1
    e = 1. / ((i_episode // 100) + 1)  # condition for choosing action
    optimizer.zero_grad()

    for t in range(stand):
        # Select and perform an action

        # exploit 2
        if np.random.rand(1) > e:
            action = rd.randint(1, 4)
        else:
            action = np.argmax(drive._select_action(0.5, 0.5)) + 1 #index + 1
        print("state = ", drive.state, ", action = ", action, ", yaw = ",     drive.yaw, ", mag = ", drive.mag)
    state = drive.state
    drive._step(action)
    drive._calc_reward(0.5, 0.5)
    print(drive.reward)
    if (drive.reward == -10):
        break


    # Store the transition in memory
    state1 = torch.FloatTensor(state).view(1, 5)
    state2 = torch.FloatTensor(drive.state).view(1, 5)
    action = torch.FloatTensor([float(action)]).view(1, 1)
    reward = torch.FloatTensor([drive.reward]).view(1, 1)
    memory.push(state1, action, state2, reward)

    # Perform one step of the optimization (on the target network)
    optimize_model()
    if done:
        episode_durations.append(t + 1)
        plot_durations()
        break
    # Update the target network
    if i_episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

The error occurs in the loss function:

  File "<ipython-input-190-29dcdbbf0383>", line 1, in <module>
    runfile('C:/Users/desktop/.spyder-py3/temp.py',     wdir='C:/Users/desktop/.spyder-py3')

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/desktop/.spyder-py3/temp.py", line 441, in <module>
optimize_model()

  File "C:/Users/desktop/.spyder-py3/temp.py", line 362, in optimize_model
loss = F.smooth_l1_loss(state_action_values, expected_state_action_values, reduce = False)

RuntimeError: the derivative for 'target' is not implemented

The inputs to the loss function are as follows:

expected next state values =  Variable containing:
  8.9615
 12.0198
 12.0488
 12.2920
 13.9062
[torch.FloatTensor of size 5x1]

state_action_values =  Variable containing:
 0.3765
 0.5196
 0.4587
 0.3765
 0.5636
[torch.FloatTensor of size 5x1]

What do I have to do? I'm really a beginner at this, so any helpful advice is welcome.

It may be because you are trying to backpropagate through a network on which you called .eval(). Instead, detach the target variable from its computational graph:

loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.detach(), reduce = False)

Now when you call .backward() on the loss, PyTorch will not try to compute derivatives with respect to the target network's parameters.
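
For reference, here is a minimal, self-contained sketch of the same fix, written against the current tensor API rather than the old Variable wrapper. The tiny linear networks and random batches below are made up purely to mirror the shapes in the question, and reduce = False is dropped so the loss is a scalar that .backward() can be called on directly:

import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.999
policy_net = nn.Linear(5, 1)   # stand-ins for the DQN networks in the question
target_net = nn.Linear(5, 1)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

state_batch = torch.randn(5, 5)        # fake data, only to make the sketch runnable
next_state_batch = torch.randn(5, 5)
reward_batch = torch.randn(5, 1)

state_action_values = policy_net(state_batch)      # tracked by autograd
next_state_values = target_net(next_state_batch)   # also tracked by autograd

expected_state_action_values = next_state_values * GAMMA + reward_batch

# .detach() cuts the target out of the graph, so .backward() only computes
# gradients for policy_net's parameters.
loss = F.smooth_l1_loss(state_action_values,
                        expected_state_action_values.detach())
loss.backward()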

Look at the loss function smooth_l1_loss(input, target): the second parameter, target, should be a tensor without grad, i.e. target.requires_grad should be False.

expected_state_action_values = (next_state_values * GAMMA) + reward_batch

I can see that in your code expected_state_action_values is calculated from next_state_values. But next_state_values = target_net(next_state_batch), so expected_state_action_values carries a grad attribute because next_state_values does. So you should use:

loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.detach(), reduce = False)

or:

loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.data, reduce = False)
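
As a quick way to see where that grad attribute comes from, here is a small sketch reusing the variable names from optimize_model (the printed values assume a reasonably recent PyTorch version):

# next_state_values came out of target_net, whose parameters require grad,
# so everything computed from it also requires grad:
print(next_state_values.requires_grad)                        # True
print(expected_state_action_values.requires_grad)             # True

# Either form below yields a target with requires_grad == False:
print(expected_state_action_values.detach().requires_grad)    # False
print(expected_state_action_values.data.requires_grad)        # False

Of the two, .detach() is generally the safer choice: .data sidesteps autograd's version tracking, so in-place modifications made through it can silently give wrong gradients.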
