
Why doesn't my DQN agent learn on Breakout-v0?

I used ChainerRL and tried Breakout-v0.

I ran the code below. It runs, but my agent can't get much reward (the score always stays under 5).

Python 2.7, Ubuntu 14.04

Please tell me why it doesn't learn.

I also can't understand where the number 972 in l5=L.Linear(972, 512) comes from (see the shape-probe sketch after the script below).

import chainer 
import chainer.functions as F
import chainer.links as L
import chainerrl
import gym
import numpy as np

from chainer import cuda

import datetime
from skimage.color import rgb2gray
from skimage.transform import resize

env = gym.make('Breakout-v0')
obs = env.reset()

print("observation space   : {}".format(env.observation_space))
print("action space        : {}".format(env.action_space))

action = env.action_space.sample()
obs, r, done, info = env.step(action)
class QFunction(chainer.Chain):
    def __init__(self, obs_size, n_action):
        super(QFunction, self).__init__(
            l1=L.Convolution2D(obs_size, 4, ksize=2, pad=1),  # 210x160
            bn1=L.BatchNormalization(4),
            l2=L.Convolution2D(4, 4, ksize=2, pad=1),  # 105x80
            bn2=L.BatchNormalization(4),
            #l3=L.Convolution2D(64, 64, ksize=2, pad=1),  # 100x100
            #bn3=L.BatchNormalization(64),
            #l4=L.Convolution2D(64, 3, ksize=2, pad=1),  # 50x50
            #bn4=L.BatchNormalization(3),

            l5=L.Linear(972, 512),
            out=L.Linear(512, n_action, initialW=np.zeros((n_action, 512), dtype=np.float32))
        )

    def __call__(self, x, test=False):

        h1 = F.relu(self.bn1(self.l1(x)))
        h2 = F.max_pooling_2d(F.relu(self.bn2(self.l2(h1))), 2)
        #h3 = F.relu(self.bn3(self.l3(h2)))
        #h4 = F.max_pooling_2d(F.relu(self.bn4(self.l4(h3))), 2)
        #print h4.shape

        return chainerrl.action_value.DiscreteActionValue(self.out(self.l5(h2)))

n_action = env.action_space.n
obs_size = env.observation_space.shape[0] #(210,160,3)
q_func = QFunction(obs_size, n_action)

optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

gamma = 0.99

explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.2, random_action_func=env.action_space.sample)

replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

phi = lambda x: x.astype(np.float32, copy=False)
agent = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    minibatch_size=4, replay_start_size=100, update_interval=10,
    target_update_interval=10, phi=phi)

last_time = datetime.datetime.now()
n_episodes = 10000
for i in range(1, n_episodes + 1):
    obs = env.reset()

    reward = 0
    done = False
    R = 0

    while not done:
        env.render()
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)

        if reward != 0:
            R += reward

    elapsed_time = datetime.datetime.now() - last_time
    print('episode:', i,
          'reward:', R,
          )
    last_time = datetime.datetime.now()

    if i % 100 == 0:
        filename = 'agent_Breakout' + str(i)
        agent.save(filename)

    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')
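
For reference, the 972 in l5=L.Linear(972, 512) is just the number of elements in the flattened feature map that reaches the linear layer. Since phi only casts the observation to float32 and ChainerRL adds a batch axis, the first convolution sees a (1, 210, 160, 3) array, i.e. 210 "channels" of 160x3 images. A minimal shape probe under that assumption (not part of the original script):

# Hypothetical shape probe (not part of the original script): run a dummy
# observation through the same convolution/pooling layers to see where the
# 972 comes from. BatchNormalization is omitted because it does not change shapes.
import numpy as np
import chainer.functions as F
import chainer.links as L

x = np.zeros((1, 210, 160, 3), dtype=np.float32)  # one raw Breakout frame plus batch axis
l1 = L.Convolution2D(210, 4, ksize=2, pad=1)
l2 = L.Convolution2D(4, 4, ksize=2, pad=1)

h = F.relu(l1(x))                        # shape (1, 4, 161, 4)
h = F.max_pooling_2d(F.relu(l2(h)), 2)   # shape (1, 4, 81, 3)
print(h.shape)                           # 4 * 81 * 3 = 972, the in_size of l5

This also shows that the three RGB channels end up being treated as image width, which suggests the observation layout is not what the network seems to intend.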

As the author of ChainerRL, if you want to tackle Atari environments, I recommend you start from examples/ale/train_*.py and customize it step by step. Deep reinforcement learning is really sensitive to changes in hyper-parameters and network architectures, and if you introduce a lot of changes at a time, it would be hard to tell which change is responsible for a failure of training.
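
One concrete difference from the ALE examples is frame preprocessing: the script imports rgb2gray and resize but never uses them, so the network receives raw 210x160x3 frames. Standard DQN-style Atari setups instead convert each frame to grayscale and downscale it (typically to 84x84), often stacking the last few frames as channels. A rough illustration of that kind of phi, as a sketch rather than the exact code in examples/ale:

# Rough illustration of DQN-style frame preprocessing (not the exact code
# used in examples/ale/train_*.py): grayscale + resize to 84x84, values in [0, 1].
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

def phi(obs):
    # obs: raw RGB frame of shape (210, 160, 3)
    gray = rgb2gray(obs)            # (210, 160), float in [0, 1]
    small = resize(gray, (84, 84))  # (84, 84)
    return small[np.newaxis].astype(np.float32)  # (1, 84, 84): one channel

With a phi like this, the Q-function's first convolution would take 1 input channel (or 4, if recent frames are stacked) rather than obs_size.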

I tried running your script while printing statistics via agent.get_statistics() and found Q values were getting too high, which indicates training didn't go well.
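
That check can be reproduced by printing the agent's statistics at the end of each episode in the training loop above, for example by adding a line like the following (not shown in the original script):

# Added at the end of each episode in the loop above: print the agent's
# running statistics (average Q value and average loss).
print(agent.get_statistics())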

$ python yourscript.py
[2017-07-10 18:14:45,309] Making new env: Breakout-v0
observation space   : Box(210, 160, 3)
action space        : Discrete(6)
episode: 1 reward: 0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 2 reward: 1.0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 3 reward: 0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 4 reward: 0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 5 reward: 2.0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 6 reward: 0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 7 reward: 1.0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 8 reward: 2.0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 9 reward: 1.0
[('average_q', 0.0), ('average_loss', 0.0)]
episode: 10 reward: 2.0
[('average_q', 0.05082079044988309), ('average_loss', 0.0028927958279822935)]
episode: 11 reward: 4.0
[('average_q', 7.09331367665307), ('average_loss', 0.0706595716528489)]
episode: 12 reward: 0
[('average_q', 17.418094266218915), ('average_loss', 0.251431955409951)]
episode: 13 reward: 1.0
[('average_q', 40.903169833428954), ('average_loss', 1.0959175910071859)]
episode: 14 reward: 2.0
[('average_q', 115.25579476118122), ('average_loss', 2.513677824600575)]
episode: 15 reward: 2.0
[('average_q', 258.7392539556941), ('average_loss', 6.20968827451279)]
episode: 16 reward: 1.0
[('average_q', 569.6735852049942), ('average_loss', 19.295426012437833)]
episode: 17 reward: 4.0
[('average_q', 1403.8461185742353), ('average_loss', 32.6092646561004)]
episode: 18 reward: 1.0
[('average_q', 2138.438909199657), ('average_loss', 44.90832410172697)]
episode: 19 reward: 1.0
[('average_q', 3112.752923036582), ('average_loss', 88.50687458947431)]
episode: 20 reward: 1.0
[('average_q', 4138.601621651058), ('average_loss', 106.09160137599618)]
