Why doesn't my Deep Q Network master a simple Gridworld (Tensorflow)? (How to evaluate a Deep-Q-Net)

I am trying to familiarize myself with Q-learning and deep neural networks, and am currently trying to implement Playing Atari with Deep Reinforcement Learning.

To test my implementation and play around with it, I thought I'd try a simple gridworld, where I have an N x N grid, start in the top left corner and finish at the bottom right. The possible actions are: left, up, right, down.

Even though my implementation has become very similar to this (hope it's a good one), it doesn't seem to learn anything. Looking at the total steps it needs to finish (I guess the average would be around 500 with a grid size of 10x10, but there are also very low and very high values), it seems more random than anything else to me.
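To put that guess in context, here is a quick, purely illustrative Monte Carlo estimate of how many steps a completely random policy needs on the same 10x10 grid (it reuses the movement rules from the training code below; the random_walk_steps helper is just for this sketch, not part of the implementation):

import random

def random_walk_steps(n=10, max_steps=100000):
    # steps a purely random policy needs from the top-left cell (index 0)
    # to the bottom-right cell (index n*n-1), same move rules as the code below
    pos = 0
    goal = n * n - 1
    steps = 0
    while pos != goal and steps < max_steps:
        action = random.randint(0, 4)            # left, up, right, down, noop
        if action == 0 and pos % n != 0:         # left, unless in first column
            pos -= 1
        elif action == 1 and pos >= n:           # up, unless in first row
            pos -= n
        elif action == 2 and pos % n != n - 1:   # right, unless in last column
            pos += 1
        elif action == 3 and pos < n * (n - 1):  # down, unless in last row
            pos += n
        steps += 1
    return steps

# average over a few hundred runs to get a rough random baseline
trials = [random_walk_steps() for _ in range(500)]
print(sum(trials) / float(len(trials)))

If the learned policy is not clearly better than this baseline, it has effectively learned nothing.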

I tried it with and without convolutional layers and played around with all the parameters, but to be honest, I have no idea if something in my implementation is wrong, if it just needs to train longer (I let it train for quite some time), or whatever. But at least it seems to converge; here is the plot of the loss value of one training session:

[plot of the loss value over one training session]

So what is the problem in this case?

But also, and maybe more importantly: how can I "debug" these Deep-Q-Nets? In supervised training there are training, test and validation sets, and with precision and recall, for example, it is possible to evaluate them. What options do I have for unsupervised learning with Deep-Q-Nets, so that next time maybe I can fix it myself?

Finally, here is the code:

This is the network:

ACTIONS = 5

# Inputs
x = tf.placeholder('float', shape=[None, 10, 10, 4])
y = tf.placeholder('float', shape=[None])
a = tf.placeholder('float', shape=[None, ACTIONS])

# Layer 1 Conv1 - input
with tf.name_scope('Layer1'):
    W_conv1 = weight_variable([8,8,4,8])
    b_conv1 = bias_variable([8])    
    h_conv1 = tf.nn.relu(conv2d(x, W_conv1, 5)+b_conv1)

# Layer 2 Conv2 - hidden1 
with tf.name_scope('Layer2'):
    W_conv2 = weight_variable([2,2,8,8])
    b_conv2 = bias_variable([8])
    h_conv2 = tf.nn.relu(conv2d(h_conv1, W_conv2, 1)+b_conv2)
    h_conv2_max_pool = max_pool_2x2(h_conv2)

# Layer 3 fc1 - hidden 2
with tf.name_scope('Layer3'):
    W_fc1 = weight_variable([8, 32])
    b_fc1 = bias_variable([32])
    h_conv2_flat = tf.reshape(h_conv2_max_pool, [-1, 8])
    h_fc1 = tf.nn.relu(tf.matmul(h_conv2_flat, W_fc1)+b_fc1)

# Layer 4 fc2 - readout
with tf.name_scope('Layer4'):
    W_fc2 = weight_variable([32, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])
    readout = tf.matmul(h_fc1, W_fc2)+ b_fc2

# Training
with tf.name_scope('training'):
    readout_action = tf.reduce_sum(tf.mul(readout, a), reduction_indices=1)
    loss = tf.reduce_mean(tf.square(y - readout_action))
    train = tf.train.AdamOptimizer(1e-6).minimize(loss)

    loss_summ = tf.scalar_summary('loss', loss)
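The helper functions used above (weight_variable, bias_variable, conv2d, max_pool_2x2) and the session setup are not shown in the snippets. A minimal sketch of what they presumably look like, in the style of the old TensorFlow tutorials (this is an assumption, not the original code), would sit above the network definition:

import random
from collections import deque

import numpy as np
import tensorflow as tf

def weight_variable(shape):
    # small random initial weights
    return tf.Variable(tf.truncated_normal(shape, stddev=0.01))

def bias_variable(shape):
    # small constant initial bias
    return tf.Variable(tf.constant(0.01, shape=shape))

def conv2d(x, W, stride):
    # 2D convolution with the given stride and SAME padding
    return tf.nn.conv2d(x, W, strides=[1, stride, stride, 1], padding='SAME')

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                          padding='SAME')

# readout.eval(...) and train.run(...) in the training loop need a default
# session, e.g. an InteractiveSession:
sess = tf.InteractiveSession()
# after the graph has been built:
# sess.run(tf.initialize_all_variables())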

And here is the training:

# 0 => left
# 1 => up
# 2 => right
# 3 => down
# 4 => noop

ACTIONS = 5
GAMMA = 0.95
BATCH = 50
TRANSITIONS = 2000
OBSERVATIONS = 1000
MAXSTEPS = 1000

D = deque()
epsilon = 1

average = 0
for episode in xrange(1000):
    step_count = 0
    game_ended = False

    state = np.array([0.0]*100, float).reshape(100)
    state[0] = 1

    rsh_state = state.reshape(10,10)
    s = np.stack((rsh_state, rsh_state, rsh_state, rsh_state), axis=2)

    while step_count < MAXSTEPS and not game_ended:
        reward = 0
        step_count += 1

        read = readout.eval(feed_dict={x: [s]})[0]

        act = np.zeros(ACTIONS)
        action = random.randint(0,4)
        if len(D) > OBSERVATIONS and random.random() > epsilon:
            action = np.argmax(read)
        act[action] = 1

        # play the game
        pos_idx = state.argmax(axis=0)
        pos = pos_idx + 1

        state[pos_idx] = 0
        if action == 0 and pos%10 != 1: #left
            state[pos_idx-1] = 1
        elif action == 1 and pos > 10: #up
            state[pos_idx-10] = 1
        elif action == 2 and pos%10 != 0: #right
            state[pos_idx+1] = 1
        elif action == 3 and pos < 91: #down
            state[pos_idx+10] = 1
        else: #noop
            state[pos_idx] = 1
            pass

        if state.argmax(axis=0) == pos_idx and reward > 0:
            reward -= 0.0001

        if step_count == MAXSTEPS:
            reward -= 100
        elif state[99] == 1: # reward & finished
            reward += 100
            game_ended = True
        else:
            reward -= 1


        s_old = np.copy(s)
        s = np.append(s[:,:,1:], state.reshape(10,10,1), axis=2)

        D.append((s_old, act, reward, s))
        if len(D) > TRANSITIONS:
            D.popleft()

        if len(D) > OBSERVATIONS:
            minibatch = random.sample(D, BATCH)

            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            readout_j1_batch = readout.eval(feed_dict={x:s_j1_batch})
            y_batch = []

            for i in xrange(0, len(minibatch)):
                y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            train.run(feed_dict={x: s_j_batch, y: y_batch, a: a_batch})

        if epsilon > 0.05:
            epsilon -= 0.01

I appreciate any help and ideas you may have!

For those interested: I adjusted the parameters and the model further, but the biggest improvement was switching to a simple feed-forward network with three layers and about 50 neurons in the hidden layer. For me it then converged in a pretty decent time.
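A minimal sketch of such a feed-forward Q-network, in the same old-TensorFlow style as the code above, might look like the following (the exact sizes are assumptions for illustration: 100 inputs for the flattened 10x10 grid, 50 hidden units, ACTIONS outputs; weight_variable and bias_variable are the helpers assumed earlier):

# hypothetical 3-layer feed-forward readout: input -> hidden -> Q-values
x_flat = tf.placeholder('float', shape=[None, 100])    # flattened 10x10 grid

W_h = weight_variable([100, 50])
b_h = bias_variable([50])
h = tf.nn.relu(tf.matmul(x_flat, W_h) + b_h)

W_out = weight_variable([50, ACTIONS])
b_out = bias_variable([ACTIONS])
readout = tf.matmul(h, W_out) + b_out                   # one Q-value per action

The training loop could stay essentially the same, with the state fed in as a flat 100-element vector instead of a stacked 10x10x4 image.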

Btw, further tips for debugging are appreciated!

It's been quite a while since I wrote this question, but since there still seems to be quite some interest in (and requests for) the running code, I finally decided to create a GitHub repository.

Since I wrote it quite a while ago, it won't run out of the box, but it shouldn't be that hard to get it running. So here are the deep Q network and the example I wrote at the time, which worked back then; hope you enjoy: Link to deep q repository

It would be nice to see some contributions, and if you fix it and get it running, make a pull request!

I have implemented a simple toy DQN without CNN layers, and it works. Here are some findings from my implementation; I hope they will help.

  1. According to DeepMind's paper, they didn't use a max pooling layer; the reason is that the image would become position invariant, which is not good for the game. The position of the agent is crucial information for the game. DQN Architecture

  2. If you want to skip the CNN, first use a gym environment (like what I have done for the toy implementation). During my development, here are a couple of things I found:

    • Encode your state of the environment with one-hot encoding; it will increase the training efficiency.
    • I only use a matrix of weights with shape [number of states, number of actions] to do the matrix multiplication with the input one-hot encoded state. No bias, no activation function (adding them would increase training time, I assume; it never worked after I added another layer or anything). A minimal sketch of this setup is shown after this list.

These are two things I found extremely crucial for my implementation to work. I do not fully understand the reasons behind them, but I hope my answer can give you a little bit of insight.
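A minimal sketch of the setup described in point 2 (a one-hot state vector multiplied by a single [number of states, number of actions] weight matrix, with no bias and no activation) might look like this; sizes and names are illustrative, not the actual toy implementation:

import numpy as np
import tensorflow as tf

N_STATES = 100    # e.g. one state per cell of a 10x10 gridworld
N_ACTIONS = 5

# one-hot encoded state in, one Q-value per action out
state_in = tf.placeholder('float', shape=[None, N_STATES])
W = tf.Variable(tf.random_uniform([N_STATES, N_ACTIONS], 0, 0.01))
q_values = tf.matmul(state_in, W)    # no bias, no activation

def one_hot(state_index):
    # encode a state index as a one-hot row vector
    v = np.zeros((1, N_STATES), dtype=np.float32)
    v[0, state_index] = 1.0
    return v

With a one-hot input, each row of W is effectively the Q-value row for one state, so this linear "network" behaves like a learned Q-table, which is why it can converge quickly on a small gridworld.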
