

How do you create a Deep Q-Learning neural network to solve simple games like snake?

I have been working for the last four days to try and create a simple working neural network (NN) that learns. I started off with the Tower of Hanoi, but that was quite tricky (doable with a Q-table) and no one really has good examples online, so I decided to do it for the snake game instead, where there are lots of examples and tutorials. Long story short, I have made a new super simple game where you have [0,0,0,0] and, by picking 0, 1, 2, or 3, you change a 0 to a 1 or vice versa. So picking 1 would give an output of [0,1,0,0], and picking 1 again goes back to [0,0,0,0]. Very easy.
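To make the game concrete, here is a tiny standalone sketch of it in plain Python (the step helper is just for illustration; it is not part of my actual code below):

    def step(states, action):
        ## flip the chosen position: a 0 becomes a 1 and a 1 becomes a 0
        new_states = list(states)
        new_states[action] = 1 - new_states[action]
        return new_states

    s = step([0, 0, 0, 0], 1)   ## -> [0, 1, 0, 0]
    s = step(s, 1)              ## -> [0, 0, 0, 0] again
    ## picking each of 0, 1, 2 and 3 once from the start reaches the goal state [1, 1, 1, 1]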

Despite the game being very easy, I'm very much struggling to go from concept to practice, as I have no education in coding.

The final goal right now is to get the code below to be able to complete the game more than once. (It has currently run about 600 times and has not once completed the 4-step problem.)

The current network architecture is 4 inputs, 4 nodes in one hidden layer, and 4 outputs, and I would like to keep it this way even if the hidden layer is redundant, just so I can learn how to do it correctly for other problems.

If you can't be bothered to read the code, and I don't blame you, I'll put my mental pseudocode here:

  1. set up variables and placeholders, and import libraries
  2. run the game 200 times to give it a chance to learn; each run has 20 turns
  3. run the NN with "states" as the input and keep the result, defined as "output", for future use
  4. game code
  5. the new reward for this specific game would just be the new set of states (it has just occurred to me that this is the wrong way round, since a state of [0,1,0,0] should have rewards [1,0,1,1], but I have already tried flipping it and it still didn't work, so that isn't the issue)
  6. my thinking was that I could get the next Q value by just running the new states through the NN
  7. this equation is taken directly from any deep Q-learning tutorial on the internet, and I think that maybe I have got it, or one of its components, wrong (the formula I mean is written out just after this list)
  8. run the gradient descent optimisation function
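For reference, the equation I mean in step 7 is the standard one-step Q-learning target, where r is the reward for the move, y is the discount factor and Q(next_state) are the network's outputs for the new state:

    target = r + y * max(Q(next_state))

In my code, r is reward, the discount factor is y, and np.max(Qout1) plays the part of max(Q(next_state)). My full attempt is below: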
import tensorflow as tf             ## importing libraries
import random
import numpy as np

epsilon = 0.1                       ## create non tf variables
y = 0.4
memory = []
memory1 = []

input_ = tf.placeholder(tf.float32, [None, 4], name='input_') 
W1 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W1') 
b1 = tf.Variable(tf.random_normal([4]), name='b1')    
hidden_out = tf.add(tf.matmul(input_, W1), b1, name='hidden_out')   ## W for weights
hidden_out = tf.nn.relu(hidden_out)                                 ## b for bias'

W2 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([4]), name='b2')
Qout = tf.add(tf.matmul(hidden_out, W2), b2, name='Qout')
sig_out = tf.sigmoid(Qout, name='out')


Q_target = tf.placeholder(shape=(None,4), dtype=tf.float32)
loss = tf.reduce_sum(tf.square(Q_target - Qout))
optimiser = tf.train.GradientDescentOptimizer(learning_rate=y).minimize(loss)

init_op = tf.global_variables_initializer()

with tf.compat.v1.Session() as sess:
    sess.run(init_op)
    for epoch in range(200):         ## run game 200 times
        states = [0,0,0,0]
        for _ in range(20):          ## 20 turns to do the correct 4 moves
            if _ == 19:
                memory1.append(states)
            output = np.argmax(sess.run(sig_out, feed_dict={input_: [states]}))
            ## sig_out is the output put through a sigmoid function
            if random.random() < epsilon:       ## this is the code for the game 
                output = random.randint(0,3)    ## ...
            if states[output] == 0:             ## ...
                states[output] = 1              ## ...
            else:                               ## ...
                states[output] = 0              ## ...
            reward = states     
            Qout1 = sess.run(sig_out, feed_dict={input_: [states]})
            target = [reward + y*np.max(Qout1)]
            sess.run([optimiser,loss], feed_dict={input_: [states], Q_target: target})

I haven't got any error messages in a while with this; the actual result would ideally be [1,1,1,1] every time.

Thanks in advance for all of your help.

PS: I couldn't think of an objective title for this, sorry.

The reward value should be the objective value after an action has been taken. In your case, you have set reward = states. Since your function is attempting to maximize reward, the closer your state gets to [1, 1, 1, 1], the more reward your AI should receive.

Perhaps a reward function such as reward = sum(states) will solve your problem.
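As a minimal sketch of how that could slot into your existing inner loop (the only real change is the reward line; because the reward is now a scalar, I also repeat the target value four times so it still fits the (None, 4) shape of your Q_target placeholder):

            reward = sum(states)                       ## scalar reward: number of 1s, at most 4
            Qout1 = sess.run(sig_out, feed_dict={input_: [states]})
            target_value = reward + y * np.max(Qout1)  ## one-step Q-learning target
            target = [[target_value] * 4]              ## shape (1, 4) to match Q_target
            sess.run([optimiser, loss], feed_dict={input_: [states], Q_target: target})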
