How do you create a deep Q-learning neural network to solve simple games like Snake?

Question

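For the last four days I have been trying to create a simple neural network (NN) that actually learns. I started with the Towers of Hanoi, but that was pretty tricky (it can be done with a Q-table) and nobody online really has a good example, so I decided to do it for the Snake game instead, since there are lots of examples and tutorials for that. Long story short, I have made a new, super simple game: you have [0,0,0,0], and by picking 0, 1, 2 or 3 you can change a 0 to a 1 or vice versa. So picking 1 gives [0,1,0,0], and picking 1 again returns to [0,0,0,0]. Pretty easy.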
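The toy game described above can be sketched in a few lines of plain Python. The function names `toggle` and `is_solved` are hypothetical and do not appear in the question's code; this is only a minimal illustration of the rules, assuming the game counts as finished when the state reaches [1,1,1,1].

```python
def toggle(state, action):
    """Return a new state with the bit at index `action` flipped (0 <-> 1)."""
    new_state = list(state)                     # copy so the caller's list is not mutated
    new_state[action] = 1 - new_state[action]   # 0 becomes 1, 1 becomes 0
    return new_state


def is_solved(state):
    """Assumed goal: the game is complete when every position is 1."""
    return all(bit == 1 for bit in state)


state = [0, 0, 0, 0]
after_one = toggle(state, 1)       # [0, 1, 0, 0]
after_two = toggle(after_one, 1)   # back to [0, 0, 0, 0]
```

Returning a copy rather than mutating the list in place keeps the state before the move and the state after it as separate objects, which matters once both are needed for a learning update.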

Despite the game being very simple, I am still having a hard time going from concept to practice, since I have no background in coding.

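The end goal right now is to get the code below to complete the game more than once. (It has currently run about 600 times and has not completed the 4-step problem once.)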

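The current network architecture is 4 inputs, 4 nodes in the first hidden layer, and 4 outputs. Even if the hidden layer is redundant, I would like to keep it this way so that I can learn how to do it correctly for other problems.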
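To make the 4-4-4 shape concrete, here is a rough NumPy sketch of a single forward pass through such a network. It is not the question's TensorFlow code: the helper name `q_values`, the fixed random seed, and the zero-initialized biases are assumptions made for illustration (the question's code draws its biases from a normal distribution).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility

# 4 inputs -> 4 hidden units (ReLU) -> 4 outputs (one value per action)
W1 = rng.normal(0.0, 0.03, size=(4, 4))
b1 = np.zeros(4)                # assumption: zero biases
W2 = rng.normal(0.0, 0.03, size=(4, 4))
b2 = np.zeros(4)


def q_values(state):
    """Return one estimated value per action for the given 4-element state."""
    x = np.asarray(state, dtype=np.float64).reshape(1, 4)
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden layer
    return (hidden @ W2 + b2)[0]            # linear output layer


q = q_values([0, 1, 0, 0])
action = int(np.argmax(q))      # greedy choice: index of the largest value
```

The output layer is left linear here, so the four numbers can be read directly as action-value estimates without being squashed into the range (0, 1).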

If you can't be bothered to read the code, and I don't blame you, here is my mental pseudocode:

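Set up variables and placeholders, and import libraries
Run the program 200 times to give it a chance to learn, with 20 turns per run
Run the NN with "states" as the input and get the output, stored as "output", for later use
Game code
The new reward for this particular game is simply the new set of states (I have just realized this is the wrong way round: the state [0,1,0,0] should get the reward [1,0,1,1]. I have already tried flipping it and it still did not work, so that is not the issue)
My thinking is that I can get the next Q-values simply by running the new state through the NN
This equation is taken straight from any deep Q-learning tutorial on the internet, and I think I may have misunderstood it or one of its terms
Run the gradient descent optimizer function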
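For reference, the update target that deep Q-learning tutorials usually give, and that the pseudocode above appears to refer to, can be written as follows. The symbols are assumptions introduced here, not names from the question: s is the state before the move, a the action taken, r a scalar reward, s' the state after the move, and \gamma the discount factor.

```latex
Q_{\text{target}}(s, a) = r + \gamma \max_{a'} Q(s', a')
```

In this form r is a single number, and the target is defined only for the action a that was actually taken in state s.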

import tensorflow as tf             ## importing libraries
import random
import numpy as np

epsilon = 0.1                       ## create non tf variables
y = 0.4
memory = []
memory1 = []

input_ = tf.placeholder(tf.float32, [None, 4], name='input_')
W1 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W1')  ## W for weights
b1 = tf.Variable(tf.random_normal([4]), name='b1')                  ## b for biases
hidden_out = tf.add(tf.matmul(input_, W1), b1, name='hidden_out')
hidden_out = tf.nn.relu(hidden_out)

W2 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([4]), name='b2')
Qout = tf.add(tf.matmul(hidden_out, W2), b2, name='Qout')
sig_out = tf.sigmoid(Qout, name='out')


Q_target = tf.placeholder(shape=(None,4), dtype=tf.float32)
loss = tf.reduce_sum(tf.square(Q_target - Qout))
optimiser = tf.train.GradientDescentOptimizer(learning_rate=y).minimize(loss)

init_op = tf.global_variables_initializer()

with tf.compat.v1.Session() as sess:
    sess.run(init_op)
    for epoch in range(200):         ## run game 200 times
        states = [0,0,0,0]
        for _ in range(20):          ## 20 turns to do the correct 4 moves
            if _ == 19:
                memory1.append(states)
            output = np.argmax(sess.run(sig_out, feed_dict={input_: [states]}))
            ## sig_out is the output put through a sigmoid function
            if random.random() < epsilon:       ## this is the code for the game 
                output = random.randint(0,3)    ## ...
            if states[output] == 0:             ## ...
                states[output] = 1              ## ...
            else:                               ## ...
                states[output] = 0              ## ...
            reward = states     
            Qout1 = sess.run(sig_out, feed_dict={input_: [states]})
            target = [reward + y*np.max(Qout1)]
            sess.run([optimiser,loss], feed_dict={input_: [states], Q_target: target})

I haven't had any error messages in a while. Ideally, the actual result would be [1,1,1,1] every time.

Thanks in advance for all your help.

P.S. I couldn't think of an objective title for this, sorry.

Answer 1

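The reward should be the objective value after taking an action. In your case, you have set reward = states. Since your function is trying to maximize the reward, the closer your state gets to [1, 1, 1, 1], the more reward your AI should receive.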

Perhaps a reward function such as reward = sum(states) would solve your problem.
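As a rough sketch of how a scalar reward like this might feed into the training target, the NumPy snippet below builds a target vector by copying the network's current predictions and overwriting only the entry for the action that was taken. That wiring is a common pattern in Q-learning tutorials and an assumption here, not something stated in this answer; `scalar_reward`, `build_target`, and the example numbers are hypothetical, and GAMMA = 0.4 simply mirrors y = 0.4 from the question's code.

```python
import numpy as np

GAMMA = 0.4  # discount factor; mirrors y = 0.4 in the question's code


def scalar_reward(state):
    """Reward as suggested in the answer: the number of 1s in the state."""
    return sum(state)


def build_target(q_current, action, reward, q_next, gamma=GAMMA):
    """Copy the current predictions and replace only the taken action's value."""
    target = np.array(q_current, dtype=np.float32)
    target[action] = reward + gamma * np.max(q_next)
    return target


new_state = [0, 1, 0, 0]                  # state after choosing action 1
reward = scalar_reward(new_state)         # one bit is set, so the reward is 1
q_current = np.zeros(4)                   # made-up predictions for the old state
q_next = np.array([0.5, 0.2, 0.1, 0.0])   # made-up predictions for the new state
target = build_target(q_current, action=1, reward=reward, q_next=q_next)
```

With a target built this way, the squared error is nonzero only for the chosen action, so the other three outputs are not pushed toward an arbitrary value.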

Solution 1
0 Accepted 2019-09-05 22:05:08
