
How do you create a Deep Q-Learning neural network to solve simple games like Snake?

I have been working for the last four days to try and create a simple working neural network (NN) that learns. I started off with the Tower of Hanoi, but that was quite tricky (doable with a Q-table) and no one has really got good examples of it online, so I decided to try the snake game instead, where there are lots of examples and tutorials. Long story short, I have made a new, super simple game where you have [0,0,0,0] and by picking 0, 1, 2, or 3 you change a 0 to a 1 or vice versa. So picking 1 would give an output of [0,1,0,0], and picking 1 again goes back to [0,0,0,0]. Very easy.
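To make that concrete, the whole game is just flipping one position per turn. A rough sketch of the step logic (the flip_bit name is purely for illustration, it isn't in my actual code below):

def flip_bit(states, action):
    ## toggle the chosen position between 0 and 1
    states[action] = 1 - states[action]
    return states

states = [0, 0, 0, 0]
states = flip_bit(states, 1)   ## -> [0, 1, 0, 0]
states = flip_bit(states, 1)   ## -> back to [0, 0, 0, 0]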

Despite the game being very easy, I'm really struggling to go from concept to practice, as I have no education in coding.

The final goal right now is to get the code below to complete the game more than once (it has currently run about 600 times and has not once completed the four-step problem).

The current network architecture is 4 inputs, 4 nodes in one hidden layer, and 4 outputs. I would like to keep it this way, even if the hidden layer is redundant, just so I can learn how to do it correctly for other problems.

If you can't be bothered to read the code (and I don't blame you), I'll put my mental pseudocode here:

  1. set up variables and placeholders, and import libraries
  2. run the program 200 times to give it a chance to learn, with 20 turns per run
  3. run the NN with "states" as the input and save the output, defined as "output", for future use
  4. game code
  5. the reward for this specific game would just be the new set of states (it has just occurred to me that this is the wrong way round: [0,1,0,0] for states should have rewards [1,0,1,1], but I have already tried flipping it and it still didn't work, so that isn't the issue)
  6. my thinking was that I could get the next Q value by just running the new states through the NN
  7. this equation is taken directly from any deep Q-learning tutorial on the internet, and I think that maybe I have got it, or one of its components, wrong (see the formula written out just after this list)
  8. run the gradient descent optimisation function
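As I understand it from those tutorials, the standard Q-learning target only changes the value of the action that was actually taken, something like:

target[action] = reward + y * max(Q(new_states))

where y is the discount factor and Q(new_states) is the network's output for the state after the move.
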
import tensorflow as tf             ## importing libraries
import random
import numpy as np

epsilon = 0.1                       ## create non tf variables
y = 0.4
memory = []
memory1 = []

input_ = tf.placeholder(tf.float32, [None, 4], name='input_') 
W1 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W1') 
b1 = tf.Variable(tf.random_normal([4]), name='b1')    
hidden_out = tf.add(tf.matmul(input_, W1), b1, name='hidden_out')   ## W for weights
hidden_out = tf.nn.relu(hidden_out)                                 ## b for biases

W2 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([4]), name='b2')
Qout = tf.add(tf.matmul(hidden_out, W2), b2, name='Qout')
sig_out = tf.sigmoid(Qout, name='out')


Q_target = tf.placeholder(shape=(None,4), dtype=tf.float32)                     ## target Q-values fed in each training step
loss = tf.reduce_sum(tf.square(Q_target - Qout))                                 ## sum of squared errors against the raw Q output
optimiser = tf.train.GradientDescentOptimizer(learning_rate=y).minimize(loss)    ## note: y (the discount factor) is reused as the learning rate

init_op = tf.global_variables_initializer()

with tf.compat.v1.Session() as sess:
    sess.run(init_op)
    for epoch in range(200):         ## run game 200 times
        states = [0,0,0,0]
        for _ in range(20):          ## 20 turns to do the correct 4 moves
            if _ == 19:
                memory1.append(states)
            output = np.argmax(sess.run(sig_out, feed_dict={input_: [states]}))
            ## sig_out is the output put through a sigmoid function
            if random.random() < epsilon:       ## this is the code for the game 
                output = random.randint(0,3)    ## ...
            if states[output] == 0:             ## ...
                states[output] = 1              ## ...
            else:                               ## ...
                states[output] = 0              ## ...
            reward = states                                             ## reward is just set to the new state itself
            Qout1 = sess.run(sig_out, feed_dict={input_: [states]})     ## Q-values of the new state
            target = [reward + y*np.max(Qout1)]                         ## Bellman-style target applied to all four outputs at once
            sess.run([optimiser,loss], feed_dict={input_: [states], Q_target: target})   ## one gradient descent step

I haven't got any error messages in a while with this; ideally the result would be [1,1,1,1] every time.

Thanks in advance for all of your help

PS: I couldn't think of a better title for this, sorry.

The reward value should be the objective value after an action has been taken. In your case, you have set reward = states. Since your function is attempting to maximize reward, the closer your state gets to [1, 1, 1, 1], the more reward your AI should receive.

Perhaps a reward function such as reward = sum(states) will solve your problem.
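As a rough sketch of how that could slot into your inner loop (keeping your graph and your y as the discount factor, and only adjusting the Q-value of the action that was actually taken; the old_states name is just illustrative):

old_states = list(states)                                   ## copy of the state before the move
action = output                                             ## the move chosen above (after the epsilon check)
if states[action] == 0:                                     ## same game logic as before
    states[action] = 1
else:
    states[action] = 0
reward = sum(states)                                        ## scalar reward: number of 1s reached
Qout1 = sess.run(Qout, feed_dict={input_: [states]})        ## Q-values of the new state
target = sess.run(Qout, feed_dict={input_: [old_states]})   ## current predictions for the old state
target[0][action] = reward + y * np.max(Qout1)              ## overwrite only the chosen action's value
sess.run([optimiser, loss], feed_dict={input_: [old_states], Q_target: target})

That way the loss only pushes the prediction for the chosen action towards reward + y * max(Q(next state)), rather than comparing all four outputs against the state vector itself (and it compares against Qout, the same tensor your loss is built on).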
