I have been working for the last four days trying to create a simple neural network (NN) that learns. I started off with the Tower of Hanoi, but that was quite tricky (doable with a Q-table), and no one really has good examples of it online, so I decided to try the Snake game instead, where there are lots of examples and tutorials. Long story short, I have made a new, super-simple game: you start with [0,0,0,0], and by picking 0, 1, 2, or 3 you change a 0 to a 1 or vice versa. So picking 1 would give an output of [0,1,0,0], and picking 1 again goes back to [0,0,0,0]. Very easy.
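To make the rules concrete, here is a minimal sketch of the game as described above (the step helper name is my own, not from the code below):

```python
def step(state, action):
    """Flip the bit at index `action` (0-3); the goal state is [1, 1, 1, 1]."""
    state = list(state)              # copy so the caller's list is untouched
    state[action] = 1 - state[action]
    return state

state = [0, 0, 0, 0]
state = step(state, 1)   # [0, 1, 0, 0]
state = step(state, 1)   # back to [0, 0, 0, 0]
```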
Despite the game being very easy, I'm very much struggling to go from concept to practice, as I have no education in coding.
The goal right now is to get the code below to complete the game more than once (it has currently run about 600 times and not once completed the 4-step problem).
The current network architecture is 4 inputs, 4 nodes in one hidden layer, and 4 outputs. I would like to keep it this way, even if the hidden layer is redundant, just so I can learn how to do it correctly for other problems.
If you can't be bothered to read the code (and I don't blame you), I'll put my mental pseudocode here:
import tensorflow as tf ## importing libraries
import random
import numpy as np
epsilon = 0.1 ## create non tf variables
y = 0.4
memory = []
memory1 = []
input_ = tf.placeholder(tf.float32, [None, 4], name='input_')
W1 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W1')
b1 = tf.Variable(tf.random_normal([4]), name='b1')
hidden_out = tf.add(tf.matmul(input_, W1), b1, name='hidden_out') ## W for weights
hidden_out = tf.nn.relu(hidden_out) ## b for biases
W2 = tf.Variable(tf.random_normal([4, 4], stddev=0.03), name='W2')
b2 = tf.Variable(tf.random_normal([4]), name='b2')
Qout = tf.add(tf.matmul(hidden_out, W2), b2, name='Qout')
sig_out = tf.sigmoid(Qout, name='out')
Q_target = tf.placeholder(shape=(None,4), dtype=tf.float32)
loss = tf.reduce_sum(tf.square(Q_target - Qout))
optimiser = tf.train.GradientDescentOptimizer(learning_rate=y).minimize(loss)
init_op = tf.global_variables_initializer()
with tf.compat.v1.Session() as sess:
    sess.run(init_op)
    for epoch in range(200): ## run the game 200 times
        states = [0, 0, 0, 0]
        for _ in range(20): ## 20 turns to do the correct 4 moves
            if _ == 19:
                memory1.append(states)
            output = np.argmax(sess.run(sig_out, feed_dict={input_: [states]}))
            ## sig_out is the output put through a sigmoid function
            if random.random() < epsilon: ## this is the code for the game
                output = random.randint(0, 3)
            if states[output] == 0:
                states[output] = 1
            else:
                states[output] = 0
            reward = states
            Qout1 = sess.run(sig_out, feed_dict={input_: [states]})
            target = [reward + y * np.max(Qout1)]
            sess.run([optimiser, loss], feed_dict={input_: [states], Q_target: target})
I haven't had any error messages with this in a while; the ideal result would be [1,1,1,1] every time.
Thanks in advance for all of your help
PS: I couldn't think of a descriptive title for this, sorry.
The reward value should be the objective value after an action has been taken. In your case, you have set reward = states. Since your function is attempting to maximize reward, the closer to [1, 1, 1, 1] your state gets, the more reward your AI should receive. Perhaps a reward function such as reward = sum(states) will solve your problem.
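A quick sketch of that suggestion (the reward_fn name is illustrative; in your loop it would just replace reward = states with reward = sum(states)):

```python
def reward_fn(states):
    """Scalar reward: number of 1s in the state, so it grows toward the goal."""
    return sum(states)  # 0 at the start, 4 at the goal [1, 1, 1, 1]

print(reward_fn([0, 0, 0, 0]))  # 0
print(reward_fn([1, 1, 0, 1]))  # 3
print(reward_fn([1, 1, 1, 1]))  # 4
```

Note that this also fixes a shape issue: a scalar reward plus y * np.max(Qout1) is a single number, which is what a Q-learning target is usually built from.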