Why is my Deep Q Network not learning to play a simple game?
I have made a small Python game in which the player has to reach the end and avoid the traps, and it looks like this:
I have tried many different batch sizes, rewards, input shapes, and numbers of nodes in the hidden layers, but the network is still not learning.
Currently I train with a batch size of 64 and a replay memory of 100000. The input is a 1D array containing the game state, the player's coordinates, and the number of moves left before the game ends. The reward for an ordinary step is -distanceFromEnd + maxDistance / 2. If you reach the end you get a reward of +500 and the game is done; if you touch a trap you get -100 and the game is done; if the game is not finished within 64 moves you get -200 and the game is done.
I'm using the Adam optimizer with an MSE loss. The activation function is ReLU for every layer except the last, which has no activation.
The positions of the player, the end, and the traps are all randomized after each episode.
The average score (the sum of the rewards) over the last 100 games is around -30, even after 3000 episodes.
The same DQN works fine on the Gym environment LunarLander-v2.
As I said, I have been trying to tweak the values, but it didn't help.
First, here are the labels that I use in the state:
FLOOR = 1
END = 2
TRAP = 3
PLAYER = 4
This is my step function:
def step(self, action):
    isDone = False
    if action == 0:
        # Move Up
        if self.playerY != 0:
            self.playerY -= 1
    elif action == 1:
        # Move Down
        if self.playerY != 7:
            self.playerY += 1
    elif action == 2:
        # Move Right
        if self.playerX != 0:
            self.playerX -= 1
    elif action == 3:
        # Move Left
        if self.playerX != 7:
            self.playerX += 1
    x = self.playerX - self.endX
    x = x * x
    y = self.playerY - self.endY
    y = y * y
    distance = math.sqrt(x + y)
    reward = -distance + self.maxDist
    #self.lastDist = distance
    if self.state[self.playerX, self.playerY] == self.END:
        reward = 500
        isDone = True
    elif self.state[self.playerX, self.playerY] == self.TRAP:
        reward = -100
        isDone = True
    self.moves -= 1
    if self.moves < 0:
        reward = -200
        isDone = True
    return self.getFlatState(), reward, isDone, 0
State getter function:
# Adding one to the player's coordinates to avoid 0s, as an attempt to fix the problem
def getFlatState(self):
    return np.concatenate([np.ndarray.flatten(self.state), [self.playerX + 1, self.playerY + 1, self.moves]])
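To make the shape of that input concrete, here is a minimal standalone sketch of the same layout. It assumes an 8x8 board (inferred from the 0..7 bounds in `step`), and `build_flat_state` is a hypothetical helper name, not something from the original code.

```python
import numpy as np

# Tile labels copied from the question.
FLOOR, END, TRAP, PLAYER = 1, 2, 3, 4

# Assumption: an 8x8 board, inferred from the 0..7 bounds checked in step().
GRID_SIZE = 8


def build_flat_state(grid, player_x, player_y, moves_left):
    """Mirror getFlatState(): flattened grid, then shifted coordinates, then moves left."""
    extras = [player_x + 1, player_y + 1, moves_left]
    return np.concatenate([grid.flatten(), extras])


# Hypothetical layout: all floor, one end tile, one trap.
grid = np.full((GRID_SIZE, GRID_SIZE), FLOOR)
grid[7, 7] = END
grid[3, 4] = TRAP

flat = build_flat_state(grid, player_x=0, player_y=0, moves_left=64)
print(flat.shape)  # 64 grid cells + 3 extra values
```

Under that assumption the network input has 67 entries, so `input_dims` would be `(67,)`.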
Here's the DQN/Agent script:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model
class ReplayBuffer():
    def __init__(self, max_size, input_dims):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((self.mem_size, *input_dims),
                                     dtype=np.float32)
        self.new_state_memory = np.zeros((self.mem_size, *input_dims),
                                         dtype=np.float32)
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.int32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.reward_memory[index] = reward
        self.action_memory[index] = action
        self.terminal_memory[index] = 1 - int(done)
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)
        states = self.state_memory[batch]
        states_ = self.new_state_memory[batch]
        rewards = self.reward_memory[batch]
        actions = self.action_memory[batch]
        terminal = self.terminal_memory[batch]
        return states, actions, rewards, states_, terminal

def build_dqn(lr, n_actions, input_dims, fc1_dims, fc2_dims):
    model = keras.Sequential([
        keras.layers.Dense(fc1_dims, activation='relu'),
        keras.layers.Dense(fc2_dims, activation='relu'),
        keras.layers.Dense(n_actions, activation=None)])
    model.compile(optimizer=Adam(learning_rate=lr), loss='mean_squared_error')
    return model

class Agent():
    def __init__(self, lr, gamma, n_actions, epsilon, batch_size,
                 input_dims, epsilon_dec=1e-3, epsilon_end=0.01,
                 mem_size=1000000, fname='dqn_model.h5'):
        self.action_space = [i for i in range(n_actions)]
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_dec = epsilon_dec
        self.eps_min = epsilon_end
        self.batch_size = batch_size
        self.model_file = fname
        self.memory = ReplayBuffer(mem_size, input_dims)
        self.q_eval = build_dqn(lr, n_actions, input_dims, 256, 128)

    def store_transition(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    def choose_action(self, observation):
        if np.random.random() < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            state = np.array([observation])
            actions = self.q_eval.predict(state)
            action = np.argmax(actions)
        return action

    def learn(self):
        if self.memory.mem_cntr < self.batch_size:
            return
        states, actions, rewards, states_, dones = \
            self.memory.sample_buffer(self.batch_size)
        q_eval = self.q_eval.predict(states)
        q_next = self.q_eval.predict(states_)
        q_target = np.copy(q_eval)
        batch_index = np.arange(self.batch_size, dtype=np.int32)
        q_target[batch_index, actions] = rewards + \
            self.gamma * np.max(q_next, axis=1)*dones
        self.q_eval.train_on_batch(states, q_target)
        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > \
            self.eps_min else self.eps_min

    def save_model(self):
        self.q_eval.save(self.model_file)

    def load_model(self):
        self.q_eval = load_model(self.model_file)
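For readers following the target computation in `learn`, here is a small worked example in plain NumPy, with no Keras model involved. The Q-values, rewards, and the helper name `build_q_targets` are made up for illustration; only the formula is taken from the script above. Note that the value the script calls `dones` is really a "not done" mask, because the buffer stores `1 - int(done)`.

```python
import numpy as np


def build_q_targets(q_eval, q_next, actions, rewards, not_done, gamma):
    """Same update as in learn(): only the entry for the action taken is overwritten."""
    q_target = np.copy(q_eval)
    batch_index = np.arange(len(actions))
    q_target[batch_index, actions] = rewards + gamma * np.max(q_next, axis=1) * not_done
    return q_target


# Hypothetical batch of two transitions with four actions each.
q_eval = np.array([[1.0, 2.0, 3.0, 4.0],
                   [0.5, 0.5, 0.5, 0.5]])
q_next = np.array([[10.0, 0.0, 0.0, 0.0],
                   [7.0, 8.0, 9.0, 6.0]])
actions = np.array([1, 2])
rewards = np.array([-4.0, 500.0])
not_done = np.array([1, 0])  # the second transition ended the episode

targets = build_q_targets(q_eval, q_next, actions, rewards, not_done, gamma=0.5)
print(targets)
```

In the first row the target for action 1 becomes -4 + 0.5 * 10 = 1. In the second row the episode ended, so the bootstrap term is masked out and the target is just the reward of 500. All other entries equal the current predictions, so they contribute zero to the MSE loss.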
The problem was that the goal position and the agent's initial position were not stationary. When they are fixed, as reported by the OP, the agent starts winning consistently, about "90% of the time".
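A minimal sketch of what "fixed" could look like: the layout is generated once and every reset returns the same board. The 8x8 size, the trap count, and the names `make_layout` and `FixedLayout` are all assumptions for illustration, since the OP's reset code is not shown.

```python
import numpy as np

FLOOR, END, TRAP = 1, 2, 3
GRID_SIZE = 8  # assumption: inferred from the 0..7 bounds in step()


def make_layout(rng, n_traps=6):
    """Pick distinct cells for the player start, the end tile, and the traps."""
    cells = rng.choice(GRID_SIZE * GRID_SIZE, size=n_traps + 2, replace=False)
    coords = [(int(c) // GRID_SIZE, int(c) % GRID_SIZE) for c in cells]
    player, end, traps = coords[0], coords[1], coords[2:]
    grid = np.full((GRID_SIZE, GRID_SIZE), FLOOR)
    grid[end] = END
    for trap in traps:
        grid[trap] = TRAP
    return grid, player, end


class FixedLayout:
    """Generates the layout once; reset() always hands back the same board."""

    def __init__(self, seed=0):
        self._grid, self._player, self._end = make_layout(np.random.default_rng(seed))

    def reset(self):
        return self._grid.copy(), self._player, self._end


layout = FixedLayout(seed=0)
grid_a, player_a, end_a = layout.reset()
grid_b, player_b, end_b = layout.reset()
print(np.array_equal(grid_a, grid_b), player_a == player_b, end_a == end_b)
```

Calling `make_layout` inside `reset` instead would reproduce the randomized setup from the question.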
Though that is far from perfect, I wouldn't expect much from a naive DQN. Using more advanced techniques such as A3C or even DDQN (Double Deep Q-learning) should help you solve it.
More advanced techniques become necessary as the problems get more complex.
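As a rough illustration of how DDQN differs from the update in the question, here is a sketch of the Double DQN target in NumPy. It assumes two sets of Q-values for the next states, one from the online network and one from a separate target network; the script in the question has only one network, so that second network and the name `double_dqn_targets` are assumptions, and the numbers are made up.

```python
import numpy as np


def double_dqn_targets(q_online_next, q_target_next, rewards, not_done, gamma):
    """Online network picks the next action, target network evaluates it."""
    best_actions = np.argmax(q_online_next, axis=1)
    batch_index = np.arange(len(rewards))
    next_values = q_target_next[batch_index, best_actions]
    return rewards + gamma * next_values * not_done


# Hypothetical next-state Q-values for a batch of two transitions.
q_online_next = np.array([[1.0, 5.0, 2.0, 0.0],
                          [3.0, 1.0, 0.0, 2.0]])
q_target_next = np.array([[4.0, 2.0, 9.0, 1.0],
                          [6.0, 7.0, 8.0, 5.0]])
rewards = np.array([1.0, -100.0])
not_done = np.array([1, 0])  # the second transition hit a trap

targets = double_dqn_targets(q_online_next, q_target_next, rewards, not_done, gamma=0.5)
print(targets)
```

For the first transition the online network prefers action 1, which the target network values at 2, giving 1 + 0.5 * 2 = 2. Taking the plain maximum of the target network's values (9) would give 5.5 instead, and that overestimation is what Double DQN is meant to reduce.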
Small and easy tasks that need little planning ahead can be handled with a range of other methods, such as Monte Carlo. But the main problem here is that your obstacles are randomly generated, and a simple DQN does not work out in advance which path it should take to avoid the red areas, which give a negative reward.
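For context on the Monte Carlo remark: those methods learn from the full return of a finished episode rather than from a bootstrapped one-step target. A minimal sketch of computing those returns follows; the reward sequence is made up and `discounted_returns` is a hypothetical helper name.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one finished episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical three-step episode that ends by reaching the goal.
episode_rewards = [1.0, 2.0, 500.0]
returns = discounted_returns(episode_rewards, gamma=0.5)
print(returns)  # [127.0, 252.0, 500.0]
```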
DQN is essentially Q-learning, but the values are stored in a more compressed form (a neural network instead of a table) so that it can cover more states than a table could. So, as said before, it is not reliable for problems this complex. Simply put, the solution is to use more complex and newer methods, several of which are mentioned above.
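To show what "essentially Q-learning" refers to, here is a sketch of the tabular update that DQN approximates with a network. The table size (64 states, 4 actions), the hyperparameters, and the name `q_learning_update` are assumptions for illustration. With randomized end and trap positions the true state is the whole board rather than just the player's cell, so a table this small would not cover the game in the question.

```python
import numpy as np


def q_learning_update(q_table, state, action, reward, next_state, done, alpha, gamma):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    bootstrap = 0.0 if done else gamma * np.max(q_table[next_state])
    td_target = reward + bootstrap
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table


# Hypothetical table: one row per cell of an 8x8 board, one column per action.
q_table = np.zeros((64, 4))
q_table[1] = [0.0, 2.0, 0.0, 0.0]

q_table = q_learning_update(q_table, state=0, action=3, reward=1.0,
                            next_state=1, done=False, alpha=0.5, gamma=0.5)
print(q_table[0, 3])  # 0 + 0.5 * ((1 + 0.5 * 2) - 0) = 1.0
```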