Why is my Deep Q Network not learning to play a simple game?

Question

I have made a small Python game in which the player has to reach the end while avoiding the traps. It looks like this:

I have tried many different batch sizes, rewards, input shapes, and numbers of nodes in the hidden layers, but the network still does not learn.

I am currently training with a batch size of 64 and a replay memory size of 100000. The input is a 1D array holding the game state, the player's coordinates, and the number of moves left before the game ends. The reward starts at -distanceFromEnd + maxDistance / 2. Reaching the end gives a reward of +500, touching a trap gives -100, and failing to finish within 64 moves gives -200; in each of those three cases the game is done.

I am using AdamOptimizer and an MSE loss function. All layers use ReLU activations except the last layer, which has no activation.

The positions of the player, the end, and the traps are all randomized after each episode.

The average score (the score is the sum of the rewards) over the last 100 games is around -30, even after 3000 episodes.
The same DQN works fine on the Gym environment LunarLander-v2.
As I said, I have been tweaking these values, but it did not help.

First, here are the labels that I use in the state:

  FLOOR = 1
  END = 2
  TRAP = 3
  PLAYER = 4

This is my step function:

  def step(self, action):
    isDone = False
    if action == 0:
      # Move Up
      if self.playerY != 0:
        self.playerY -= 1
    elif action == 1:
      # Move Down
      if self.playerY != 7:
        self.playerY += 1
    elif action == 2:
      # Move Right
      if self.playerX != 0:
        self.playerX -= 1
    elif action == 3:
      # Move Left
      if self.playerX != 7:
        self.playerX += 1

    x = self.playerX - self.endX
    x = x * x
    y = self.playerY - self.endY
    y = y * y

    distance = math.sqrt(x + y)
    reward = -distance + self.maxDist
    #self.lastDist = distance

    if self.state[self.playerX, self.playerY] == self.END:
      reward = 500
      isDone = True
    elif self.state[self.playerX, self.playerY] == self.TRAP:
      reward = -100
      isDone = True

    self.moves -= 1

    if self.moves < 0:
      reward = -200
      isDone = True

    return self.getFlatState(), reward, isDone, 0

State getter function:

  # Adding one to the player's coordinates to avoid 0s, as an attempt to fix the problem
  def getFlatState(self):
    return np.concatenate([np.ndarray.flatten(self.state), [self.playerX + 1, self.playerY + 1, self.moves]])

Here is the DQN/agent script:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model

class ReplayBuffer():
    def __init__(self, max_size, input_dims):
        self.mem_size = max_size
        self.mem_cntr = 0

        self.state_memory = np.zeros((self.mem_size, *input_dims),
                                     dtype=np.float32)
        self.new_state_memory = np.zeros((self.mem_size, *input_dims),
                                         dtype=np.float32)
        self.action_memory = np.zeros(self.mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(self.mem_size, dtype=np.float32)
        self.terminal_memory = np.zeros(self.mem_size, dtype=np.int32)

    def store_transition(self, state, action, reward, state_, done):
        # Overwrite the oldest entry once the buffer is full
        index = self.mem_cntr % self.mem_size
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.reward_memory[index] = reward
        self.action_memory[index] = action
        # Stored as a "not done" mask: 0 for terminal transitions, 1 otherwise
        self.terminal_memory[index] = 1 - int(done)
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)

        states = self.state_memory[batch]
        states_ = self.new_state_memory[batch]
        rewards = self.reward_memory[batch]
        actions = self.action_memory[batch]
        terminal = self.terminal_memory[batch]

        return states, actions, rewards, states_, terminal

def build_dqn(lr, n_actions, input_dims, fc1_dims, fc2_dims):
    model = keras.Sequential([
        keras.layers.Dense(fc1_dims, activation='relu'),
        keras.layers.Dense(fc2_dims, activation='relu'),
        keras.layers.Dense(n_actions, activation=None)])
    model.compile(optimizer=Adam(learning_rate=lr), loss='mean_squared_error')

    return model

class Agent():
    def __init__(self, lr, gamma, n_actions, epsilon, batch_size,
                 input_dims, epsilon_dec=1e-3, epsilon_end=0.01,
                 mem_size=1000000, fname='dqn_model.h5'):
        self.action_space = [i for i in range(n_actions)]
        self.gamma = gamma
        self.epsilon = epsilon
        self.eps_dec = epsilon_dec
        self.eps_min = epsilon_end
        self.batch_size = batch_size
        self.model_file = fname
        self.memory = ReplayBuffer(mem_size, input_dims)
        self.q_eval = build_dqn(lr, n_actions, input_dims, 256, 128)

    def store_transition(self, state, action, reward, new_state, done):
        self.memory.store_transition(state, action, reward, new_state, done)

    def choose_action(self, observation):
        # Epsilon-greedy: random action with probability epsilon
        if np.random.random() < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            state = np.array([observation])
            actions = self.q_eval.predict(state)

            action = np.argmax(actions)

        return action

    def learn(self):
        # Wait until the buffer holds at least one full batch
        if self.memory.mem_cntr < self.batch_size:
            return

        states, actions, rewards, states_, dones = \
            self.memory.sample_buffer(self.batch_size)

        q_eval = self.q_eval.predict(states)
        q_next = self.q_eval.predict(states_)

        q_target = np.copy(q_eval)
        batch_index = np.arange(self.batch_size, dtype=np.int32)

        # "dones" is the "not done" mask from the buffer (0 on terminal steps)
        q_target[batch_index, actions] = rewards + \
            self.gamma * np.max(q_next, axis=1) * dones

        self.q_eval.train_on_batch(states, q_target)

        # Linear epsilon decay down to eps_min
        self.epsilon = self.epsilon - self.eps_dec if self.epsilon > \
            self.eps_min else self.eps_min

    def save_model(self):
        self.q_eval.save(self.model_file)

    def load_model(self):
        self.q_eval = load_model(self.model_file)
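
For reference, the target that learn() regresses toward can be reproduced in plain NumPy, without the network. This is only a sketch: the function name td_targets and all the numbers in the toy batch are made up for illustration, and not_done stands for the 1 - int(done) mask that the replay buffer stores.

```python
import numpy as np

def td_targets(q_eval, q_next, actions, rewards, not_done, gamma):
    """Build the regression targets for a single-network DQN update."""
    q_target = np.copy(q_eval)
    batch_index = np.arange(len(actions))
    # Only the Q-value of the action actually taken is changed; the mask
    # zeroes the bootstrap term for terminal transitions.
    q_target[batch_index, actions] = (
        rewards + gamma * np.max(q_next, axis=1) * not_done
    )
    return q_target

# Toy batch of two transitions with made-up values (4 actions each).
q_eval = np.array([[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]])
q_next = np.array([[0.0, 10.0, 0.0, 0.0], [7.0, 0.0, 0.0, 0.0]])
actions = np.array([1, 2])
rewards = np.array([-1.0, 500.0])
not_done = np.array([1, 0])  # the second transition reached the end

targets = td_targets(q_eval, q_next, actions, rewards, not_done, gamma=0.9)
print(targets)
```

In the first row the chosen action's target is the reward plus the discounted best next value; in the second row the transition is terminal, so the target is the reward alone.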

Answer 1

The problem was that the goal position and the agent's initial position were not fixed: they changed every episode. Once they were fixed, as the OP reported, the agent started winning consistently, about "90% of the time".
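
One way to keep the layout fixed is to build it from a seeded random generator and reuse the same seed at every reset. The sketch below is an assumption about how the environment could be structured, not the OP's code: make_layout, FIXED_SEED, and n_traps are hypothetical names, and the 8x8 size is inferred from the bounds checks in the step function.

```python
import numpy as np

FLOOR, END, TRAP = 1, 2, 3  # labels from the question

def make_layout(seed, size=8, n_traps=6):
    """Return (state, player_xy, end_xy) for a layout determined by `seed`."""
    rng = np.random.default_rng(seed)
    state = np.full((size, size), FLOOR, dtype=np.int32)
    # Draw distinct cells for the player, the end, and the traps.
    cells = rng.choice(size * size, size=n_traps + 2, replace=False)
    coords = [(int(c) // size, int(c) % size) for c in cells]
    player_xy, end_xy, traps = coords[0], coords[1], coords[2:]
    state[end_xy] = END
    for trap_xy in traps:
        state[trap_xy] = TRAP
    return state, player_xy, end_xy

# Reusing one seed at every reset gives the agent the same task each episode.
FIXED_SEED = 0
state_a, player_a, end_a = make_layout(FIXED_SEED)
state_b, player_b, end_b = make_layout(FIXED_SEED)
print(np.array_equal(state_a, state_b), player_a == player_b)
```

Passing a different seed per episode would bring back the randomized version of the game, so the same function can be used to compare both settings.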

That is far from perfect, but I would not expect much more from a naive DQN. More advanced techniques such as A3C, or even DDQN (Double Deep Q-learning), should help you solve it; these are the kinds of methods used as the problems get more complex.
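
To make the DDQN suggestion concrete, here is a rough NumPy sketch of how its target differs from the plain DQN one. It assumes a second, periodically synced target network, which the OP's agent does not have; the function name and the toy numbers are made up for illustration.

```python
import numpy as np

def double_dqn_targets(q_next_online, q_next_target, rewards, not_done, gamma):
    """Double DQN: choose the next action with the online network,
    then evaluate that action with the target network."""
    best_actions = np.argmax(q_next_online, axis=1)
    batch_index = np.arange(len(rewards))
    chosen_values = q_next_target[batch_index, best_actions]
    return rewards + gamma * chosen_values * not_done

# Made-up network outputs for two next states with two actions each.
q_next_online = np.array([[1.0, 5.0], [2.0, 0.0]])
q_next_target = np.array([[3.0, 2.0], [4.0, 9.0]])
rewards = np.array([0.0, 1.0])
not_done = np.array([1, 1])

ddqn = double_dqn_targets(q_next_online, q_next_target, rewards, not_done, gamma=0.5)
print(ddqn)
```

Taking the maximum over a single network's own estimates would have used 3.0 and 9.0 here instead of 2.0 and 4.0; separating action selection from evaluation is what reduces that overestimation.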

Small, easy tasks that need little forward planning can also be handled with other methods, such as Monte Carlo. The main problem here, though, is that your obstacles are randomly generated, and a simple DQN does not work out in advance which path it should take to avoid the red areas that give a negative reward.
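
As a small illustration of the Monte Carlo idea, the return of each step can be computed from a finished episode instead of bootstrapping from a network estimate. This is a sketch with a hypothetical helper name and made-up rewards, not a full Monte Carlo control algorithm.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Made-up three-step episode that ends by reaching the goal (+500).
episode_returns = discounted_returns([1.0, 2.0, 500.0], gamma=0.5)
print(episode_returns)
```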

DQN is essentially Q-learning in which the values are stored in a compressed form, a neural network instead of a table, so that it can cover more states than a table could hold. As said before, that makes it unreliable for a problem this complex. Simply put, the solution is to use newer, more capable methods, several of which I have mentioned.
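
For comparison, this is roughly what the tabular Q-learning update looks like before the table is replaced by a network. The sketch assumes a fixed layout, so that the player's cell on an 8x8 grid (64 states) is enough to identify the state; the function name, alpha, and the example values are illustrative only.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha, gamma):
    """Apply one tabular Q-learning step in place and return the table."""
    # No bootstrapping from the next state once the episode has ended.
    bootstrap = 0.0 if done else gamma * np.max(q_table[next_state])
    td_error = reward + bootstrap - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return q_table

# 64 cells x 4 actions; one made-up value so the update has something to use.
q_table = np.zeros((64, 4))
q_table[9, 3] = 10.0
q_learning_update(q_table, state=8, action=3, reward=-1.0,
                  next_state=9, done=False, alpha=0.5, gamma=0.9)
print(q_table[8, 3])
```

With randomized traps and goals the player's cell alone no longer identifies the state, and a table over every possible layout becomes impractically large, which is where the network comes in.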

Answered 2020-04-23 11:52:10