
How should I define the state for my gridworld-like environment?

The problem I want to solve is actually not this simple; this is a kind of toy game to help me solve the bigger problem.

So I have a 5x5 matrix with all values equal to 0:

structure = np.zeros(25).reshape(5, 5)

The goal is for the agent to turn all the values into 1, so I have:

goal_structure = np.ones(25).reshape(5, 5)

I created a Player class with 5 actions: move left, right, up, down, or flip (turn a 0 into a 1 or a 1 into a 0). For the reward, if the agent changes a 0 into a 1, it gets a +1 reward. If it turns a 1 into a 0, it gets a negative reward (I tried many values from -1 to 0, or even -0.1). If it just moves left, right, up or down, it gets a reward of 0.

Because I want to feed the state to my neural net, I reshaped the state as below:

reshaped_structure = np.reshape(structure, (1, 25))

Then I append the normalized position of the agent to the end of this array (because I suppose the agent should have a sense of where it is):

reshaped_state = np.append(reshaped_structure, (np.float64(self.x/4), np.float64(self.y/4)))
state = reshaped_state

But I don't get any good results; it behaves as if it were random. I tried different reward functions and different optimization techniques, such as experience replay, a target net, Double DQN, and dueling DQN, but none of them seem to work! I guess the problem is with defining the state. Can anyone maybe help me with defining a good state?

Thanks a lot!

PS: this is my step function:

import numpy as np
from gym import spaces

# Module-level names the snippet relies on (restored from the surrounding context):
structure = np.zeros(25).reshape(5, 5)        # the 5x5 grid the agent edits
right, left, up, down, flip = 0, 1, 2, 3, 4   # action indices (assumed ordering)
x_min, y_min = 0, 0                           # lower grid bounds
x_threshold, y_threshold = 4, 4               # upper grid bounds for a 5x5 grid


class Player:

    def __init__(self):
        self.x = 0
        self.y = 0

        self.max_time_step = 50
        self.time_step = 0
        self.reward_list = []
        self.sum_reward_list = []
        self.sum_rewards = []

        self.gather_positions = []
        # self.dict = {}

        self.action_space = spaces.Discrete(5)
        self.observation_space = 27  # 25 cell values + 2 normalized position values

    def get_done(self, time_step):
        # The episode ends after a fixed number of steps.
        if time_step == self.max_time_step:
            done = True
        else:
            done = False

        return done

    def flip_pixel(self):
        # Toggle the cell under the agent between 0 and 1.
        if structure[self.x][self.y] == 1:
            structure[self.x][self.y] = 0.0
        elif structure[self.x][self.y] == 0:
            structure[self.x][self.y] = 1

    def step(self, action, time_step):

        reward = 0

        # Movement actions: move one cell, clamped to the grid bounds.
        if action == right:
            if self.y < y_threshold:
                self.y = self.y + 1
            else:
                self.y = y_threshold

        if action == left:
            if self.y > y_min:
                self.y = self.y - 1
            else:
                self.y = y_min

        if action == up:
            if self.x > x_min:
                self.x = self.x - 1
            else:
                self.x = x_min

        if action == down:
            if self.x < x_threshold:
                self.x = self.x + 1
            else:
                self.x = x_threshold

        if action == flip:
            self.flip_pixel()

            # +1 for turning a 0 into a 1, -0.1 for turning a 1 back into a 0.
            if structure[self.x][self.y] == 1:
                reward = 1
            else:
                reward = -0.1

        self.reward_list.append(reward)

        done = self.get_done(time_step)

        # State: flattened 5x5 grid plus the agent's normalized (x, y) position.
        reshaped_structure = np.reshape(structure, (1, 25))
        reshaped_state = np.append(reshaped_structure, (np.float64(self.x / 4), np.float64(self.y / 4)))
        state = reshaped_state

        return state, reward, done

    def reset(self):
        global structure  # without this, the assignment below only creates a local variable
        structure = np.zeros(25).reshape(5, 5)

        reset_reshaped_structure = np.reshape(structure, (1, 25))
        reset_reshaped_state = np.append(reset_reshaped_structure, (0, 0))
        state = reset_reshaped_state

        self.x = 0
        self.y = 0
        self.reward_list = []

        self.gather_positions = []
        # self.dict.clear()

        return state

I would encode the agent position as a matrix like this:

0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0

(where the agent is in the middle). Of course you have to flatten this too for the network. So your total state is 50 input values: 25 for the cell states and 25 for the agent position.
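As a rough sketch of what this 50-value state could look like in code (encode_state is a hypothetical helper name, not something from the question's code):

import numpy as np

def encode_state(structure, x, y):
    # Flatten the 5x5 cell values (25 inputs).
    cells = structure.reshape(25)

    # One-hot encode the agent position on a second 5x5 grid (25 more inputs).
    position = np.zeros((5, 5))
    position[x][y] = 1

    # Concatenate into a single 50-element state vector for the network.
    return np.concatenate([cells, position.reshape(25)])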

When you encode the position as two floats, the network has to do extra work decoding the exact values of those floats. If you use an explicit scheme like the one above, it is very clear to the network exactly where the agent is. This is a "one-hot" encoding of the position.

If you look at the Atari DQN papers, for example, the agent position is always explicitly encoded, with a neuron for each possible position.

Note also that a very good policy for your agent is to stand still and constantly flip the cell: it makes 0.45 reward per step by doing this (+1 for 0 to 1, -0.1 for 1 to 0, split over 2 steps). Assuming a perfect policy, it can only make 25, but this degenerate policy will make a reward of 22.5 and be very hard to unlearn. I would suggest that the agent gets a -1 for unflipping a good cell.
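A minimal sketch of that change, inside the flip branch of the step function above (same variable names as the question's code):

if action == flip:
    self.flip_pixel()

    if structure[self.x][self.y] == 1:
        reward = 1    # turned a 0 into a 1: progress toward the goal
    else:
        reward = -1   # undid a good cell: penalize as much as the gain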


You mention that the agent is not learning. Might I suggest that you try to simplify as much as possible? The first suggestion is: reduce the length of the episode to 2 or 3 steps, and reduce the size of the grid to 1. See if the agent can learn to consistently set the cell to 1. At the same time, simplify your agent's brain as much as possible: reduce it to just a single output layer, a linear model with an activation. This should be very quick and easy to learn. If the agent does not learn this within 100 episodes, I suspect there is a bug in your RL implementation. If it works, you can start to expand the size of the grid and the size of the network.
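For example, a minimal sketch of that simplified brain (PyTorch is an assumption here, since the question does not say which framework is used; with a 1x1 grid the whole state can be just the single cell value):

import torch
import torch.nn as nn

# Single output layer: a linear model mapping the state straight to one Q-value per action.
q_net = nn.Linear(1, 5)                       # 1 input: the cell value; 5 actions
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

state = torch.tensor([0.0])                   # the single cell starts at 0
q_values = q_net(state)                       # one Q-value per action
action = int(torch.argmax(q_values))          # greedy action for this state

If even this setup does not converge quickly, the problem is almost certainly in the training loop rather than in the state representation.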
