
Q-learning to learn minesweeping behavior

I am attempting to use Q-learning to learn minesweeping behavior on a discrete version of Mat Buckland's smart sweepers (the original is available here: http://www.ai-junkie.com/ann/evolved/nnt1.html) for an assignment. The assignment limits us to 50 iterations of 2000 moves each on a grid that is effectively 40x40, with the mines resetting and the agent being spawned in a random location each iteration.

I've attempted performing Q-learning with penalties for moving, rewards for sweeping mines, and penalties for not hitting a mine. The sweeper agent seems unable to learn how to sweep mines effectively within the 50 iterations because it learns that going to a specific cell is good, but once that mine is gone it is no longer rewarded for visiting the cell and instead only incurs the movement penalty.

I wanted to try providing a reward only when all the mines were cleared, in an attempt to make the environment static: there would then only be two conditions, all mines collected or not all mines collected. However, I am struggling to implement this because the agent has only 2000 moves per iteration and is able to backtrack, so it never manages to sweep all the mines within the limit, with or without rewards for collecting individual mines.

Another idea I had was to have an effectively new Q matrix for each mine, so that once a mine is collected the sweeper transitions to the corresponding matrix and operates from it, with the collected mine excluded from consideration.

Are there any better approaches I can take with this, or perhaps more practical tweaks to my own approach that I could try?

A more explicit explanation of the rules (a minimal environment sketch in code follows the list):

  • The map edges wrap around, so moving off the right edge of the map will cause the bot to appear on the left edge, etc.
  • The sweeper bot can move up, down, left, or right from any map tile.
  • When the bot collides with a mine, the mine is considered swept and is then removed.
  • The aim is for the bot to learn to sweep all mines on the map from any starting position.
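
As a minimal sketch of these rules in Python (the class and method names, and the mine count, are placeholders of my own, not part of the assignment):

    import random

    class SweeperEnv:
        """Toroidal grid: the agent moves in four directions and sweeps a mine on contact."""
        MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

        def __init__(self, size=40, n_mines=40):
            self.size = size
            self.n_mines = n_mines
            self.reset()

        def reset(self):
            cells = [(x, y) for x in range(self.size) for y in range(self.size)]
            self.mines = set(random.sample(cells, self.n_mines))
            # Spawn the agent on a random mine-free cell each iteration.
            self.agent = random.choice([c for c in cells if c not in self.mines])
            return self.agent

        def step(self, action):
            dx, dy = self.MOVES[action]
            x, y = self.agent
            # Edges wrap around, so moving off one side re-enters on the opposite side.
            self.agent = ((x + dx) % self.size, (y + dy) % self.size)
            swept = self.agent in self.mines
            if swept:
                self.mines.discard(self.agent)
            return self.agent, swept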

Given that the sweeper can always see the nearest mine, this should be pretty easy. From your question I assume your only problem is finding a good reward function and a representation for your agent state.

Defining a state

Absolute positions are rarely useful in a random environment, especially if the environment is infinite as in your example (since the bot can drive over the borders and respawn at the other side). This means that the size of the environment isn't needed for the agent to operate (we will actually need it to simulate the infinite space, though).

A reward function calculates its return value based on the current state of the agent compared to its previous state. But how do we define a state? Let's see what we actually need in order to operate the agent the way we want.

  1. The position of the agent.
  2. The position of the nearest mine.

That is all we need. Now, I said earlier that absolute positions are bad. This is because they make the Q table (you call it a Q matrix) static and very fragile to randomness. So let's try to completely eliminate absolute positions from the reward function and replace them with relative positions. Luckily, this is very simple in your case: instead of using the absolute positions, we use the relative position between the nearest mine and the agent.

Now we don't deal with coordinates anymore, but vectors. Let's calculate the vector between our points: v = pos_mine - pos_agent. This vector gives us two very important pieces of information:

  1. the direction in which the nearest mine is, and
  2. the distance to the nearest mine.

And these are all we need to make our agent operational. Therefore, an agent state can be defined as

State: Direction x Distance

where the distance is a floating-point value and the direction is either a float that describes the angle or a normalized vector. Since the map wraps around, the vector should be the shortest wrapped displacement.
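
A minimal sketch of this state computation, assuming the 40x40 wrapping grid from the question; the bucketing into eight compass directions and a capped Manhattan distance is just one possible discretization, chosen to keep the Q table small:

    import math

    def relative_vector(agent, mine, size=40):
        """Shortest displacement from the agent to the mine on a wrapping (toroidal) grid."""
        dx = (mine[0] - agent[0] + size // 2) % size - size // 2
        dy = (mine[1] - agent[1] + size // 2) % size - size // 2
        return dx, dy

    def state(agent, mine, size=40, max_dist=10):
        dx, dy = relative_vector(agent, mine, size)
        # Direction: bucket the angle into eight compass sectors (0..7).
        direction = int(round(math.atan2(dy, dx) / (math.pi / 4))) % 8
        # Distance: Manhattan distance, capped so the state space stays small.
        distance = min(abs(dx) + abs(dy), max_dist)
        return direction, distance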

Defining a reward function

Given our newly defined state, the only thing we care about in our reward function is the distance. Since all we want is to move the agent towards mines, the distance is all that matters. Here are a few guesses at how the reward function could work:

  1. If the agent sweeps a mine (distance == 0), return a huge reward (e.g. 100).
  2. If the agent moves towards a mine (the distance is shrinking), return a neutral (or small) reward (e.g. 0).
  3. If the agent moves away from a mine (the distance is increasing), return a negative reward (e.g. -1).

Theoretically, since we penalize moving away from a mine, we don't even need rule 1 here.
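
A minimal sketch of such a reward function, using the example values above (treating an unchanged distance the same as moving away is my own arbitrary choice):

    def reward(prev_distance, distance):
        """Reward based only on how the distance to the nearest mine changed."""
        if distance == 0:              # rule 1: the agent swept a mine
            return 100
        if distance < prev_distance:   # rule 2: the agent moved towards the mine
            return 0
        return -1                      # rule 3: the agent moved away (or the distance is unchanged)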

Conclusion

The only thing left is determining a good learning rate and discount so that your agent performs well after 50 iterations. But, given the simplicity of the environment, this shouldn't even matter that much. Experiment.
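
For reference, here is a sketch of the tabular Q-learning loop that ties the pieces above together, with learning rate alpha and discount gamma; the epsilon-greedy exploration and the specific hyperparameter values are assumptions on my part, not recommendations:

    import random
    from collections import defaultdict

    def train(env, episodes=50, moves=2000, alpha=0.1, gamma=0.9, epsilon=0.1):
        actions = list(env.MOVES)                      # "up", "down", "left", "right"
        Q = defaultdict(lambda: {a: 0.0 for a in actions})

        def nearest_mine(agent):
            return min(env.mines, key=lambda m: sum(map(abs, relative_vector(agent, m, env.size))))

        for _ in range(episodes):
            agent = env.reset()
            s = state(agent, nearest_mine(agent), env.size)
            for _ in range(moves):
                # Epsilon-greedy action selection.
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(Q[s], key=Q[s].get)
                agent, swept = env.step(a)
                if not env.mines:
                    # Last mine swept: apply a terminal update with the sweep reward and stop.
                    Q[s][a] += alpha * (100 - Q[s][a])
                    break
                s2 = state(agent, nearest_mine(agent), env.size)
                # Use the (capped) distance component of the state as the previous distance.
                r = 100 if swept else reward(s[1], s2[1])
                # Standard Q-learning update with learning rate alpha and discount gamma.
                Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
                s = s2
        return Q

With eight directions and eleven capped distance buckets the table has fewer than a hundred states, so 50 iterations of 2000 moves should be more than enough to populate it.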
