
Markov Decision Process: value iteration, how does it work?

I've been reading a lot about Markov Decision Processes (using value iteration) lately, but I simply can't get my head around them. I've found a lot of resources on the Internet and in books, but they all use mathematical formulas that are far too complex for my current level.

Since this is my first year at college, I've found that the explanations and formulas provided on the web use notions and terms that are way too complicated for me, and they assume the reader knows certain things that I've simply never heard of.

I want to use it on a 2D grid filled with walls (impassable), coins (desirable) and moving enemies (which must be avoided at all costs). The whole goal is to collect all the coins without touching the enemies, and I want to create an AI for the main player using a Markov Decision Process (MDP). Here is what it partially looks like (note that the game-related aspect is not much of a concern here; I just really want to understand MDPs in general):

[Screenshot of the 2D grid game]

From what I understand, a crude simplification of MDPs is that they can create a grid which holds the direction we need to go in (a kind of grid of "arrows" pointing where we need to go, starting from a certain position on the grid) in order to reach certain goals and avoid certain obstacles. Specific to my situation, that would mean the player would know in which direction to go to collect the coins and avoid the enemies.

Now, using MDP terms, it would mean that it creates a collection of states (the grid) which holds a certain policy (the action to take: up, down, right, left) for each state (a position on the grid). The policies are determined by the "utility" value of each state, which itself is calculated by evaluating how beneficial getting there would be in the short and long term.

Is this correct? Or am I completely on the wrong track?

I'd at least like to know what the variables from the following equation represent in my situation:

U_{i+1}(s) \leftarrow R(s) + \gamma \max \sum_{s'} T(s,a,s') \, U_i(s')

(taken from the book "Artificial Intelligence - A Modern Approach" by Russell & Norvig)

I know that s would be the list of all the squares of the grid and a would be a specific action (up / down / right / left), but what about the rest?

How would the reward and utility functions be implemented?

It would be really great if someone knew of a simple link showing pseudo-code for a basic version similar to my situation, explained very slowly, because I don't even know where to start here.

Thank you for your precious time.

(Note: feel free to add / remove tags or tell me in the comments if I should give more details about something or anything like that.)

Yes, the mathematical notation can make it seem much more complicated than it is. Really, it is a very simple idea. I have implemented a value iteration demo applet that you can play with to get a better idea.

Basically, let's say you have a 2D grid with a robot in it. The robot can try to move North, South, East or West (those are the actions a), but, because its left wheel is slippery, when it tries to move North there is only a .9 probability that it will end up in the square North of it, and a .1 probability that it will end up in the square West of it (and similarly for the other 3 actions). These probabilities are captured by the T() function. Namely, T(s,A,s') will look like:

s    A      s'     T    //x=0,y=0 is at the top-left of the screen, y grows downwards
x,y  North  x,y-1  .9   //we do move north
x,y  North  x-1,y  .1   //wheels slipped, so we drift west instead
x,y  East   x+1,y  .9   //we do move east
x,y  East   x,y-1  .1   //wheels slipped, so we drift north
x,y  South  x,y+1  .9   //we do move south
x,y  South  x+1,y  .1   //wheels slipped, so we drift east
x,y  West   x-1,y  .9   //we do move west
x,y  West   x,y+1  .1   //wheels slipped, so we drift south
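If it helps to see that table as code, here is a rough sketch of one way to write it (this is not from the answer; the coordinate convention and all names are assumptions made for illustration):

public class TransitionSketch {
    // dx,dy for North, East, South, West; x grows to the right, y grows downwards
    static final int[][] MOVE = { {0, -1}, {1, 0}, {0, 1}, {-1, 0} };

    // T(s, a, ·) as in the table above: each row of the result is {x', y', probability}.
    // The intended move happens with probability .9; with probability .1 the wheel
    // slips and the robot drifts one square to the "left" of its heading.
    static double[][] transitions(int x, int y, int action) {
        int slip = (action + 3) % 4; // e.g. heading North (0) drifts West (3)
        return new double[][] {
            { x + MOVE[action][0], y + MOVE[action][1], 0.9 },
            { x + MOVE[slip][0],   y + MOVE[slip][1],   0.1 }
        };
    }

    public static void main(String[] args) {
        for (double[] row : transitions(5, 6, 0)) { // action 0 = North from square 5,6
            System.out.printf("s' = (%.0f,%.0f) with probability %.1f%n", row[0], row[1], row[2]);
        }
    }
}

For example, transitions(5, 6, 0) (trying to go North from square 5,6) returns (5,5) with probability .9 and (4,6) with probability .1, which is exactly the lookup the T(s,a,s') term in the update formula needs.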

You then set the Reward to be 0 for all states, but 100 for the goal state, that is, the location you want the robot to get to.

What value iteration does is start by giving a Utility of 100 to the goal state and 0 to all the other states. Then on the first iteration this 100 of utility gets distributed back 1 step from the goal, so all states that can get to the goal state in 1 step (all 4 squares right next to it) will get some utility. Namely, they will get a Utility proportional to the probability that from that state we can get to the goal state. We then continue iterating: at each step we move the utility back 1 more step away from the goal.

In the example above, say you start with R(5,5)= 100 and R(.) = 0 for all other states. So the goal is to get to 5,5.

On the first iteration we set

U(5,6) = R(5,6) + gamma * max over actions of sum over s' of T((5,6),a,s') * U(s') = 0 + gamma * (.9 * 100)

because from 5,6 the best action is North: with probability .9 it takes you to 5,5 (utility 100), and with probability .1 the wheels slip and you end up at 4,6, which still has utility 0. (Going East would reach 5,5 only through the .1 slip, so North is the action that achieves the max.)

Similarly for (5,4), (4,5), (6,5).

All other states remain with U = 0 after the first iteration of value iteration.
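Putting the loop together, here is a minimal value-iteration sketch (again not from the answer: the 10x10 grid, the number of sweeps, the treatment of edges as blocking, and the "drift to the left of the heading" slip rule are all assumptions made for illustration):

public class ValueIterationSketch {
    static final int W = 10, H = 10;     // grid size (assumption)
    static final double GAMMA = 0.9;     // discount factor
    // dx,dy for North, East, South, West; x grows to the right, y grows downwards
    static final int[][] MOVE = { {0, -1}, {1, 0}, {0, 1}, {-1, 0} };

    public static void main(String[] args) {
        double[][] R = new double[W][H]; // reward: 0 everywhere ...
        R[5][5] = 100;                   // ... except the goal
        double[][] U = new double[W][H]; // utilities, initially 0

        for (int iter = 0; iter < 100; iter++) {
            double[][] next = new double[W][H];
            for (int x = 0; x < W; x++) {
                for (int y = 0; y < H; y++) {
                    double best = Double.NEGATIVE_INFINITY;
                    for (int a = 0; a < 4; a++) {
                        int slip = (a + 3) % 4; // drift one square to the "left" of the heading
                        double expected =       // sum over s' of T(s,a,s') * U(s')
                              0.9 * U[clampX(x + MOVE[a][0])][clampY(y + MOVE[a][1])]
                            + 0.1 * U[clampX(x + MOVE[slip][0])][clampY(y + MOVE[slip][1])];
                        best = Math.max(best, expected);   // max over actions
                    }
                    next[x][y] = R[x][y] + GAMMA * best;   // the value-iteration update
                }
            }
            U = next;
        }
        System.out.printf("U(5,6) = %.2f, U(5,7) = %.2f%n", U[5][6], U[5][7]);
    }

    // bumping into the edge of the grid leaves the robot where it is
    static int clampX(int x) { return Math.max(0, Math.min(W - 1, x)); }
    static int clampY(int y) { return Math.max(0, Math.min(H - 1, y)); }
}

After the first sweep only the goal square has non-zero utility (its own reward); each further sweep pushes utility one more step outwards, and once the values stop changing, the action that achieves the max at each square gives you the grid of "arrows" the question asks about.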

I would recommend using Q-learning for your implementation.

Maybe you can use this post I wrote as an inspiration. It is a Q-learning demo with Java source code. The demo is a map with 6 fields, and the AI learns where it should go from every state to get to the reward.

Q-learning is a technique for letting the AI learn by itself by giving it reward or punishment.

This example shows Q-learning used for path finding. A robot learns where it should go from any state.

The robot starts at a random place and keeps a memory of the score while it explores the area; whenever it reaches the goal, we repeat with a new random start. After enough repetitions the score values become stationary (convergence).

In this example the action outcome is deterministic (the transition probability is 1) and the action selection is random. The score values are calculated by the Q-learning algorithm Q(s,a).
The image shows the states (A,B,C,D,E,F), possible actions from the states and the reward given. 图像显示了状态(A,B,C,D,E,F),来自状态的可能行为和给出的奖励。


Result Q*(s,a)

Policy Π*(s)

Qlearning.java

import java.text.DecimalFormat;
import java.util.Random;

/**
 * @author Kunuk Nykjaer
 */
public class Qlearning {
    final DecimalFormat df = new DecimalFormat("#.##");

    // path finding
    final double alpha = 0.1;
    final double gamma = 0.9;

    // states A,B,C,D,E,F
    // eg from A we can go to B or D
    // from C we can only go to C
    // C is goal state, reward 100 when B->C or F->C
    //
    // _______
    // |A|B|C|
    // |_____|
    // |D|E|F|
    // |_____|
    //
    final int stateA = 0;
    final int stateB = 1;
    final int stateC = 2;
    final int stateD = 3;
    final int stateE = 4;
    final int stateF = 5;

    final int statesCount = 6;
    final int[] states = new int[]{stateA, stateB, stateC, stateD, stateE, stateF};

    // http://en.wikipedia.org/wiki/Q-learning
    // http://people.revoledu.com/kardi/tutorial/ReinforcementLearning/Q-Learning.htm
    // Q(s,a) = Q(s,a) + alpha * (R(s,a) + gamma * Max(next state, all actions) - Q(s,a))

    int[][] R = new int[statesCount][statesCount];        // reward lookup
    double[][] Q = new double[statesCount][statesCount];  // Q learning

    int[] actionsFromA = new int[] { stateB, stateD };
    int[] actionsFromB = new int[] { stateA, stateC, stateE };
    int[] actionsFromC = new int[] { stateC };
    int[] actionsFromD = new int[] { stateA, stateE };
    int[] actionsFromE = new int[] { stateB, stateD, stateF };
    int[] actionsFromF = new int[] { stateC, stateE };
    int[][] actions = new int[][] { actionsFromA, actionsFromB, actionsFromC,
                                    actionsFromD, actionsFromE, actionsFromF };

    String[] stateNames = new String[] { "A", "B", "C", "D", "E", "F" };

    public Qlearning() {
        init();
    }

    public void init() {
        R[stateB][stateC] = 100; // from b to c
        R[stateF][stateC] = 100; // from f to c
    }

    public static void main(String[] args) {
        long BEGIN = System.currentTimeMillis();

        Qlearning obj = new Qlearning();
        obj.run();
        obj.printResult();
        obj.showPolicy();

        long END = System.currentTimeMillis();
        System.out.println("Time: " + (END - BEGIN) / 1000.0 + " sec.");
    }

    void run() {
        /*
         1. Set parameters and environment reward matrix R
         2. Initialize matrix Q as zero matrix
         3. For each episode: Select random initial state
            Do while not reach goal state
              o Select one among all possible actions for the current state
              o Using this possible action, consider to go to the next state
              o Get maximum Q value of this next state based on all possible actions
              o Compute
              o Set the next state as the current state
        */

        // For each episode
        Random rand = new Random();
        for (int i = 0; i < 1000; i++) { // train episodes
            // Select random initial state
            int state = rand.nextInt(statesCount);
            while (state != stateC) { // goal state
                // Select one among all possible actions for the current state
                int[] actionsFromState = actions[state];

                // Selection strategy is random in this example
                int index = rand.nextInt(actionsFromState.length);
                int action = actionsFromState[index];

                // Action outcome is set to deterministic in this example
                // Transition probability is 1
                int nextState = action; // data structure

                // Using this possible action, consider to go to the next state
                double q = Q(state, action);
                double maxQ = maxQ(nextState);
                int r = R(state, action);

                double value = q + alpha * (r + gamma * maxQ - q);
                setQ(state, action, value);

                // Set the next state as the current state
                state = nextState;
            }
        }
    }

    double maxQ(int s) {
        int[] actionsFromState = actions[s];
        double maxValue = Double.MIN_VALUE;
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[s][nextState];

            if (value > maxValue)
                maxValue = value;
        }
        return maxValue;
    }

    // get policy from state
    int policy(int state) {
        int[] actionsFromState = actions[state];
        double maxValue = Double.MIN_VALUE;
        int policyGotoState = state; // default goto self if not found
        for (int i = 0; i < actionsFromState.length; i++) {
            int nextState = actionsFromState[i];
            double value = Q[state][nextState];

            if (value > maxValue) {
                maxValue = value;
                policyGotoState = nextState;
            }
        }
        return policyGotoState;
    }

    double Q(int s, int a) {
        return Q[s][a];
    }

    void setQ(int s, int a, double value) {
        Q[s][a] = value;
    }

    int R(int s, int a) {
        return R[s][a];
    }

    void printResult() {
        System.out.println("Print result");
        for (int i = 0; i < Q.length; i++) {
            System.out.print("out from " + stateNames[i] + ": ");
            for (int j = 0; j < Q[i].length; j++) {
                System.out.print(df.format(Q[i][j]) + " ");
            }
            System.out.println();
        }
    }

    // policy is maxQ(states)
    void showPolicy() {
        System.out.println("\nshowPolicy");
        for (int i = 0; i < states.length; i++) {
            int from = states[i];
            int to = policy(from);
            System.out.println("from " + stateNames[from] + " goto " + stateNames[to]);
        }
    }
}

Print result

out from A: 0 90 0 72,9 0 0
out from B: 81 0 100 0 81 0
out from C: 0 0 0 0 0 0
out from D: 81 0 0 0 81 0
out from E: 0 90 0 72,9 0 90
out from F: 0 0 100 0 81 0

showPolicy
from a goto B
from b goto C
from c goto C
from d goto A
from e goto B
from f goto C
Time: 0.025 sec.

Not a complete answer, but a clarifying remark.

The state is not a single cell. The state contains the information about what is in every cell of interest, all at once. This means one state element contains the information about which cells are solid and which are empty, which ones contain monsters, where the coins are, and where the player is.

Maybe you could use a map from each cell to its content as the state. This does ignore the movement of monsters and the player, which is probably very important, too.

The details depend on how you want to model your problem (deciding what belongs to the state and in which form).

Then a policy maps each state to an action like left, right, jump, etc.
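To make that a bit more concrete, here is a minimal sketch (not from the answer; all names are made up, and Java records are used just for brevity) of a state that captures the whole board at once, and of a tabular policy over such states:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class StateSketch {
    // One snapshot of the whole board at once: that is the state.
    record Position(int x, int y) {}
    record GameState(Position player, Set<Position> coins,
                     Set<Position> enemies, Set<Position> walls) {}

    enum Action { UP, DOWN, LEFT, RIGHT }

    // A tabular policy: one chosen action per full state.
    static final Map<GameState, Action> policy = new HashMap<>();

    static Action act(GameState s) {
        return policy.getOrDefault(s, Action.UP); // fallback for a state we never evaluated
    }
}

Every different combination of player position, remaining coins and enemy positions is a different key in that map, which is why the state space grows very quickly and why simple grid examples often shrink the state down to just the player's position.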

First you must understand the problem that is expressed by an MDP before thinking about how algorithms like value iteration work.

I know this is a fairly old post, but I came across it when looking for MDP-related questions, and I did want to add (for folks coming in here) a few more comments about what "s" and "a" are.

I think for a you are absolutely correct: it's your list of [up, down, left, right].

However, s is really your current location in the grid, and s' is a location you can get to from it. What that means is that you take your state s, and for each action you go through all the s' that action could take you to, adding up T(s,a,s') * U_i(s') for each of them. That gives you one expected value per action.

Suppose you picked a grid cell in the corner; you'd only have 2 squares you could possibly move to (assuming the bottom-left corner). Depending on how you choose to "name" your states, we could in this case assume a state is an x,y coordinate, so your current state s is 1,1 and your s' list is x+1,y and x,y+1 (no diagonals in this example). (This is the summation part that goes over all s'.)

Also, you don't have it written out in your equation, but the max is over a, the action: once you have the expected value of each action, you keep the largest one, and that largest value is what goes into the update (at least this is my understanding of the algorithm).

So if, for your two reachable squares, the weighted terms T(s,a,s') * U(s') looked like

x,y+1 left  = 10
x,y+1 right = 5

x+1,y left  = 3
x+1,y right = 2

then the sum for "left" is 10 + 3 = 13 and the sum for "right" is 5 + 2 = 7, so the max over the actions picks "left" with a value of 13. The subtle point is that you are not looking for the single biggest term (the 10 here), but for the action whose sum over all of its s' is the biggest.

If your movements are deterministic (meaning if you say go forward, you go forward with 100% certainty), then it's pretty easy: each action has a single outcome. However, if they are non-deterministic, say with 80% certainty, then you should also consider the other squares the action could land you in. This is the context of the slippery wheel that Jose mentioned above.
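To see that order of operations in code, here is a tiny self-contained sketch (not taken from any of the answers; it just reuses the slippery-wheel numbers from the first answer, and every name in it is made up):

public class BellmanBackupSketch {
    public static void main(String[] args) {
        // From square (5,6): action North reaches the goal (U=100) with .9, drifts to a
        // 0-utility square with .1; action East reaches the goal only via the .1 slip.
        double[][] probs     = { {0.9, 0.1}, {0.1, 0.9} };     // [action][outcome]
        double[][] utilities = { {100.0, 0.0}, {100.0, 0.0} }; // U(s') for each outcome
        String[] actionNames = { "North", "East" };

        double gamma = 0.9, reward = 0.0;
        double best = Double.NEGATIVE_INFINITY;
        String bestAction = null;

        for (int a = 0; a < probs.length; a++) {
            double expected = 0.0;                 // first: sum over s' of T(s,a,s') * U(s')
            for (int o = 0; o < probs[a].length; o++) {
                expected += probs[a][o] * utilities[a][o];
            }
            if (expected > best) {                 // second: max over the actions
                best = expected;
                bestAction = actionNames[a];
            }
        }

        double updatedUtility = reward + gamma * best; // the value-iteration update
        System.out.println("best action = " + bestAction + ", new U(s) = " + updatedUtility);
    }
}

Running it prints North as the best action with a new utility of 81, which matches the gamma * .9 * 100 worked out in the first answer.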

I don't want to detract from what others have said, but just to give some additional information.
