Value iteration not converging - Markov decision process

I am having an issue with the results I get from value iteration: the numbers increase to infinity, so I assume there is a problem somewhere in my logic.

Initially I have a 10x10 grid: some tiles have a reward of +10, some a reward of -100, and some a reward of 0. There are no terminal states. The agent can perform 4 non-deterministic actions: move up, down, left, and right. It has an 80% chance of moving in the chosen direction and a 20% chance of moving perpendicular to it (10% to each side).
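The movement model described above can be sketched as a small function that lists the possible outcomes of one action. This is only an illustration: the names `ACTIONS`, `PERPENDICULAR`, and `transitions` are hypothetical, and the rule that a move off the grid leaves the agent in place is an assumption, since the question does not say what happens at the edges.

```python
# Hypothetical names, for illustration only.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}


def transitions(row, col, action, size=10):
    """Return a list of (probability, (row, col)) outcomes for one action."""
    def step(direction):
        d_row, d_col = ACTIONS[direction]
        new_row, new_col = row + d_row, col + d_col
        # Assumption: a move that would leave the grid keeps the agent in place.
        if 0 <= new_row < size and 0 <= new_col < size:
            return (new_row, new_col)
        return (row, col)

    side_a, side_b = PERPENDICULAR[action]
    # 80% intended direction, 10% to each perpendicular side.
    return [(0.8, step(action)), (0.1, step(side_a)), (0.1, step(side_b))]
```

For example, `transitions(5, 5, "up")` lists the tile above with probability 0.8 and the tiles to the left and right with probability 0.1 each.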

My process is to loop over the following:

  • For every tile, calculate the value of the best action from that tile

For example, to calculate the value of going north from a given tile:

self.northVal = 0
self.northVal += (0.1 * grid[x-1][y])
self.northVal += (0.1 * grid[x+1][y])
self.northVal += (0.8 * grid[x][y+1])
  • For every tile, update its value to be: the initial reward + ( 0.5 * the value of the best move for that tile )
  • Check whether the updated grid has changed since the last loop, and if not, stop the loop, as the numbers have converged.

I would appreciate any guidance!

What you're trying to do here is not Value Iteration: value iteration works with a state value function, where you store a value for each state. This means that, in value iteration, you don't keep an estimate for each (state, action) pair.
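For reference, the value iteration update in Sutton and Barto's notation can be written as below, where $p$ is the transition model, $r$ the reward, and $\gamma$ the discount factor. The maximum over actions is computed on the fly, so only one number per state is stored.

```latex
V(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a) \, \bigl[ r + \gamma V(s') \bigr]
```

With a reward that depends only on the current tile, as in the question, this reduces to $V(s) \leftarrow R(s) + \gamma \max_{a} \sum_{s'} p(s' \mid s, a) \, V(s')$.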

Please refer to the 2nd edition of the Sutton and Barto book (Section 4.4) for an explanation, but here's the algorithm for quick reference. Note the initialization step: you only need a vector storing the value for each state.

[Figure: value iteration algorithm]
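A minimal Python sketch of value iteration for the grid in the question might look like the following. It is not the book's pseudocode verbatim: it adapts the update to a per-tile reward, keeps the fixed `rewards` grid separate from the evolving `values` grid, and uses the question's 0.5 discount. The function names, the threshold `THETA`, and the rule that a move off the grid leaves the agent in place are all assumptions.

```python
SIZE = 10
GAMMA = 0.5    # discount factor taken from the question
THETA = 1e-6   # convergence threshold (assumed)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIDES = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}


def step(row, col, direction):
    """Next tile for a move; assumed to stay in place at the grid edge."""
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < SIZE and 0 <= new_col < SIZE:
        return new_row, new_col
    return row, col


def action_value(values, row, col, action):
    """Expected next-state value; computed on the fly, never stored."""
    outcomes = ((0.8, action), (0.1, SIDES[action][0]), (0.1, SIDES[action][1]))
    total = 0.0
    for prob, direction in outcomes:
        next_row, next_col = step(row, col, direction)
        total += prob * values[next_row][next_col]
    return total


def value_iteration(rewards):
    """Return the converged state values for a SIZE x SIZE reward grid."""
    values = [[0.0] * SIZE for _ in range(SIZE)]  # one value per state
    while True:
        delta = 0.0
        for row in range(SIZE):
            for col in range(SIZE):
                best = max(action_value(values, row, col, a) for a in MOVES)
                # The reward always comes from the fixed rewards grid.
                new_value = rewards[row][col] + GAMMA * best
                delta = max(delta, abs(new_value - values[row][col]))
                values[row][col] = new_value  # in-place update
        if delta < THETA:
            return values
```

Because the discount is below 1 and the rewards are bounded, every value stays within max|reward| / (1 - GAMMA), so the loop cannot run off to infinity. If the numbers do diverge, it is worth checking whether the reward term is being read from the grid that is also being overwritten with values.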
