
Value iteration not converging - Markov decision process

I am having an issue with the results I get from value iteration: the numbers keep growing towards infinity, so I assume there is a problem somewhere in my logic.

Initially I have a 10x10 grid, some tiles with a reward of +10, some with a reward of -100, and the rest with a reward of 0. There are no terminal states. The agent can perform four non-deterministic actions: move up, down, left, and right. It has an 80% chance of moving in the chosen direction and a 10% chance of moving in each of the two perpendicular directions.

My process is to loop over the following:

  • For every tile, calculate the value of the best action from that tile

For example, to calculate the value of going north from a given tile:

self.northVal = 0
self.northVal += (0.1 * grid[x-1][y])   # 10% chance of slipping to one side
self.northVal += (0.1 * grid[x+1][y])   # 10% chance of slipping to the other side
self.northVal += (0.8 * grid[x][y+1])   # 80% chance of actually moving north
  • For every tile, update its value to be: the initial reward + ( 0.5 * the value of the best move for that tile )
  • Check whether the updated grid has changed since the last loop, and if not, stop the loop, as the numbers have converged. (A compact sketch of this whole loop is shown below.)
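
In case it helps, here is a compact, self-contained sketch of how that loop is structured. The reward layout, the clamping of moves at the grid edges, and the helper names are simplified stand-ins rather than my actual code:

import numpy as np

SIZE, DISCOUNT = 10, 0.5
rewards = np.zeros((SIZE, SIZE))   # a few tiles would be set to +10 or -100 here
values = np.zeros((SIZE, SIZE))

def clamp(i):
    # keep an index on the board so moves at the edge stay inside the grid
    return min(max(i, 0), SIZE - 1)

def best_action_value(grid, x, y):
    # each direction: 80% intended move, 10% for each perpendicular slip,
    # computed exactly like northVal above
    north = 0.8 * grid[x][clamp(y + 1)] + 0.1 * grid[clamp(x - 1)][y] + 0.1 * grid[clamp(x + 1)][y]
    south = 0.8 * grid[x][clamp(y - 1)] + 0.1 * grid[clamp(x - 1)][y] + 0.1 * grid[clamp(x + 1)][y]
    east = 0.8 * grid[clamp(x + 1)][y] + 0.1 * grid[x][clamp(y + 1)] + 0.1 * grid[x][clamp(y - 1)]
    west = 0.8 * grid[clamp(x - 1)][y] + 0.1 * grid[x][clamp(y + 1)] + 0.1 * grid[x][clamp(y - 1)]
    return max(north, south, east, west)

while True:
    new_values = np.zeros((SIZE, SIZE))
    for x in range(SIZE):
        for y in range(SIZE):
            # step 2: initial reward + 0.5 * value of the best move from this tile
            new_values[x][y] = rewards[x][y] + DISCOUNT * best_action_value(values, x, y)
    # step 3: stop once a full sweep no longer changes anything
    if np.array_equal(new_values, values):
        break
    values = new_values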

I would appreciate any guidance!

What you're trying to do here is not Value Iteration: value iteration works with a state value function, where you store one value for each state. This means that, in value iteration, you don't keep a separate estimate for each (state, action) pair.

Please refer to the second edition of the Sutton and Barto book (Section 4.4) for the full explanation, but here is the algorithm for quick reference. Note the initialization step: you only need a vector storing the value for each state.

(Image: the Value Iteration algorithm, from Sutton and Barto)
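
Since the pseudocode above is shown as an image, here is a rough Python rendering of the same algorithm for the grid world in the question. The discount factor, the stopping threshold, and the choice to treat the reward as attached to the current tile are my assumptions for this sketch, not part of the book's pseudocode:

import numpy as np

SIZE, GAMMA, THETA = 10, 0.9, 1e-6    # discount and stopping threshold are assumptions
rewards = np.zeros((SIZE, SIZE))      # fill in the +10 / -100 tiles for your grid

# 80% chance of the chosen direction, 10% for each perpendicular direction
ACTIONS = {
    "up":    [(0.8, (0, 1)),  (0.1, (-1, 0)), (0.1, (1, 0))],
    "down":  [(0.8, (0, -1)), (0.1, (-1, 0)), (0.1, (1, 0))],
    "left":  [(0.8, (-1, 0)), (0.1, (0, -1)), (0.1, (0, 1))],
    "right": [(0.8, (1, 0)),  (0.1, (0, -1)), (0.1, (0, 1))],
}

def step(x, y, dx, dy):
    # next state; a move that would leave the grid keeps the agent where it is
    nx, ny = x + dx, y + dy
    if 0 <= nx < SIZE and 0 <= ny < SIZE:
        return nx, ny
    return x, y

V = np.zeros((SIZE, SIZE))            # one value per state -- the only table you keep
while True:
    delta = 0.0
    for x in range(SIZE):
        for y in range(SIZE):
            old = V[x, y]
            # V(s) <- max_a sum_{s'} p(s' | s, a) * (r(s) + gamma * V(s'))
            V[x, y] = max(
                sum(p * (rewards[x, y] + GAMMA * V[step(x, y, dx, dy)])
                    for p, (dx, dy) in outcomes)
                for outcomes in ACTIONS.values()
            )
            delta = max(delta, abs(old - V[x, y]))
    if delta < THETA:                 # stop when a full sweep barely changes V
        break

Note that each state's new value overwrites the old one during the sweep, and the only stored quantities are V and the rewards; the four action values exist only transiently inside the max.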
