
Convergence of value iteration

Why is the termination condition of the value-iteration algorithm (for example, http://aima-java.googlecode.com/svn/trunk/aima-core/src/main/java/aima/core/probability/mdp/search/ValueIteration.java) for an MDP (Markov Decision Process)

||U_{i+1} - U_i|| < error * (1 - gamma) / gamma

where

U_i is the vector of utilities at iteration i,
U_{i+1} is the updated vector of utilities,
error is the error bound used in the algorithm,
gamma is the discount factor used in the algorithm?

Where does error * (1 - gamma) / gamma come from? Is the division by gamma because every step is discounted by gamma? But why the factor (1 - gamma)? And how large should error be?
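
For concreteness, here is a minimal sketch of value iteration with exactly this stopping rule. The toy 3-state MDP, the array names (trans, reward), and the constants are illustrative assumptions for this sketch, not the aima-java API:

    import java.util.Arrays;

    public class ValueIterationSketch {
        // Toy 3-state, 2-action MDP (illustrative values, not from aima-java).
        // trans[a][s][sp] = P(sp | s, a); each row sums to 1.
        static final double[][][] trans = {
            { {0.9, 0.1, 0.0}, {0.0, 0.9, 0.1}, {0.0, 0.0, 1.0} }, // action 0
            { {0.5, 0.5, 0.0}, {0.5, 0.0, 0.5}, {0.0, 0.0, 1.0} }  // action 1
        };
        static final double[] reward = { -0.04, -0.04, 1.0 }; // R(s)

        public static void main(String[] args) {
            double gamma = 0.9;  // discount factor
            double error = 1e-4; // desired bound on the final utility error
            int n = reward.length;

            double[] U = new double[n];
            double delta; // ||U_{i+1} - U_i|| in the max norm
            do {
                double[] Unew = new double[n];
                delta = 0.0;
                for (int s = 0; s < n; s++) {
                    // Bellman update: U'(s) = R(s) + gamma * max_a sum_sp P(sp|s,a) * U(sp)
                    double best = Double.NEGATIVE_INFINITY;
                    for (double[][] ta : trans) {
                        double q = 0.0;
                        for (int sp = 0; sp < n; sp++) q += ta[s][sp] * U[sp];
                        best = Math.max(best, q);
                    }
                    Unew[s] = reward[s] + gamma * best;
                    delta = Math.max(delta, Math.abs(Unew[s] - U[s]));
                }
                U = Unew;
            } while (delta >= error * (1 - gamma) / gamma); // the condition in question

            System.out.println(Arrays.toString(U));
        }
    }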

That quantity is called the Bellman error, or Bellman residual.

See Williams and Baird, 1993, for its use in MDPs.

See Littman, 1994, for its use in POMDPs.
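
To fill in the derivation the question asks about: the Bellman update B is a contraction by a factor of gamma in the max norm, and the true utility vector U* is its fixed point (B U* = U*). This is the standard argument given in Russell and Norvig's AIMA, which the aima-java code implements; in LaTeX notation, with \epsilon standing for the algorithm's error parameter:

    \|U_{i+1} - U^*\| = \|B U_i - B U^*\|
                      \le \gamma \|U_i - U^*\|                                        % contraction
                      \le \gamma \left( \|U_i - U_{i+1}\| + \|U_{i+1} - U^*\| \right) % triangle inequality

    (1 - \gamma)\,\|U_{i+1} - U^*\| \le \gamma\,\|U_{i+1} - U_i\|
                                    < \gamma \cdot \frac{\epsilon (1 - \gamma)}{\gamma}
                                    = \epsilon (1 - \gamma)
    \quad\Longrightarrow\quad \|U_{i+1} - U^*\| < \epsilon

So the division by gamma and the factor (1 - gamma) are exactly what make the final inequality come out to epsilon: terminating once the Bellman residual drops below error * (1 - gamma) / gamma guarantees that the returned utilities are within error of the true ones. As for how large error should be, it is a free choice of how accurate you want the utilities to be; smaller values simply cost more iterations, since the residual shrinks by roughly a factor of gamma per sweep.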
