
Understanding The Value Iteration Algorithm of Markov Decision Processes

In learning about MDPs I am having trouble with value iteration. Conceptually this example is very simple and makes sense:

If you have a 6-sided die, and you roll a 4, 5, or 6 you keep that amount in $, but if you roll a 1, 2, or 3 you lose your bankroll and end the game.

In the beginning you have $0 so the choice between rolling and not rolling is:

k = 1
If I roll: 1/6*0 + 1/6*0 + 1/6*0 + 1/6*4 + 1/6*5 + 1/6*6 = 2.5
If I don't roll: 0
Since 2.5 > 0, I should roll
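
To make that arithmetic concrete, the k = 1 computation is only a couple of lines of Python (a sketch; the names are mine):

# Expected value of one roll starting from $0:
# sides 1-3 end the game with nothing, sides 4-6 pay their face value.
roll = sum((s if s >= 4 else 0) / 6.0 for s in range(1, 7))
print(roll)  # 2.5, so rolling beats not rolling (0)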

k = 2:
If I roll and get a 4:
    If I roll again: 4 + 1/6*(-4) + 1/6*(-4) + 1/6*(-4) + 1/6*4 + 1/6*5 + 1/6*6 = 4.5
    If I don't roll: 4
    Since 4.5 is greater than 4, I should roll

If I roll and get a 5:
    If I roll again: 5 + 1/6*(-5) + 1/6*(-5) + 1/6*(-5) + 1/6*4 + 1/6*5 + 1/6*6 = 5
    If I don't roll: 5
    Since the difference is 0 I should not roll

If I roll and get a 6:
    If I roll again: 6 + 1/6*(-6) + 1/6*(-6) + 1/6*(-6) + 1/6*4 + 1/6*5 + 1/6*6 = 5.5
    If I don't roll: 6
    Since the difference is -0.5 I should not roll
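
The k = 2 cases can be checked the same way (again just a sketch, with names of my choosing):

# Value of rolling again when you already hold a balance B:
# with probability 1/2 you lose B, otherwise you add 4, 5 or 6.
for B in (4, 5, 6):
    roll_again = B + 3 * (1.0/6) * (-B) + (1.0/6) * (4 + 5 + 6)
    print(B, roll_again, "roll" if roll_again > B else "don't roll")
# prints 4.5 (roll), 5.0 (don't roll), 5.5 (don't roll)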

What I am having trouble with is converting that into Python code. It's not that I am bad at Python; rather, I think my understanding of the pseudocode may be wrong, even though the Bellman equation itself makes sense to me.

I borrowed the Berkeley code for value iteration and modified it to:

isBadSide = [1,1,1,0,0,0]

def R(s):
    if isBadSide[s-1]:
        return -s
    return s

def T(s, a, N):
    return [(1./N, s)]

def value_iteration(N, epsilon=0.001):
    "Solving an MDP by value iteration. [Fig. 17.4]"
    U1 = dict([(s, 0) for s in range(1, N+1)])
    while True:
        U = U1.copy()
        delta = 0
        for s in range(1, N+1):
            U1[s] = R(s) + max([sum([p * U[s1] for (p, s1) in T(s, a, N)])
                                        for a in ('s', 'g',)])

            delta = max(delta, abs(U1[s] - U[s]))

        if delta < epsilon:
            return U

print(value_iteration(6))
# {1: -1.1998456790123457, 2: -2.3996913580246915, 3: -3.599537037037037, 4: 4.799382716049383, 5: 5.999228395061729, 6: 7.199074074074074}

This gives the wrong answer. Where is the bug in this code? Or is it an issue with my understanding of the algorithm?

Let B be your current balance.

If you choose to roll, the expected reward is 2.5 - B * 0.5.

If you choose not to roll, the expected reward is 0.

So, the policy is this: If B < 5, roll. Otherwise, don't.

And the expected reward on each step when following that policy is V = max(0, 2.5 - B * 0.5).
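
That one-step expected reward is easy to check numerically (a sketch; step_reward is my own helper, not part of any library):

def step_reward(B):
    # One more roll with balance B: lose B with probability 1/2,
    # gain 4, 5 or 6 with probability 1/6 each.
    return 0.5 * (-B) + (1.0/6) * (4 + 5 + 6)

for B in (0, 4, 5, 6):
    print(B, max(0, step_reward(B)))  # 2.5, 0.5, 0.0, 0.0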


Now, if you want to express it in terms of the Bellman equation, you need to incorporate the balance into the state.

Let the state <Balance, GameIsOver> consist of the current balance and the flag that defines whether the game is over.

  • Action stop:
    • turns the state <B, false> into <B, true>
  • Action roll:
    • turns <B, false> into <0, true> with probability 1/2
    • turns <B, false> into <B + 4, false> with probability 1/6
    • turns <B, false> into <B + 5, false> with probability 1/6
    • turns <B, false> into <B + 6, false> with probability 1/6
  • No action can turn <B1, true> into <B2, false>

Using the notation from here:

π(<B, false>) = "roll", if B < 5

π(<B, false>) = "stop", if B >= 5

V(<B, false>) = 2.5 - B * 0.5, if B < 5

V(<B, false>) = 0, if B >= 5
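
To make that concrete, here is a minimal value-iteration sketch over the augmented state <Balance, GameIsOver>. All names are mine, and MAX_BALANCE is an assumption used to keep the state space finite (balances at or above the cap are simply treated as terminal):

MAX_BALANCE = 30  # assumption: cap the balance so the state space is finite

def transitions(B, action):
    # List of (probability, next_balance, game_over, reward) from state <B, false>.
    if action == 'stop':
        return [(1.0, B, True, 0)]
    # 'roll': lose the balance with probability 1/2, otherwise add 4, 5 or 6.
    outcomes = [(0.5, 0, True, -B)]
    for gain in (4, 5, 6):
        nb = B + gain
        outcomes.append((1.0/6, nb, nb >= MAX_BALANCE, gain))
    return outcomes

def value_iteration(epsilon=0.001):
    # V[B] is the expected future gain from state <B, false> under the best policy.
    V = {B: 0.0 for B in range(MAX_BALANCE)}
    while True:
        delta = 0.0
        for B in range(MAX_BALANCE):
            best = max(
                sum(p * (r + (0.0 if done else V[nb]))
                    for p, nb, done, r in transitions(B, a))
                for a in ('roll', 'stop'))
            delta = max(delta, abs(best - V[B]))
            V[B] = best
        if delta < epsilon:
            return V

V = value_iteration()
for B in (0, 4, 5, 6, 8):
    print(B, round(V[B], 3))  # a positive value means rolling is still worth it

(One small wrinkle: at B = 0 rolling risks nothing, so the value computed at $0 comes out a bit above 2.5, because it also includes the continuation value of landing on $4; for the other reachable balances the closed form above matches.)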
