
Dynamic programming with Python: question about a basic problem

I'm just starting with the Sutton and Barto book and was trying to replicate some of its easy problems, using the code from here.

I changed the map like so:

def print_board(agent_position):
    fields = list(range(16))        # 4x4 grid, states 0..15
    wall = [1, 2, 3, 8, 9, 10]      # wall states
    board = "-----------------\n"
    for i in range(0, 16, 4):
        line = fields[i:i+4]
        for field in line:
            if field == agent_position:
                board += "| A "     # agent
            elif field == fields[0]:
                board += "| X "     # goal / terminal state 0
            elif field in wall:
                board += "| W "     # wall
            else:
                board += "|   "
        board += "|\n"
        board += "-----------------\n"
    print(board)
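With the agent placed in the bottom-left corner, for example, the board renders like this (X marks the goal state 0, W the wall states 1, 2, 3, 8, 9 and 10, A the agent):

print_board(12)

-----------------
| X | W | W | W |
-----------------
|   |   |   |   |
-----------------
| W | W | W |   |
-----------------
| A |   |   |   |
-----------------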

The agent should navigate this small maze to the goal state. I changed the rewards for the "wall states" from -1 to -10, and changed the policy evaluation code like this:

def iterative_policy_evaluation(policy, theta=0.001, discount_rate=1):
    V_s = {i: 0 for i in range(16)}             # initialise V(s) = 0 for every state
    probability_map = create_probability_map()  # p(s', r | s, a) from the example
    wall = [1, 2, 3, 8, 9, 10]

    delta = 100                                 # force at least one sweep
    while delta >= theta:                       # sweep until the largest update is below theta
        delta = 0
        for state in range(16):
            v = V_s[state]                      # remember the old value

            total = 0
            for action in ["N", "E", "S", "W"]:
                action_total = 0
                for state_prime in range(16):
                    # stepping into a wall state is rewarded -10, everything else -1
                    if state_prime not in wall:
                        action_total += probability_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                    else:
                        action_total += probability_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])

                total += policy[state][action] * action_total
            V_s[state] = round(total, 1)
            delta = max(delta, abs(v - V_s[state]))
    return V_s
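For completeness, this is roughly how the evaluation is driven; the starting policy below (equiprobable moves in every non-terminal state, no actions in the terminal state 0) is my assumption based on the linked example, not code from it:

# Hypothetical driver, assuming the example's equiprobable starting policy
policy = {s: {'N': 0.25, 'E': 0.25, 'S': 0.25, 'W': 0.25} for s in range(1, 16)}
policy[0] = {'N': 0.0, 'E': 0.0, 'S': 0.0, 'W': 0.0}  # terminal state: no actions
V_s = iterative_policy_evaluation(policy)
print(V_s)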

I left everything else the same as in the example. But unfortunately, the state values it produces are sub-optimal:

State Value: {0: 0.0, 1: -1.0, 2: -11.0, 3: -9.5, 4: -1.0, 5: -6.5, 6: -7.5, 7: -8.5, 8: -2.0, 9: -12.0, 10: -8.5, 11: -14.0, 12: -12.0, 13: -17.5, 14: -18.5, 15: -15.0}

Clearly the value of, for example, the farthest state (12) should be around -8, but it's -12, and so on. Why does the agent insist on going through the wall when less costly policies exist? What am I missing here?

EDIT: The probability map looks like this:

[state_prime, reward, state, action] probability

(0, -1, 0, 'N') 1
(0, -1, 0, 'E') 1
(0, -1, 0, 'S') 1
(0, -1, 0, 'W') 1
(1, -10, 1, 'N') 1
(2, -10, 1, 'E') 1
(5, -1, 1, 'S') 1
(0, -1, 1, 'W') 1
(2, -10, 2, 'N') 1
(3, -10, 2, 'E') 1
(6, -1, 2, 'S') 1
(1, -10, 2, 'W') 1
(3, -10, 3, 'N') 1
(3, -10, 3, 'E') 1
(7, -1, 3, 'S') 1
(2, -10, 3, 'W') 1
(0, -1, 4, 'N') 1
(5, -1, 4, 'E') 1
(8, -10, 4, 'S') 1
(4, -1, 4, 'W') 1
(1, -10, 5, 'N') 1
(6, -1, 5, 'E') 1
(9, -10, 5, 'S') 1
(4, -1, 5, 'W') 1
(2, -10, 6, 'N') 1
(7, -1, 6, 'E') 1
(10, -10, 6, 'S') 1
(5, -1, 6, 'W') 1
(3, -10, 7, 'N') 1
(7, -1, 7, 'E') 1
(11, -1, 7, 'S') 1
(6, -1, 7, 'W') 1
(4, -1, 8, 'N') 1
(9, -10, 8, 'E') 1
(12, -1, 8, 'S') 1
(8, -10, 8, 'W') 1
(5, -1, 9, 'N') 1
(10, -10, 9, 'E') 1
(13, -1, 9, 'S') 1
(8, -10, 9, 'W') 1
(6, -1, 10, 'N') 1
(11, -1, 10, 'E') 1
(14, -1, 10, 'S') 1
(9, -10, 10, 'W') 1
(7, -1, 11, 'N') 1
(11, -1, 11, 'E') 1
(15, -1, 11, 'S') 1
(10, -10, 11, 'W') 1
(8, -10, 12, 'N') 1
(13, -1, 12, 'E') 1
(12, -1, 12, 'S') 1
(12, -1, 12, 'W') 1
(9, -10, 13, 'N') 1
(14, -1, 13, 'E') 1
(13, -1, 13, 'S') 1
(12, -1, 13, 'W') 1
(10, -10, 14, 'N') 1
(15, -1, 14, 'E') 1
(14, -1, 14, 'S') 1
(13, -1, 14, 'W') 1
(11, -1, 15, 'N') 1
(15, -1, 15, 'E') 1
(15, -1, 15, 'S') 1
(14, -1, 15, 'W') 1

Following up on xjcl's question made me think about the concept of a wall. It turns out my wall was an unnatural "one-sided" wall, with a high penalty for entering but not for leaving. Fixing this in the probability map yielded the desired result. Thank you, xjcl.
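I don't have the exact create_probability_map here, but the "two-sided" fix amounts to charging -10 whenever either end of a transition is a wall state. A hypothetical sketch of such a map (deterministic moves on the 4x4 grid, terminal self-loop at state 0, every (s', r, s, a) combination keyed so the evaluation loop can index it directly):

def create_symmetric_probability_map():
    # Hypothetical rebuild of p(s', r | s, a) with a two-sided wall: the -10
    # reward applies when leaving a wall state as well as when entering one.
    wall = [1, 2, 3, 8, 9, 10]
    moves = {'N': -4, 'S': 4, 'E': 1, 'W': -1}
    p = {}
    for state in range(16):
        for action, step in moves.items():
            if state == 0:                      # terminal state: stay put
                next_state = 0
            else:
                next_state = state + step
                off_grid = (
                    next_state < 0 or next_state > 15
                    or (action == 'E' and state % 4 == 3)
                    or (action == 'W' and state % 4 == 0)
                )
                if off_grid:
                    next_state = state          # bumping an edge keeps you in place
            reward = -10 if (state in wall or next_state in wall) else -1
            for state_prime in range(16):
                for r in (-1, -10):
                    hit = (state_prime == next_state and r == reward)
                    p[(state_prime, r, state, action)] = 1 if hit else 0
    return p

Note that the -1/-10 branch in the evaluation (and in create_greedy_policy below) then has to mirror the same rule, i.e. use the -10 key whenever state or state_prime is a wall; otherwise the wall-leaving transitions are looked up under the wrong reward and contribute nothing.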

UPDATE: On further examination, the policy improvement part of the example code turned out to be a simplified version of the algorithm, which did not fully take the rewards into account. A full implementation of the book's algorithm made everything work!

def create_greedy_policy(V_s, discount_rate=1):
    policy = {}
    probability_map = create_probability_map()  # p(s', r | s, a) from the example
    wall = [1, 2, 3, 8, 9, 10]

    for state in range(16):

        if state == 0:  # terminal state: no actions
            policy[state] = {'N': 0.0, 'E': 0.0, 'S': 0.0, 'W': 0.0}

        else:
            actions = {}

            for action in ["N", "E", "S", "W"]:

                # expected return of taking this action, q(s, a)
                action_value = 0
                for state_prime in range(16):
                    if state_prime not in wall:
                        action_value += probability_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                    else:
                        action_value += probability_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])

                actions[action] = action_value

            # split probability evenly over all maximising actions
            max_actions = [a for a, v in actions.items() if v == max(actions.values())]
            policy[state] = {a: 1 / len(max_actions) if a in max_actions else 0.0 for a in ["N", "E", "S", "W"]}

    return policy
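To tie it together, the book's policy iteration scheme just alternates evaluation and greedy improvement until the policy stops changing. The driver below is my own sketch, not code from the example:

def policy_iteration(discount_rate=1):
    # Policy iteration (Sutton & Barto, chapter 4): evaluate, improve, repeat
    # until the greedy policy is stable. The starting policy is an assumption.
    policy = {s: {'N': 0.25, 'E': 0.25, 'S': 0.25, 'W': 0.25} for s in range(1, 16)}
    policy[0] = {'N': 0.0, 'E': 0.0, 'S': 0.0, 'W': 0.0}   # terminal state
    while True:
        V_s = iterative_policy_evaluation(policy, discount_rate=discount_rate)
        new_policy = create_greedy_policy(V_s, discount_rate=discount_rate)
        if new_policy == policy:        # stable policy: done
            return policy, V_s
        policy = new_policy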
