
Dynamic Programming with Python: basic problem

I'm just starting with the Sutton and Barto book. I was trying to replicate some of the easy problems from the book, using the code from here.

I changed the map like so:

def print_board(agent_position):
    fields = list(range(16))       # 4x4 grid, states 0..15
    wall = [1, 2, 3, 8, 9, 10]     # wall states
    board = "-----------------\n"
    for i in range(0, 16, 4):
        line = fields[i:i+4]
        for field in line:
            if field == agent_position:
                board += "| A "    # agent
            elif field == fields[0]:
                board += "| X "    # terminal state (state 0)
            elif field in wall:
                board += "| W "    # wall
            else:
                board += "|   "
        board += "|\n"
        board += "-----------------\n"
    print(board)
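
For reference, calling print_board(13) renders the terminal state X in the corner, the six wall cells W cutting the grid in two, and the agent A below the wall:

-----------------
| X | W | W | W |
-----------------
|   |   |   |   |
-----------------
| W | W | W |   |
-----------------
|   | A |   |   |
-----------------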

This will print out a small maze that the agent should navigate. I changed the reward for the "wall states" from -1 to -10, and changed the iterative policy evaluation code like this:

def iterative_policy_evaluation(policy, theta=0.001, discount_rate=1):
    V_s = {i: 0 for i in range(16)}             # initialise V(s) = 0 for all states
    probablitiy_map = create_probability_map()  # p(s', r | s, a) lookup table
    wall = [1, 2, 3, 8, 9, 10]

    delta = 100                                 # largest value change in a sweep
    while delta >= theta:                       # sweep until values converge
        delta = 0
        for state in range(16):
            v = V_s[state]                      # remember the old value

            total = 0
            for action in ["N", "E", "S", "W"]:
                action_total = 0
                for state_prime in range(16):
                    # entering a wall state is rewarded -10, any other move -1
                    if state_prime not in wall:
                        action_total += probablitiy_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                    else:
                        action_total += probablitiy_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])

                total += policy[state][action] * action_total
            V_s[state] = round(total, 1)        # rounded to one decimal, as in the example
            delta = max(delta, abs(v - V_s[state]))
    return V_s
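
For reference, the backup inside these loops is the iterative policy evaluation update from Chapter 4 of Sutton and Barto,

    V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) · (r + γ V(s'))

where the outer sum over actions is weighted by the policy and the inner sum over state_prime reads the (deterministic) transition probability out of probablitiy_map.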

I left everything else the same as in the example. But unfortunately, my value iteration is producing sub-optimal results:

State Value: {0: 0.0, 1: -1.0, 2: -11.0, 3: -9.5, 4: -1.0, 5: -6.5, 6: -7.5, 7: -8.5, 8: -2.0, 9: -12.0, 10: -8.5, 11: -14.0, 12: -12.0, 13: -17.5, 14: -18.5, 15: -15.0}

Clearly the value of, for example, the farthest state 12 should be -8, but it is -12, and so on. Why does the agent insist on going through the wall, although less costly policies exist? What am I missing here?

EDIT: The probability map looks like this:

(state_prime, reward, state, action)   probability

(0, -1, 0, 'N') 1
(0, -1, 0, 'E') 1
(0, -1, 0, 'S') 1
(0, -1, 0, 'W') 1
(1, -10, 1, 'N') 1
(2, -10, 1, 'E') 1
(5, -1, 1, 'S') 1
(0, -1, 1, 'W') 1
(2, -10, 2, 'N') 1
(3, -10, 2, 'E') 1
(6, -1, 2, 'S') 1
(1, -10, 2, 'W') 1
(3, -10, 3, 'N') 1
(3, -10, 3, 'E') 1
(7, -1, 3, 'S') 1
(2, -10, 3, 'W') 1
(0, -1, 4, 'N') 1
(5, -1, 4, 'E') 1
(8, -10, 4, 'S') 1
(4, -1, 4, 'W') 1
(1, -10, 5, 'N') 1
(6, -1, 5, 'E') 1
(9, -10, 5, 'S') 1
(4, -1, 5, 'W') 1
(2, -10, 6, 'N') 1
(7, -1, 6, 'E') 1
(10, -10, 6, 'S') 1
(5, -1, 6, 'W') 1
(3, -10, 7, 'N') 1
(7, -1, 7, 'E') 1
(11, -1, 7, 'S') 1
(6, -1, 7, 'W') 1
(4, -1, 8, 'N') 1
(9, -10, 8, 'E') 1
(12, -1, 8, 'S') 1
(8, -10, 8, 'W') 1
(5, -1, 9, 'N') 1
(10, -10, 9, 'E') 1
(13, -1, 9, 'S') 1
(8, -10, 9, 'W') 1
(6, -1, 10, 'N') 1
(11, -1, 10, 'E') 1
(14, -1, 10, 'S') 1
(9, -10, 10, 'W') 1
(7, -1, 11, 'N') 1
(11, -1, 11, 'E') 1
(15, -1, 11, 'S') 1
(10, -10, 11, 'W') 1
(8, -10, 12, 'N') 1
(13, -1, 12, 'E') 1
(12, -1, 12, 'S') 1
(12, -1, 12, 'W') 1
(9, -10, 13, 'N') 1
(14, -1, 13, 'E') 1
(13, -1, 13, 'S') 1
(12, -1, 13, 'W') 1
(10, -10, 14, 'N') 1
(15, -1, 14, 'E') 1
(14, -1, 14, 'S') 1
(13, -1, 14, 'W') 1
(11, -1, 15, 'N') 1
(15, -1, 15, 'E') 1
(15, -1, 15, 'S') 1
(14, -1, 15, 'W') 1
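
The listing only shows the probability-1 entries. For reference, here is one way such a table could be generated; this is a sketch of my own, not the code from the linked example. Since the evaluation loop indexes every (state_prime, reward) pair, zero entries are assumed present for all combinations:

def create_probability_map_sketch(wall=(1, 2, 3, 8, 9, 10)):
    moves = {"N": -4, "E": 1, "S": 4, "W": -1}
    # start with probability 0 everywhere, then overwrite the real moves
    pmap = {(s2, r, s, a): 0
            for s2 in range(16) for r in (-1, -10)
            for s in range(16) for a in moves}
    for s in range(16):
        if s == 0:                 # terminal state: absorbing self-loops
            for a in moves:
                pmap[(0, -1, 0, a)] = 1
            continue
        for a, d in moves.items():
            s2 = s + d
            # bumping into the grid edge leaves the agent in place
            if s2 < 0 or s2 > 15 or (a == "E" and s % 4 == 3) or (a == "W" and s % 4 == 0):
                s2 = s
            reward = -10 if s2 in wall else -1
            pmap[(s2, reward, s, a)] = 1
    return pmap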

Following up on xjcl's question made me think about the concept of a wall. It turns out my wall was an unnatural "one-sided" wall, with a high penalty for entering but none for leaving. Fixing this in the probability map yielded the desired result. Thank you, xjcl.
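
In terms of the sketch above, one reading of that fix is to make the reward depend on the state being left as well as the state being entered, so the -10 applies on both sides of the wall:

    reward = -10 if (s2 in wall or s in wall) else -1

Since the reward then no longer follows from state_prime alone, the backups should read it out of the map entry instead of hard-coding -1/-10. A minimal helper (the name expected_return is mine):

def expected_return(probability_map, V_s, state, action, discount_rate=1):
    # q(s, a): sum the map's own (s', r) entries for this state and action
    total = 0
    for (s_prime, reward, s, a), p in probability_map.items():
        if s == state and a == action and p > 0:
            total += p * (reward + discount_rate * V_s[s_prime])
    return total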

UPDATE: On further examination, the policy improvement part of the code from the example turned out to be a simplified version of the algorithm that did not fully take rewards into account. A full implementation of the book's algorithm made everything work!

def create_greedy_policy(V_s, discount_rate=1):
    policy = {}
    probablitiy_map = create_probability_map()  # p(s', r | s, a) lookup table
    wall = [1, 2, 3, 8, 9, 10]                  # wall states

    for state in range(16):

        if state == 0:
            # terminal state: no action is taken
            policy[state] = {'N': 0.0, 'E': 0.0, 'S': 0.0, 'W': 0.0}

        else:
            actions = {}

            for action in ["N", "E", "S", "W"]:

                # full one-step lookahead:
                # q(s, a) = sum over s' of p(s', r | s, a) * (r + discount * V(s'))
                real_action = 0
                for state_prime in range(16):
                    if state_prime not in wall:
                        real_action += probablitiy_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                    else:
                        real_action += probablitiy_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])

                actions[action] = real_action

            # split probability evenly across all maximising actions
            max_actions = [k for k, v in actions.items() if v == max(actions.values())]
            policy[state] = {a: 1 / len(max_actions) if a in max_actions else 0.0 for a in ['N', 'S', 'E', 'W']}

    return policy
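
For completeness, the two functions can be alternated in the standard policy iteration loop. This driver is a sketch: the uniform starting policy and the stop-when-stable test are my assumptions, not code from the example.

# Policy iteration: evaluate the current policy, then act greedily on the
# resulting values, until the greedy policy stops changing.
policy = {s: {'N': 0.25, 'E': 0.25, 'S': 0.25, 'W': 0.25} for s in range(16)}

while True:
    V_s = iterative_policy_evaluation(policy)  # policy evaluation
    new_policy = create_greedy_policy(V_s)     # policy improvement
    if new_policy == policy:                   # stable, so optimal under this model
        break
    policy = new_policy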
