
Dynamic Programming with Python basic problem question

I am just getting started with Sutton and Barto's book. I am trying to reproduce some of the simple problems from the book using the code from here.

I changed the map like this:

def print_board(agent_position):
    fields = list(range(16))   # 4x4 grid, states 0..15 in row-major order
    wall = [1,2,3,8,9,10]      # wall cells, drawn as "W"
    board = "-----------------\n"
    for i in range(0, 16, 4):
        line = fields[i:i+4]
        for field in line:
            if field == agent_position:
                board += "| A "
            elif field == fields[0]:   # state 0 is the goal, drawn as "X"
                board += "| X "
            elif field in wall:
                board += "| W "
            else:
                board += "|   "
        board += "|\n"
        board += "-----------------\n"     
    print(board)

This prints a small maze that the agent is supposed to navigate. I changed the reward for the "wall states" from -1 to -10 and changed the value iteration code to look like this:

def iterative_policy_evaluation(policy, theta=0.001, discount_rate=1):
    V_s = {i: 0 for i in range(16)} # 1.
    probablitiy_map = create_probability_map() # 2.
    wall = [1,2,3,8,9,10]

    delta = 100 # 3.
    while not delta < theta: # 4.
        delta = 0 # 5.
        for state in range(16): # 6.
            v = V_s[state] # 7.
            
            total = 0 # 8.
            for action in ["N", "E", "S", "W"]:
                action_total = 0
                for state_prime in range(16):

                    if state_prime not in wall:
                        action_total += probablitiy_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                    else:
                        action_total += probablitiy_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])
                        
                total += policy[state][action] * action_total  
            V_s[state] = round(total, 1) # 9.
            delta = max(delta, abs(v - V_s[state])) # 10.
    return V_s # 11.

I kept everything else as in the example. But unfortunately my value iteration produces suboptimal results:

State values: {0: 0.0, 1: -1.0, 2: -11.0, 3: -9.5, 4: -1.0, 5: -6.5, 6: -7.5, 7: -8.5, 8: -2.0, 9: -12.0, 10: -8.5, 11: -14.0, 12: -12.0, 13: -17.5, 14: -18.5, 15: -15.0}

Clearly the value of a state such as the most distant state 12 should be 8, but it is 12, and so on. Why does the agent insist on going through the wall even though there is a cheaper policy? What am I missing here?

Edit: The probability map looks like this:

[state_prime, reward, state, action]  probability

(0, -1, 0, 'N') 1
(0, -1, 0, 'E') 1
(0, -1, 0, 'S') 1
(0, -1, 0, 'W') 1
(1, -10, 1, 'N') 1
(2, -10, 1, 'E') 1
(5, -1, 1, 'S') 1
(0, -1, 1, 'W') 1
(2, -10, 2, 'N') 1
(3, -10, 2, 'E') 1
(6, -1, 2, 'S') 1
(1, -10, 2, 'W') 1
(3, -10, 3, 'N') 1
(3, -10, 3, 'E') 1
(7, -1, 3, 'S') 1
(2, -10, 3, 'W') 1
(0, -1, 4, 'N') 1
(5, -1, 4, 'E') 1
(8, -10, 4, 'S') 1
(4, -1, 4, 'W') 1
(1, -10, 5, 'N') 1
(6, -1, 5, 'E') 1
(9, -10, 5, 'S') 1
(4, -1, 5, 'W') 1
(2, -10, 6, 'N') 1
(7, -1, 6, 'E') 1
(10, -10, 6, 'S') 1
(5, -1, 6, 'W') 1
(3, -10, 7, 'N') 1
(7, -1, 7, 'E') 1
(11, -1, 7, 'S') 1
(6, -1, 7, 'W') 1
(4, -1, 8, 'N') 1
(9, -10, 8, 'E') 1
(12, -1, 8, 'S') 1
(8, -10, 8, 'W') 1
(5, -1, 9, 'N') 1
(10, -10, 9, 'E') 1
(13, -1, 9, 'S') 1
(8, -10, 9, 'W') 1
(6, -1, 10, 'N') 1
(11, -1, 10, 'E') 1
(14, -1, 10, 'S') 1
(9, -10, 10, 'W') 1
(7, -1, 11, 'N') 1
(11, -1, 11, 'E') 1
(15, -1, 11, 'S') 1
(10, -10, 11, 'W') 1
(8, -10, 12, 'N') 1
(13, -1, 12, 'E') 1
(12, -1, 12, 'S') 1
(12, -1, 12, 'W') 1
(9, -10, 13, 'N') 1
(14, -1, 13, 'E') 1
(13, -1, 13, 'S') 1
(12, -1, 13, 'W') 1
(10, -10, 14, 'N') 1
(15, -1, 14, 'E') 1
(14, -1, 14, 'S') 1
(13, -1, 14, 'W') 1
(11, -1, 15, 'N') 1
(15, -1, 15, 'E') 1
(15, -1, 15, 'S') 1
(14, -1, 15, 'W') 1
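
For reference, here is a minimal sketch of a map builder that reproduces the non-zero entries listed above (the tutorial's actual create_probability_map() is not shown in the post, so this is only an assumed equivalent). The dictionary is keyed by (state_prime, reward, state, action); to stay compatible with the direct indexing in iterative_policy_evaluation, it also stores probability 0 for every other combination.

def create_probability_map():
    # Sketch only: deterministic 4x4 grid, states 0..15 in row-major order,
    # state 0 is terminal, entering a wall cell costs -10, everything else -1.
    wall = [1,2,3,8,9,10]
    terminal = 0
    p = {}
    for s in range(16):
        row, col = divmod(s, 4)
        # deterministic successor per action; stepping off the grid
        # leaves the agent where it is
        moves = {
            "N": s - 4 if row > 0 else s,
            "S": s + 4 if row < 3 else s,
            "E": s + 1 if col < 3 else s,
            "W": s - 1 if col > 0 else s,
        }
        for action, next_s in moves.items():
            if s == terminal:
                next_s = terminal                        # terminal state loops on itself
            true_reward = -10 if next_s in wall else -1  # the "one-sided" wall penalty
            for s_prime in range(16):
                for reward in (-1, -10):
                    hit = (s_prime == next_s and reward == true_reward)
                    p[(s_prime, reward, s, action)] = 1 if hit else 0
    return p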

Following up on xjcl's questions got me thinking about the concept of the wall. It turns out my wall was an unnatural "one-sided" wall, which heavily penalizes entering it but not leaving it. Fixing this in the probability map produces the desired result. Thanks xjcl!
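
The post does not show the corrected probability map, but one plausible reading of a "two-sided" wall (purely an assumption, not necessarily the author's exact fix) is to charge -10 whenever a step starts or ends in a wall cell. In a map builder like the sketch above, only the reward rule would change:

# Hypothetical symmetric-wall rule: a step costs -10 if it enters OR leaves a wall cell.
true_reward = -10 if (s in wall or next_s in wall) else -1

Note that iterative_policy_evaluation above chooses between the -1 and -10 dictionary keys based only on state_prime, so that lookup has to be kept consistent with whatever reward rule the map actually stores.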

Update: On further inspection, the policy improvement part of the code in the example turned out to be a simplified version of the algorithm that does not fully take the rewards into account. A full implementation of the algorithm from the book makes everything work!

def create_greedy_policy(V_s, discount_rate=1):
    s_to_sprime = create_state_to_state_prime_verbose_map()
    policy = {}
    probablitiy_map = create_probability_map() # 2.
    wall = [1,2,3,8,9,10] # wall states, needed for the reward lookup below

    for state in range(16):

        if state == 0:
            policy[state] = {'N': 0.0, 'E': 0.0, 'S': 0.0, 'W': 0.0}
        
        else:
            actions={}

            for action in ["N", "E", "S", "W"]:

                real_action=0
                for state_prime in range(16):

                    if state_prime not in wall:
                        action_value = probablitiy_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                        if action_value != 0:
                            real_action += action_value
                    else:
                        action_value = probablitiy_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])
                        if action_value != 0:
                            real_action += action_value

                actions.update({action:real_action})

            max_actions = [k for k,v in actions.items() if v == max(actions.values())]

            policy[state] = {a: 1 / len(max_actions) if a in max_actions else 0.0 for a in ['N', 'S', 'E', 'W']}
            
    return policy
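
For completeness, a small driver (not part of the original post, just an assumed way of wiring the pieces together) can alternate the two functions above as policy iteration:

# Hypothetical driver: evaluate, improve greedily, stop when the policy is stable.
policy = {s: {a: 0.25 for a in ["N", "E", "S", "W"]} for s in range(16)}
policy[0] = {a: 0.0 for a in ["N", "E", "S", "W"]}  # terminal state: no actions

for _ in range(20):  # a handful of sweeps is plenty on a 4x4 grid
    V_s = iterative_policy_evaluation(policy)
    new_policy = create_greedy_policy(V_s)
    if new_policy == policy:  # greedy policy no longer changes -> done
        break
    policy = new_policy

print(V_s)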
