Dynamic Programming with Python basic problem question
I am just getting started with Sutton and Barto's book. I am trying to replicate some of the simple problems from the book using the code from here.
I changed the map like this:
def print_board(agent_position):
    fields = list(range(16))
    wall = [1, 2, 3, 8, 9, 10]
    board = "-----------------\n"
    for i in range(0, 16, 4):
        line = fields[i:i+4]
        for field in line:
            if field == agent_position:
                board += "| A "
            elif field == fields[0]:
                board += "| X "
            elif field in wall:
                board += "| W "
            else:
                board += "|   "
        board += "|\n"
    board += "-----------------\n"
    print(board)
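For reference, here is a minimal self-contained sketch of the maze this renders (a return-a-string variant of the function above, assuming the else branch pads each cell with three spaces so the columns line up):

```python
def render_board(agent_position):
    """Return the maze as a string: X = goal, W = wall, A = agent."""
    wall = [1, 2, 3, 8, 9, 10]
    board = "-----------------\n"
    for i in range(0, 16, 4):
        for field in range(i, i + 4):
            if field == agent_position:
                board += "| A "
            elif field == 0:
                board += "| X "
            elif field in wall:
                board += "| W "
            else:
                board += "|   "
        board += "|\n"
    board += "-----------------\n"
    return board

print(render_board(12))
# -----------------
# | X | W | W | W |
# |   |   |   |   |
# | W | W | W |   |
# | A |   |   |   |
# -----------------
```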
This prints the small maze the agent is supposed to navigate. I changed the reward for the "wall states" from -1 to -10, and changed the value-iteration code to look like this:
def iterative_policy_evaluation(policy, theta=0.001, discount_rate=1):
    V_s = {i: 0 for i in range(16)}  # 1.
    probablitiy_map = create_probability_map()  # 2.
    wall = [1, 2, 3, 8, 9, 10]
    delta = 100  # 3.
    while not delta < theta:  # 4.
        delta = 0  # 5.
        for state in range(16):  # 6.
            v = V_s[state]  # 7.
            total = 0  # 8.
            for action in ["N", "E", "S", "W"]:
                action_total = 0
                for state_prime in range(16):
                    if state_prime not in wall:
                        action_total += probablitiy_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                    else:
                        action_total += probablitiy_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])
                total += policy[state][action] * action_total
            V_s[state] = round(total, 1)  # 9.
            delta = max(delta, abs(v - V_s[state]))  # 10.
    return V_s  # 11.
I left everything else as in the example. But unfortunately my value iteration produces suboptimal results:
State values: {0: 0.0, 1: -1.0, 2: -11.0, 3: -9.5, 4: -1.0, 5: -6.5, 6: -7.5, 7: -8.5, 8: -2.0, 9: -12.0, 10: -8.5, 11: -14.0, 12: -12.0, 13: -17.5, 14: -18.5, 15: -15.0}
Clearly the value of a state like the farthest state 12 should be 8, but it is 12, and so on. Why does the agent insist on going through the wall even though a cheaper policy exists? What am I missing here?
Edit: the probability map looks like this:
(state_prime, reward, state, action)  probability
(0, -1, 0, 'N') 1
(0, -1, 0, 'E') 1
(0, -1, 0, 'S') 1
(0, -1, 0, 'W') 1
(1, -10, 1, 'N') 1
(2, -10, 1, 'E') 1
(5, -1, 1, 'S') 1
(0, -1, 1, 'W') 1
(2, -10, 2, 'N') 1
(3, -10, 2, 'E') 1
(6, -1, 2, 'S') 1
(1, -10, 2, 'W') 1
(3, -10, 3, 'N') 1
(3, -10, 3, 'E') 1
(7, -1, 3, 'S') 1
(2, -10, 3, 'W') 1
(0, -1, 4, 'N') 1
(5, -1, 4, 'E') 1
(8, -10, 4, 'S') 1
(4, -1, 4, 'W') 1
(1, -10, 5, 'N') 1
(6, -1, 5, 'E') 1
(9, -10, 5, 'S') 1
(4, -1, 5, 'W') 1
(2, -10, 6, 'N') 1
(7, -1, 6, 'E') 1
(10, -10, 6, 'S') 1
(5, -1, 6, 'W') 1
(3, -10, 7, 'N') 1
(7, -1, 7, 'E') 1
(11, -1, 7, 'S') 1
(6, -1, 7, 'W') 1
(4, -1, 8, 'N') 1
(9, -10, 8, 'E') 1
(12, -1, 8, 'S') 1
(8, -10, 8, 'W') 1
(5, -1, 9, 'N') 1
(10, -10, 9, 'E') 1
(13, -1, 9, 'S') 1
(8, -10, 9, 'W') 1
(6, -1, 10, 'N') 1
(11, -1, 10, 'E') 1
(14, -1, 10, 'S') 1
(9, -10, 10, 'W') 1
(7, -1, 11, 'N') 1
(11, -1, 11, 'E') 1
(15, -1, 11, 'S') 1
(10, -10, 11, 'W') 1
(8, -10, 12, 'N') 1
(13, -1, 12, 'E') 1
(12, -1, 12, 'S') 1
(12, -1, 12, 'W') 1
(9, -10, 13, 'N') 1
(14, -1, 13, 'E') 1
(13, -1, 13, 'S') 1
(12, -1, 13, 'W') 1
(10, -10, 14, 'N') 1
(15, -1, 14, 'E') 1
(14, -1, 14, 'S') 1
(13, -1, 14, 'W') 1
(11, -1, 15, 'N') 1
(15, -1, 15, 'E') 1
(15, -1, 15, 'S') 1
(14, -1, 15, 'W') 1
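For completeness, the table above can be generated programmatically. This is a hypothetical reconstruction of create_probability_map (the helper is not shown in the original example); note that the reward depends only on the destination cell state_prime, which is exactly the "one-sided wall" behaviour discussed below:

```python
def create_probability_map():
    """Reconstructed (assumed) transition/reward table for the 4x4 maze."""
    wall = [1, 2, 3, 8, 9, 10]
    moves = {'N': -4, 'E': 1, 'S': 4, 'W': -1}
    probability_map = {}
    for state in range(16):
        for action, delta in moves.items():
            state_prime = state + delta
            # bumping into the edge of the 4x4 grid leaves the agent in place
            if (state_prime < 0 or state_prime > 15
                    or (action == 'E' and state % 4 == 3)
                    or (action == 'W' and state % 4 == 0)):
                state_prime = state
            if state == 0:
                state_prime = 0  # terminal state loops back to itself
            # one-sided wall: only the DESTINATION cell determines the penalty
            reward = -10 if state_prime in wall else -1
            probability_map[(state_prime, reward, state, action)] = 1
    return probability_map
```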
Following up on xjcl's questions got me thinking about the wall concept. It turned out my wall was an unnatural "one-sided" wall: entering it was heavily penalised, but leaving it was not. Fixing this in the probability map produces the desired result. Thanks, xjcl.
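The fix amounts to a one-line change in the reward rule; here is a sketch (wall_reward is a hypothetical helper name, not from the original code):

```python
wall = [1, 2, 3, 8, 9, 10]

def wall_reward(state, state_prime):
    """Two-sided wall: penalise a move that starts OR ends in a wall cell."""
    return -10 if (state in wall or state_prime in wall) else -1

# The one-sided version penalised only the destination, so stepping into
# wall cell 8 cost -10 while stepping back out of it cost just -1.
```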
Update: on closer inspection, the policy-improvement part of the code in the example turned out to be a simplified version of the algorithm that does not fully account for the rewards. A full implementation of the book's algorithm makes everything work!
def create_greedy_policy(V_s, discount_rate=1):
    s_to_sprime = create_state_to_state_prime_verbose_map()
    policy = {}
    probablitiy_map = create_probability_map()
    wall = [1, 2, 3, 8, 9, 10]  # was undefined inside this function
    for state in range(16):
        if state == 0:
            policy[state] = {'N': 0.0, 'E': 0.0, 'S': 0.0, 'W': 0.0}
        else:
            actions = {}
            for action in ["N", "E", "S", "W"]:
                real_action = 0
                for state_prime in range(16):
                    if state_prime not in wall:
                        action_value = probablitiy_map[(state_prime, -1, state, action)] * (-1 + discount_rate * V_s[state_prime])
                    else:
                        action_value = probablitiy_map[(state_prime, -10, state, action)] * (-10 + discount_rate * V_s[state_prime])
                    if action_value != 0:
                        real_action += action_value
                actions[action] = real_action
            max_actions = [k for k, v in actions.items() if v == max(actions.values())]
            policy[state] = {a: 1 / len(max_actions) if a in max_actions else 0.0 for a in ['N', 'S', 'E', 'W']}
    return policy
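Putting the pieces together, the outer loop that alternates evaluation and greedy improvement might look like this. This is a self-contained sketch of policy iteration on this maze with the two-sided wall fix applied; the helper names (step, reward, evaluate, greedy) are my own, not the original example's:

```python
WALL = [1, 2, 3, 8, 9, 10]
MOVES = {'N': -4, 'E': 1, 'S': 4, 'W': -1}

def step(state, action):
    """Deterministic move on the 4x4 grid; state 0 is absorbing."""
    if state == 0:
        return 0
    sp = state + MOVES[action]
    off_grid = (sp < 0 or sp > 15
                or (action == 'E' and state % 4 == 3)
                or (action == 'W' and state % 4 == 0))
    return state if off_grid else sp

def reward(state, sp):
    if state == 0:
        return 0  # no further cost once the goal is reached
    return -10 if (state in WALL or sp in WALL) else -1  # two-sided wall

def evaluate(policy, theta=0.001, gamma=1.0, max_sweeps=1000):
    """In-place iterative policy evaluation; sweeps are capped so an
    improper intermediate policy (one that never reaches the goal)
    cannot hang the loop."""
    V = {s: 0.0 for s in range(16)}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(16):
            total = sum(p * (reward(s, step(s, a)) + gamma * V[step(s, a)])
                        for a, p in policy[s].items())
            delta = max(delta, abs(V[s] - total))
            V[s] = total
        if delta < theta:
            break
    return V

def greedy(V, gamma=1.0):
    """Greedy improvement: split probability evenly over the best actions."""
    policy = {}
    for s in range(16):
        q = {a: reward(s, step(s, a)) + gamma * V[step(s, a)] for a in MOVES}
        best = [a for a in q if q[a] == max(q.values())]
        policy[s] = {a: (1 / len(best) if a in best else 0.0) for a in MOVES}
    return policy

policy = {s: {a: 0.25 for a in MOVES} for s in range(16)}  # start uniform
for _ in range(100):
    V = evaluate(policy)
    new_policy = greedy(V)
    if new_policy == policy:  # stable: policy is optimal
        break
    policy = new_policy
```

With this setup the agent routes around the walls: the optimal path from state 12 runs 12, 13, 14, 15, 11, 7, 6, 5, 4, 0 at a cost of -1 per step.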