Repeated utility values in value iteration (Markov Decision Process)

Question

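I am trying to implement the value iteration algorithm for a Markov Decision Process in Python. I have an implementation, but it gives me many repeated utility values. My transition matrix is quite sparse, which may be causing the problem, but I am not sure whether that assumption is correct. How should I fix this? The code may be quite rough; I am very new to value iteration, so please help me identify what is wrong with it. The reference code is here: http://carlo-hamalainen.net/stuff/mdpnotes/ . I used the ipod_mdp.py code file. Here is a snippet of my implementation: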

num_of_states = 470   #total number of states

#initialization
V1 = [0.25] * num_of_states

get_target_index = state_index[(u'48.137654',   u'11.579949')]  #each state is a location

#print "The target index is ", get_target_index

V1[get_target_index] = -100    #assigning least cost to the target state

V2 = [0.0] * num_of_states

policy = [0.0] * num_of_states

count = 0.0

while max([abs(V1[i] - V2[i]) for i in range(num_of_states)]) > 0.001:
    print max([abs(V1[i] - V2[i]) for i in range(num_of_states)])
    print count

    for s in range(num_of_states):   #for each state
        #initialize minimum action to the first action in the list
        min_action = actions_index[actions[0]]   #initialize - get the action index for the first iteration  

        min_action_cost = cost[s, actions_index[actions[0]]]  #initialize the cost

        for w in range(num_of_states):              

            if (s, state_index[actions[0]], w) in transitions:  #if this transition exists in the matrix - non-zero value
                min_action_cost += 0.9 * transitions[s, state_index[actions[0]], w] * V1[w]

            else:
                min_action_cost += 0.9 * 0.001 * V1[w]   #if not - give it a small value of 0.001 instead of 0.0

        #get the minimum action cost for the state
        for a in actions:

            this_cost = cost[s, actions_index[a]]

            for w in range(num_of_states):          

            #   if index_state[w] != 'm': 
                if (s, state_index[a], w) in transitions:
                    this_cost += 0.9 * transitions[s, state_index[a], w] * V1[w]
                else:
                    this_cost += 0.9 * 0.001 * V1[w] 

            if this_cost < min_action_cost:

                min_action = actions_index[a]
                min_action_cost = this_cost

        V2[s] = min_action_cost

        policy[s] = min_action

    V1, V2 = V2, V1    #swap

    count += 1

Thank you very much.

Answer 1

I am not sure I fully understand your code. I will leave my implementation here in case anyone needs it.

import numpy as np

def valueIteration(R, P, discount, threshold):
    V = np.copy(R)
    old_V = np.copy(V)
    error = float("inf")
    while error > threshold:
        old_V, V = (V, old_V)
        max_values = np.dot(P, old_V).max(axis=1)
        np.copyto(V, R + discount * max_values)
        error = np.linalg.norm(V - old_V)
    return V

S = 30
A = 4
R = np.zeros(S)
# Goal state is S-2 (reward 1); S-1 is the absorbing dwell state
R[S-2] = 1

P = np.random.rand(S,A,S)
# Goal state goes to dwell state
P[S-2,:,:] = 0
P[S-2,:,S-1] = 1
P[S-1,:,:] = 0
P[S-1,:,S-1] = 1
for s in range(S-2): #goal and dwell states do not need normalization
    for a in range(A):
        P[s,a,:] /= P[s,a,:].sum()
V = valueIteration(R,P,0.97,0.001)
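
For the sparse, cost-minimizing setting described in the question, a dict-based variant may be easier to reason about. The sketch below is only an illustration under assumed data layouts, not the asker's actual data: the function name `sparse_value_iteration`, the `cost[(s, a)]` and `transitions[(s, a)]` structures, and the toy three-state chain are all hypothetical. It skips missing transitions instead of giving them a filler probability of 0.001. With 470 states, such a filler can add up to 0.47 of extra probability mass per state-action pair, so the probabilities no longer sum to 1, which may be one reason many states end up with similar values.

```python
def sparse_value_iteration(num_states, actions, cost, transitions,
                           discount=0.9, threshold=1e-3):
    """Cost-minimizing value iteration over dict-based sparse transitions.

    Assumed (hypothetical) layout:
      cost[(s, a)]        -> immediate cost of taking action a in state s
      transitions[(s, a)] -> list of (next_state, probability) pairs
    A missing (s, a) key means the action is unavailable in state s.
    States with no available action are treated as terminal (value 0).
    """
    V = [0.0] * num_states
    policy = [None] * num_states
    while True:
        new_V = list(V)
        for s in range(num_states):
            best_cost, best_action = None, None
            for a in actions:
                successors = transitions.get((s, a))
                if not successors:
                    continue  # unavailable action: skip it, no filler probability
                # Expected discounted cost using the previous sweep's values.
                q = cost[(s, a)] + discount * sum(p * V[w] for w, p in successors)
                if best_cost is None or q < best_cost:
                    best_cost, best_action = q, a
            if best_action is not None:
                new_V[s] = best_cost
                policy[s] = best_action
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < threshold:
            return V, policy


# Toy chain 0 -> 1 -> 2, where state 2 is the target and has no actions.
actions = ["go", "stay"]
cost = {(0, "go"): 1.0, (0, "stay"): 1.0, (1, "go"): 1.0, (1, "stay"): 1.0}
transitions = {
    (0, "go"): [(1, 1.0)],
    (0, "stay"): [(0, 1.0)],
    (1, "go"): [(2, 1.0)],
    (1, "stay"): [(1, 1.0)],
}
V, policy = sparse_value_iteration(3, actions, cost, transitions)
```

In this toy example each step costs 1, so the values grow with the distance to the target and the policy moves toward it. Before running something like this on real data, it is worth checking that the probabilities stored for each available state-action pair sum to 1.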

Answer 1: 0 2015-08-22 12:44:44