I am trying to implement value iteration for a Markov Decision Process in Python. My implementation produces many repeated values for the utilities. My transition matrix is quite sparse, which may be causing the problem, but I am not sure that assumption is correct. How should I fix this? I am very new to value iteration and the code may be rough, so please help me identify the problems in it. My reference is http://carlo-hamalainen.net/stuff/mdpnotes/ , specifically the ipod_mdp.py file. Here is the relevant part of my implementation:
num_of_states = 470 #total number of states
#initialization
V1 = [0.25] * num_of_states
get_target_index = state_index[(u'48.137654', u'11.579949')] #each state is a location
#print "The target index is ", get_target_index
V1[get_target_index] = -100 #assigning least cost to the target state
V2 = [0.0] * num_of_states
policy = [0.0] * num_of_states
count = 0.0
while max([abs(V1[i] - V2[i]) for i in range(num_of_states)]) > 0.001:
    print max([abs(V1[i] - V2[i]) for i in range(num_of_states)])
    print count
    for s in range(num_of_states): #for each state
        #initialize minimum action to the first action in the list
        min_action = actions_index[actions[0]] #initialize - get the action index for the first iteration
        min_action_cost = cost[s, actions_index[actions[0]]] #initialize the cost
        for w in range(num_of_states):
            if (s, state_index[actions[0]], w) in transitions: #if this transition exists in the matrix - non-zero value
                min_action_cost += 0.9 * transitions[s, state_index[actions[0]], w] * V1[w]
            else:
                min_action_cost += 0.9 * 0.001 * V1[w] #if not - give it a small value of 0.001 instead of 0.0
        #get the minimum action cost for the state
        for a in actions:
            this_cost = cost[s, actions_index[a]]
            for w in range(num_of_states):
                # if index_state[w] != 'm':
                if (s, state_index[a], w) in transitions:
                    this_cost += 0.9 * transitions[s, state_index[a], w] * V1[w]
                else:
                    this_cost += 0.9 * 0.001 * V1[w]
            if this_cost < min_action_cost:
                min_action = actions_index[a]
                min_action_cost = this_cost
        V2[s] = min_action_cost
        policy[s] = min_action
    V1, V2 = V2, V1 #swap
    count += 1
Thank you very much.
I am not sure I understand your code fully. I will just leave my implementation here in case someone needs it.
import numpy as np
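def valueIteration(R, P, discount, threshold):
    # R: reward per state, shape (S,); P: transition probabilities, shape (S, A, S)
    V = np.copy(R)
    old_V = np.copy(V)
    error = float("inf")
    while error > threshold:
        old_V, V = (V, old_V)
        # Expected next-state value for each (state, action), then the best action per state
        max_values = np.dot(P, old_V).max(axis=1)
        np.copyto(V, R + discount * max_values)
        error = np.linalg.norm(V - old_V)
    return V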
S = 30
A = 4
R = np.zeros(S)
# Goal state is S-2; state S-1 is the absorbing dwell state
R[S-2] = 1
P = np.random.rand(S,A,S)
# Goal state goes to dwell state
P[S-2,:,:] = 0
P[S-2,:,S-1] = 1
P[S-1,:,:] = 0
P[S-1,:,S-1] = 1
for s in range(S-2): #goal and dwell states do not need normalization
    for a in range(A):
        P[s,a,:] /= P[s,a,:].sum()
V = valueIteration(R,P,0.97,0.001)
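For a sparse transition table stored as a dictionary, as in the question, the same idea can be sketched without a dense array. This is only a minimal sketch under assumptions: the function name `sparse_value_iteration` is hypothetical, `transitions` is assumed to map `(s, a, w)` to a probability, `cost` is assumed to map `(s, a)` to an immediate cost, and costs are minimized as in the question. Missing transitions are treated as probability 0. Substituting 0.001 for every missing entry can add close to 0.47 of extra weight per state-action pair when there are 470 states, so the rows no longer sum to 1 and every action receives almost the same large offset, which may be one reason the utilities look repeated.

```python
from collections import defaultdict

def sparse_value_iteration(n_states, actions, cost, transitions,
                           discount=0.9, threshold=1e-3):
    """Cost-minimizing value iteration over a sparse transition dict.

    transitions: dict mapping (s, a, w) -> probability (absent keys mean 0).
    cost:        dict mapping (s, a) -> immediate cost.
    Both layouts are assumptions modeled on the question's data.
    """
    # Group successors by (state, action) once, so each sweep only
    # touches transitions that are actually stored.
    successors = defaultdict(list)
    for (s, a, w), p in transitions.items():
        successors[(s, a)].append((w, p))

    V = [0.0] * n_states
    while True:
        new_V = [0.0] * n_states
        policy = [None] * n_states
        for s in range(n_states):
            best_a, best_q = None, float("inf")
            for a in actions:
                expected = sum(p * V[w] for w, p in successors[(s, a)])
                q = cost[(s, a)] + discount * expected
                if q < best_q:
                    best_a, best_q = a, q
            new_V[s] = best_q
            policy[s] = best_a
        # Stop once the largest change in any state's value is small.
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < threshold:
            return V, policy
```

Building a fresh `new_V` on every sweep avoids the swap bookkeeping, and the returned `policy` holds the greedy action for each state under the final values.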