
Enhancement of Agent Training: Q-Learning Taxi-v3

import random

import gym
import numpy as np

# Setup relied on by the original snippet but not shown there (environment,
# Q-table, hyperparameters, tracking lists); the values below are illustrative.
# The code uses the classic gym API, where reset() returns the state and
# step() returns four values.
env = gym.make("Taxi-v3")
q_table = np.zeros([env.observation_space.n, env.action_space.n])

alpha = 0.1    # learning rate (assumed value)
gamma = 0.9    # discount factor (assumed value)
epsilon = 0.1  # exploration rate (assumed value)

reward_list = []
dropout_list = []

episode_number = 10000

for i in range(1, episode_number):

    state = env.reset()

    reward_count = 0
    dropouts = 0

    while True:

        # Epsilon-greedy action selection: explore with probability epsilon,
        # otherwise take the best known action from the Q-table.
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        next_state, reward, done, _ = env.step(action)

        # Q-learning update: blend the old estimate with the bootstrapped target.
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        next_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = next_value

        state = next_state

        # A reward of -10 corresponds to an illegal pickup/dropoff.
        if reward == -10:
            dropouts += 1

        # Accumulate the reward before the termination check so the final
        # (terminal) reward of the episode is also counted.
        reward_count += reward

        if done:
            break

    if i % 10 == 0:
        dropout_list.append(dropouts)
        reward_list.append(reward_count)
        print("Episode: {}, reward {}, wrong dropout {}".format(i, reward_count, dropouts))

I was asked to enhance this code so that it shows a comparison of rewards and penalties. Concretely, the code should display a comparison of the rewards earned before training the agent and after training it, and the two curves have to be drawn on the same (overlapping) graph so they can be compared. I have been trying for days but could not find a solution. I hope someone can help me with this.

If a new or separate piece of code has to be written and the results compared afterwards, please let me know. Thank you.

One remark on the assignment of next_value: the update is sometimes written in its incremental form, next_value = old_value + alpha * (reward + gamma * next_max - old_value). Expanding it shows that this is the same as the (1 - alpha) * old_value + alpha * (reward + gamma * next_max) already used in the code, so the update rule is correct as written.

Regarding the plots you want to make, you can plot the rewards earned by an agent taking random actions and the rewards earned by your agent after reinforcement learning on the same axes, so that the two curves overlap.
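For example, here is a minimal sketch of such an overlapping plot with matplotlib. It assumes rewards_before and rewards_after are lists containing the total reward of each evaluation episode for the random agent and the trained agent, collected for instance with the evaluation loop sketched after the algorithm below:

import matplotlib.pyplot as plt

# rewards_before / rewards_after: total reward per evaluation episode for the
# random agent and the trained agent (assumed to be collected beforehand).
plt.plot(rewards_before, label="before training (random actions)")
plt.plot(rewards_after, label="after training (greedy on Q-table)")
plt.xlabel("evaluation episode")
plt.ylabel("total reward per episode")
plt.legend()
plt.title("Taxi-v3: rewards before vs. after Q-learning")
plt.show()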

It doesn't seem that you have fully understood it: the code you are showing is the learning (training) phase of the agent.

After you run it, q_table contains the quality of each action for each state.

The algorithm for running the trained agent (the progression phase) is then:

initialize environment
done := false
while not done
    s := current state
    a := argmax(q_table[s])
    update s and done by taking the action a
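In Python, a minimal sketch of this progression loop, and of how to collect the two reward lists used in the plot above, could look like this (assuming env and q_table from the training code and the classic gym API, where reset() returns the state and step() returns four values):

import numpy as np

def evaluate(env, policy, n_episodes=100):
    # Return the total reward of each episode obtained by following the given policy.
    totals = []
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(state)                     # a := argmax(q_table[s]) for the trained agent
            state, reward, done, _ = env.step(action)  # update s and done by taking the action a
            episode_reward += reward
        totals.append(episode_reward)
    return totals

# "Before training": random actions; "after training": greedy on the learned Q-table.
rewards_before = evaluate(env, lambda s: env.action_space.sample())
rewards_after = evaluate(env, lambda s: int(np.argmax(q_table[s])))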

I suggest you check this tutorial, which I think covers all of your questions:

https://www.learnpythonwithrune.org/capstone-project-reinforcement-learning-from-scratch-with-python/

Feel free to check the comment section of that post for the discussion about the plots.

I hope I have been helpful

Good luck in your work!
