Enhancement of Agent Training Q Learning Taxi V3
import random
import numpy as np
import gym

env = gym.make("Taxi-v3").env
q_table = np.zeros([env.observation_space.n, env.action_space.n])

# hyperparameters (example values; use your own)
alpha, gamma, epsilon = 0.1, 0.9, 0.1
reward_list, dropout_list = [], []

episode_number = 10000
for i in range(1, episode_number):
    state = env.reset()
    reward_count = 0
    dropouts = 0
    while True:
        # epsilon-greedy action selection: explore vs. exploit
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        next_state, reward, done, _ = env.step(action)

        # Q-learning update
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

        state = next_state
        if reward == -10:  # illegal pickup/dropoff penalty in Taxi-v3
            dropouts += 1
        reward_count += reward
        if done:
            break

    if i % 10 == 0:
        dropout_list.append(dropouts)
        reward_list.append(reward_count)
        print("Episode: {}, reward {}, wrong dropout {}".format(i, reward_count, dropouts))
I was required to enhance this code to showcase a comparison of rewards and penalties. Specifically, the code should display a comparison of the rewards earned before training the agent and after training it. The plotted graphs must overlap to show the comparison, but I could not find a way to do this. I have been trying for days without finding the solution I am looking for, and I hope someone can help me.
If it is necessary to write new or separate code and then compare the results, please let me know. Thank you.
I think there is a missing term in the assignment of next_value. Written in the incremental TD form, the update should be: next_value = old_value + alpha * (reward + gamma * next_max - old_value)
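For what it's worth, the interpolation form used in the question's training loop and the incremental TD form Q ← Q + α(r + γ·max Q′ − Q) are algebraically identical, which a quick check with made-up values (not from the question) confirms:

```python
# Hypothetical values for the check (not taken from the question's run)
alpha, gamma = 0.1, 0.9
old_value, reward, next_max = 2.0, -1.0, 3.0

# Interpolation form used in the question's training loop
interpolated = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
# Incremental TD form: Q <- Q + alpha * (target - Q)
incremental = old_value + alpha * (reward + gamma * next_max - old_value)

print(abs(interpolated - incremental) < 1e-12)  # prints True
```

So both spellings of the update learn the same Q-values; the choice between them is purely stylistic.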
Regarding the plots you want to make, you can plot the rewards earned by an agent taking random actions on the same axes as the rewards earned by your agent after reinforcement learning.
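One way to get the overlapping graph is a single matplotlib axes with both reward series drawn on it. This is only a sketch: `plot_reward_comparison` is a name I made up, and the reward lists in the example are placeholders for the before/after data you collect yourself:

```python
import matplotlib.pyplot as plt

def plot_reward_comparison(random_rewards, trained_rewards):
    """Overlay per-episode rewards before and after training on one axes."""
    fig, ax = plt.subplots()
    ax.plot(random_rewards, label="before training (random actions)")
    ax.plot(trained_rewards, label="after training (greedy policy)")
    ax.set_xlabel("episode")
    ax.set_ylabel("total reward")
    ax.legend()
    return fig

# Placeholder data; pass your own collected reward lists instead
fig = plot_reward_comparison([-200, -190, -210], [8, 7, 9])
fig.savefig("reward_comparison.png")
```

Replace the `savefig` call with `plt.show()` if you want the window to pop up interactively instead of writing a file.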
I am not sure I understood correctly, but the code you are showing is the learning phase of the agent.
After you run it, q_table contains the quality of each action with respect to each state.
The algorithm for running the trained agent is then:
initialize environment
done := false
while not done:
    s := current state
    a := argmax(q_table[s])
    take action a, then update s and done
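The loop above could be sketched in Python like this (a sketch assuming the same Taxi-v3 `env` and learned `q_table` as in the question; `run_greedy_episode` and `max_steps` are names I made up):

```python
import numpy as np

def run_greedy_episode(env, q_table, max_steps=200):
    """Roll out one episode, always taking the greedy action from the Q-table."""
    state = env.reset()
    total_reward = 0
    done = False
    steps = 0
    while not done and steps < max_steps:
        action = np.argmax(q_table[state])       # best known action for this state
        state, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward
```

Running this for many episodes after training, and comparing against the same loop with `env.action_space.sample()` instead of the argmax, gives you the two reward series to plot against each other.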
I suggest you check this tutorial, which I think covers all of your questions:
https://www.learnpythonwithrune.org/capstone-project-reinforcement-learning-from-scratch-with-python/
Feel free to check the comment section of that post for the concerns regarding the plots.
I hope I have been helpful. Good luck with your work!