
Reinforcement Learning - How does an Agent know which action to pick?

I'm trying to understand Q-Learning.

The basic update formula:

Q(st, at) += a[rt+1 + d·max_a Q(st+1, a) - Q(st, at)]
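In code, my understanding of the update is roughly this (just a sketch, assuming a tabular Q stored as a Python dict keyed by (state, action); alpha and gamma play the roles of a and d above):

    def q_update(Q, state, action, reward, next_state, next_actions, alpha=0.1, gamma=0.9):
        # value of the best action available from the next state (0.0 if terminal)
        best_next = max((Q.get((next_state, a), 0.0) for a in next_actions), default=0.0)
        # temporal-difference error: r + gamma * max_a Q(s', a) - Q(s, a)
        td_error = reward + gamma * best_next - Q.get((state, action), 0.0)
        # move the current estimate a fraction alpha towards the target
        Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error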

I understand the formula, and what it does, but my question is:

How does the agent know to choose Q(st, at)?

I understand that an agent follows some policy π, but how do you create this policy in the first place?

  • My agents are playing checkers, so I am focusing on model-free algorithms.
  • All the agent knows is the current state it is in.
  • I understand that when it performs an action, you update the utility, but how does it know to take that action in the first place?

At the moment I have:

  • Check each move you could make from that state.
  • Pick whichever move has the highest utility.
  • Update the utility of the move made.

However, this greedy loop (sketched below) doesn't really solve much; you still get stuck in local minima/maxima.
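Concretely, my selection step is just this (a sketch; legal_moves stands in for my checkers move generator, and Q is the same dict-based table as in the sketch above):

    def pick_move_greedy(Q, state, legal_moves):
        # always take the move with the highest current utility estimate,
        # which is exactly what can trap the agent in a local optimum
        return max(legal_moves, key=lambda move: Q.get((state, move), 0.0))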

So, just to round things off, my main question is:

How, for an agent that knows nothing and is using a model-free algorithm, do you generate an initial policy, so it knows which action to take?

That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once you have already learned the values. The most common policy to use during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1-ε and a random action with probability ε.
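For example, ε-greedy selection can be as simple as this (a minimal sketch, assuming the same dict-based Q-table keyed by (state, action) as in the question; the value of ε here is just an illustration):

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # with probability epsilon, explore: pick a random legal action
        if random.random() < epsilon:
            return random.choice(actions)
        # otherwise exploit: pick the highest-valued action seen so far
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

The values can be initialised arbitrarily (e.g. all zeros), so this gives the agent a usable policy from the very first move; the random exploration generates the experience that the update formula then uses to improve the estimates.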

