Reinforcement Learning - How does an Agent know which action to pick?
I'm trying to understand Q-Learning.
The basic update formula:
Q(s_t, a_t) += α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
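In code, that update looks roughly like this (a minimal sketch assuming a tabular Q stored as a NumPy array indexed by [state, action]; `alpha` and `gamma` stand for the learning rate and discount factor, and the names are only illustrative):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped estimate of the return
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
```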
I understand the formula, and what it does, but my question is:
How does the agent know to choose Q(s_t, a_t)?
I understand that an agent follows some policy π, but how do you create this policy in the first place?
At the moment I have:
However, this doesn't really solve much; you still get stuck in local minima/maxima.
So, just to round things off, my main question is:
How, for an agent that knows nothing and is using a model-free algorithm, do you generate an initial policy so it knows which action to take?
That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once you have already learned the values. The most common policy for use during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1−ε and a random action with probability ε.
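A minimal sketch of ε-greedy action selection, assuming the same tabular Q indexed as Q[state, action] (the function name and parameters are illustrative, not part of the question):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a random action (explore), otherwise the greedy one (exploit)."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(np.argmax(Q[s]))              # exploit: highest-valued action in state s
```

During learning the agent picks actions with this policy and applies the Q-update after each transition; once the values have converged, setting ε to 0 recovers the greedy policy.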