Reinforcement Learning - How does an Agent know which action to pick?
I'm trying to understand Q-Learning.
The basic update formula:
Q(s_t, a_t) += α [ r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
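In code, that update looks roughly like this (a minimal sketch assuming a tabular Q stored as a NumPy array indexed by [state, action]; `alpha` and `gamma` stand for the learning rate and discount factor, and the names are only illustrative):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped estimate of the return
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s,a) toward the target
```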
I understand the formula, and what it does, but my question is:
How does the agent know to choose Q(s_t, a_t)?
I understand that an agent follows some policy π, but how do you create this policy in the first place?
At the moment I have:
However, this doesn't really solve much; you still get stuck in local minima/maxima.
So, just to round things off, my main question is:
How, for an agent that knows nothing and is using a model-free algorithm, do you generate an initial policy so it knows which action to take?
That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once you have already learned the values. The most common policy for use during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1−ε and a random action with probability ε.
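A minimal sketch of ε-greedy action selection, assuming the same tabular Q indexed as Q[state, action] (the function name and parameters are illustrative, not part of the question):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a random action (explore), otherwise the greedy one (exploit)."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(np.argmax(Q[s]))              # exploit: highest-valued action in state s
```

During learning the agent picks actions with this policy and applies the Q-update after each transition; once the values have converged, setting ε to 0 recovers the greedy policy.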