
Reinforcement Learning or Supervised Learning?

If reinforcement learning (RL) algorithms need a large number of iterations in a simulated environment before they work in the real world, why don't we use the same simulated environment to generate labeled data and then use supervised learning methods instead of RL?

The reason is that the two fields have a fundamental difference:

One tries to replicate previous results, and the other tries to be better than previous results.

There are 4 fields in machine learning:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning

Let's talk about the two fields you asked about, and explore them intuitively with a real-life example: archery.

Supervised Learning

For supervised learning, we would observe a master archer in action for maybe a week and record how far they pulled the bow string back, the angle of the shot, etc. Then we go home and build a model. In the most ideal scenario, our model becomes exactly as good as the master archer. It cannot get better, because the loss function in supervised learning is usually MSE or cross-entropy, so we simply try to replicate the feature-to-label mapping. After building the model, we deploy it. And let's say we're extra fancy and make it learn online: we continually take data from the master archer and keep learning to be exactly the same as the master archer.
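To make this concrete, here is a minimal Python sketch of that setup, assuming some hypothetical logged shots from the master archer. Minimizing MSE can only pull the model toward the recorded behaviour, never past it:

    # Minimal supervised-learning sketch: fit the master archer's shots.
    # The data below is hypothetical, purely for illustration.
    import numpy as np

    X = np.array([[0.70, 44.0],          # features: draw length (m), angle (deg)
                  [0.72, 45.0],
                  [0.68, 43.5]])
    y = np.array([0.02, 0.00, 0.05])     # label: distance from bullseye (m)

    # Least-squares linear fit == minimizing MSE against the master's shots.
    A = np.c_[X, np.ones(len(X))]        # add a bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)

    pred = A @ w
    print("MSE vs. the master:", np.mean((pred - y) ** 2))
    # The loss measures agreement with the master, not shot quality,
    # so the model can at best tie the data it imitates.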

The biggest takeaway:

We're trying to replicate the master archer simply because we think he is the best. Therefore, we can never beat him.

Reinforcement Learning

In reinforcement learning, we simply build a model and let it try many different things, and we give it a reward or penalty depending on how far the arrow lands from the bullseye. We are not trying to replicate any behaviour; instead, we try to find our own optimal behaviour. Because of this, we do not build in any bias towards what we think the optimal shooting strategy is.
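As a rough illustration, here is a toy trial-and-error loop in Python. The shoot function is a hypothetical stand-in for the simulator, and the loop is plain random search rather than any named RL algorithm, but it shows learning from reward alone, with no demonstrations:

    import random

    def shoot(draw, angle):
        # Hypothetical simulator: returns distance from the bullseye,
        # best at draw = 0.7 m and angle = 45 degrees.
        return abs(draw - 0.7) * 10 + abs(angle - 45.0) * 0.1

    best = (random.uniform(0.5, 0.9), random.uniform(30.0, 60.0))
    best_reward = -shoot(*best)                  # reward = -distance
    for _ in range(1000):
        # Perturb the current best shot and keep whatever scores better.
        cand = (best[0] + random.gauss(0, 0.02),
                best[1] + random.gauss(0, 1.0))
        if -shoot(*cand) > best_reward:
            best, best_reward = cand, -shoot(*cand)
    print("Discovered shot:", best)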

Because RL does not have any prior knowledge, it may be difficult for RL to converge on hard problems. Therefore, there is a method called apprenticeship learning / imitation learning, where we give the RL agent some trajectories of master archers just so it has a starting point and can begin to converge. But after that, RL will sometimes explore by taking random actions to try to find other, possibly better solutions. This is something supervised learning cannot do, because if you explore using supervised learning, you are basically saying that taking this action in this state is optimal, and then you make your model replicate it. But that is wrong in supervised learning: an exploratory action is not a trustworthy label, and should instead be treated as an outlier in the data.
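A sketch of that warm start, with the same hypothetical shoot stand-in: the search begins from a made-up master demonstration instead of from scratch, and an epsilon parameter keeps occasional random exploration alive:

    import random

    def shoot(draw, angle):                      # hypothetical simulator again
        return abs(draw - 0.7) * 10 + abs(angle - 45.0) * 0.1

    best = (0.69, 44.0)                          # imitation: start at a demo shot
    best_reward = -shoot(*best)
    epsilon = 0.1
    for _ in range(1000):
        if random.random() < epsilon:            # explore: try something new
            cand = (random.uniform(0.5, 0.9), random.uniform(30.0, 60.0))
        else:                                    # exploit: refine the best shot
            cand = (best[0] + random.gauss(0, 0.01),
                    best[1] + random.gauss(0, 0.5))
        if -shoot(*cand) > best_reward:
            best, best_reward = cand, -shoot(*cand)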

Key differences of supervised learning vs RL:

  • Supervised learning replicates what's already been done.
  • Reinforcement learning can explore the state space and take random actions. This allows RL to potentially become better than the current best.

Why we don't use the same simulated environment to generate labeled data and then use supervised learning methods instead of RL

We effectively do this for deep RL, which stores the experience generated in simulation in an experience replay buffer. But this is not possible for supervised learning, because the concept of reward is lacking.
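For illustration, a minimal replay buffer might look like the sketch below. Every stored transition carries a reward, which is exactly the field a plain (feature, label) dataset lacks:

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=10000):
            self.buf = deque(maxlen=capacity)    # old experience falls out

        def push(self, state, action, reward, next_state, done):
            self.buf.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            return random.sample(list(self.buf), batch_size)

    buf = ReplayBuffer()
    buf.push("square3", "right", 5, "square4", False)   # reward travels with the data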

Example: Walking in a maze.

Reinforcement Learning

Taking a right in square 3: Reward = 5

Taking a left in square 3: Reward = 0

Going up in square 3: Reward = -5

Supervised Learning

Taking a right in square 3

Taking a left in square 3

Going up in square 3

When you try to make a decision in square 3, RL will know to go right. Supervised learning will be confused, because in one example your data says to take a right in square 3, a second example says to take a left, and a third says to go up. So it will never converge.
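A few lines make the difference concrete, reusing the hypothetical square-3 numbers above:

    # RL: rewards rank the actions, so the choice is unambiguous.
    q = {"right": 5, "left": 0, "up": -5}
    print(max(q, key=q.get))        # -> "right"

    # SL: three conflicting "correct" labels for the same state.
    labels = ["right", "left", "up"]
    # A supervised model must treat all three as equally valid targets
    # for square 3, so its prediction can never settle on one action.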

In short, supervised learning is passive learning: all the data is collected before you start training your model.

However, reinforcement learning is active learning. In RL, you usually don't have much data at first, and you collect new data as you train your model. Your RL algorithm and model decide which specific data samples you collect while training.
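As a sketch of that active loop, here is tabular Q-learning on a made-up four-square corridor; which transitions end up in the "dataset" depends on the agent's own epsilon-greedy choices during training:

    import random

    ACTIONS = (-1, +1)                              # step left / step right
    q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}

    for episode in range(200):
        state = 0
        while state != 3:                           # square 3 is the goal
            if random.random() < 0.2:               # the model itself decides
                action = random.choice(ACTIONS)     # ...what data to gather
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state = min(max(state + action, 0), 3)
            reward = 5 if next_state == 3 else 0
            target = reward + 0.9 * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += 0.1 * (target - q[(state, action)])
            state = next_state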

In supervised learning, we have target labelled data which is assumed to be correct.

In RL, that's not the case: we have nothing but rewards. The agent needs to figure out for itself which action to take by interacting with the environment and observing the rewards it gets.

Supervised learning is about generalizing the knowledge given by the supervisor (training data) for use in an uncharted area (test data). It is based on instructive feedback, where the agent is provided with the correct action (label) to take in a given situation (features).

Reinforcement learning is about learning through trial-and-error interaction. There is no instructive feedback, only evaluative feedback, which assesses the action taken by an agent by indicating how good the action was, rather than stating the correct action to take.

Reinforcement learning is an area of machine learning. It is about taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behaviour or path to take in a specific situation. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data comes with the answer key, so the model is trained with the correct answers themselves, whereas in reinforcement learning there is no answer: the reinforcement agent decides what to do to perform the given task. In the absence of a training data set, it is bound to learn from its own experience.
