Reward function for Policy Gradient Descent in Reinforcement Learning

I'm currently learning about Policy Gradient Descent in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and practice) and what would be a good reward function for the case below?"

Details: I want to implement a Neural Net which should learn to play a simple board game using Policy Gradient Descent. I'll omit the details of the NN as they don't matter. The loss function for Policy Gradient Descent, as I understand it, is the negative log likelihood weighted by the reward: loss = -avg(r * log(p))
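
For concreteness, here is a minimal sketch of that loss in code. It assumes a PyTorch-style autograd setup (the question does not name a framework), and the probabilities and rewards are made-up numbers purely for illustration:

```python
import torch

# Minimal sketch of loss = -avg(r * log(p)).
# 'probs' are the probabilities the network assigned to the moves actually
# played in one episode; 'returns' are the rewards r credited to those moves.
probs = torch.tensor([0.30, 0.55, 0.20], requires_grad=True)
returns = torch.tensor([1.0, 1.0, 1.0])  # e.g. every move credited with a win

loss = -(returns * torch.log(probs)).mean()
loss.backward()

print(loss.item())  # scalar loss for the episode
print(probs.grad)   # d(loss)/dp_i = -r_i / (p_i * N)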

My question now is how to define the reward r? Since the game can have three different outcomes: win, loss, or draw - it seems that rewarding 1 for a win, 0 for a draw, and -1 for a loss (and some discounted value of those for the actions leading to those outcomes) would be a natural choice.
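
One simple way to realise "some discounted value of those" is to credit each move with the final outcome, discounted by its distance from the end of the game. A small sketch (the helper name and the discount factor 0.95 are my own illustrative choices, not part of the question):

```python
def discounted_returns(final_reward, num_moves, gamma=0.95):
    """Credit each move with the game outcome, discounted by its distance
    from the end of the game. final_reward is +1 (win), 0 (draw) or -1 (loss)."""
    return [final_reward * gamma ** (num_moves - 1 - t) for t in range(num_moves)]

# Example: a five-move game that ended in a win
print(discounted_returns(+1, 5))  # [0.8145..., 0.857375, 0.9025, 0.95, 1.0]
```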

However, mathematically I have some doubts:

Win Reward: 1 - This seems to make sense. It should push the probabilities of moves involved in wins towards 1, with a diminishing gradient as the probability gets closer to 1.

Draw Reward: 0 - This does not seem to make sense. It would just cancel out the probability terms in the equation, and no learning should be possible (as the gradient would always be 0).

Loss Reward: -1 - This should kind of work. It should push the probabilities of moves involved in losses towards 0. However, I'm concerned about the asymmetry of the gradient compared to the win case: the closer the probability gets to 0, the steeper the gradient becomes. I'm worried that this would create an extremely strong bias towards a policy that avoids losses - to the point where the win signal hardly matters at all.

You are on the right track. However, I believe you are confusing rewards with action probabilities. In the case of a draw, the agent simply learns that the reward at the end of the episode is zero. In the case of a loss, the loss function is the discounted reward (which should be -1) times the log of the action probabilities. So the update pulls the policy towards actions that end in a win and away from actions that end in a loss, with actions ending in a draw falling in the middle. Intuitively, it is very similar to supervised deep learning, only with an additional weighting parameter (the reward) attached to it.
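
To make the weighting argument concrete, here is a toy check in the same PyTorch-style sketch as above (the probability 0.4 is arbitrary): the same move probability is weighted by each of the three outcome rewards, and the sign of the resulting gradient shows which way gradient descent pushes it.

```python
import torch

# Toy check: the same move probability under the three possible outcome rewards.
for r in (+1.0, 0.0, -1.0):
    p = torch.tensor(0.4, requires_grad=True)
    loss = -r * torch.log(p)
    loss.backward()
    # Gradient descent moves p against the gradient:
    #   r = +1 -> negative gradient -> p is pushed up   (win)
    #   r =  0 -> zero gradient     -> p is left alone  (draw)
    #   r = -1 -> positive gradient -> p is pushed down (loss)
    print(f"r = {r:+.0f}   d(loss)/dp = {p.grad.item():.2f}")
```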

Additionally, I believe this paper from Google DeepMind would be useful for you: https://arxiv.org/abs/1712.01815. They actually talk about solving the chess problem using RL.
