Reward function for Policy Gradient Descent in Reinforcement Learning

I'm currently learning about Policy Gradient Descent in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and practice) and what would be a good reward function for the case below?"

Details: I want to implement a neural net that learns to play a simple board game using Policy Gradient Descent. I'll omit the details of the NN as they don't matter. The loss function for Policy Gradient Descent, as I understand it, is the reward-weighted negative log likelihood: loss = -avg(r * log(p))
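
To make the formula concrete, here is a minimal sketch of that loss in plain Python. It assumes that p is the probability the policy assigned to the move that was actually played and that r is the reward credited to that move; the function name and the sample numbers are hypothetical, not taken from any particular library.

```python
import math

def policy_gradient_loss(rewards, probs):
    """Return -avg(r * log(p)) over the moves that were played.

    rewards: reward credited to each played move (assumed convention)
    probs:   probability the policy gave to each played move, each in (0, 1]
    """
    if len(rewards) != len(probs):
        raise ValueError("rewards and probs must have the same length")
    terms = [r * math.log(p) for r, p in zip(rewards, probs)]
    return -sum(terms) / len(terms)

# Hypothetical three-move episode that ended in a win (r = 1 for every move).
loss = policy_gradient_loss([1.0, 1.0, 1.0], [0.5, 0.25, 0.8])
print(loss)
```

In a real training loop the same expression would be built from the network's outputs inside an autodiff framework so that it can be differentiated with respect to the weights.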

My question now is how to define the reward r. Since the game can have 3 different outcomes (win, loss, or draw), it seems that rewarding 1 for a win, 0 for a draw, and -1 for a loss (and some discounted value of those for the actions leading to those outcomes) would be a natural choice.
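
As a sketch of what "some discounted value" could look like, the snippet below spreads the final outcome back over the moves of one player, shrinking it by a factor gamma for every step away from the end of the game. The function name and the default gamma = 0.99 are assumptions made for illustration only.

```python
def discounted_rewards(outcome, num_moves, gamma=0.99):
    """Credit each move with outcome * gamma**k, where k is the number of
    moves between it and the final move (k = 0 for the last move).

    outcome:   +1 for a win, 0 for a draw, -1 for a loss
    num_moves: number of moves the player made in the episode
    gamma:     discount factor in (0, 1] (assumed value)
    """
    return [outcome * gamma ** (num_moves - 1 - t) for t in range(num_moves)]

# The last move gets the full outcome, earlier moves get progressively less.
print(discounted_rewards(1.0, 4))
print(discounted_rewards(-1.0, 4))
print(discounted_rewards(0.0, 4))
```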

However, mathematically I have doubts:

Win Reward: 1 - This seems to make sense. This should push probabilities towards 1 for moves involved in wins with diminishing gradient the closer the probability gets to 1.

Draw Reward: 0 - This does not seem to make sense. This would just cancel out any probabilities in the equation and no learning should be possible (as the gradient should always be 0).

Loss Reward: -1 - This should kind of work. It should push probabilities towards 0 for moves involved in losses. However, I'm concerned about the asymmetry of the gradient compared to the win case. The closer to 0 the probability gets, the steeper the gradient gets. I'm concerned that this would create an extremely strong bias towards a policy that avoids losses - to the degree where the win signal doesn't matter much at all.

You are on the right track. However, I believe you are confusing rewards with action probabilities. In the case of a draw, the agent learns that the reward itself is zero at the end of the episode. In the case of a loss, the loss function is the discounted reward (which should be -1 at the final move) times the log of the action probabilities. So the updates push the policy towards actions that end in a win and away from actions that end in a loss, with actions that end in a draw falling in the middle. Intuitively, it is very similar to supervised deep learning, only with an additional weighting factor (the reward) attached to each example.
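
To see how the reward acts as a weight, here is a small sketch that assumes the network ends in a softmax over the legal moves (the question does not say this, so treat it as an assumption). For a softmax policy, the gradient of -r * log(p[a]) with respect to the logits is r * (p - onehot(a)), so it scales linearly with the reward. The helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_grad_wrt_logits(logits, action, reward):
    """Gradient of -reward * log(softmax(logits)[action]) w.r.t. each logit."""
    probs = softmax(logits)
    return [reward * (p - (1.0 if i == action else 0.0))
            for i, p in enumerate(probs)]

# Same position and same chosen move, three different outcomes.
logits = [0.0, 0.0, 0.0]
for name, r in [("win", 1.0), ("draw", 0.0), ("loss", -1.0)]:
    print(name, loss_grad_wrt_logits(logits, action=0, reward=r))
```

Under this assumption the win and loss gradients are exact mirror images, each component is bounded by the size of the reward, and a draw contributes no update for that sample.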

Additionally, I believe this paper from Google DeepMind (https://arxiv.org/abs/1712.01815) would be useful for you. They actually talk about solving the chess problem using RL.
