
Reward function for Policy Gradient Descent in Reinforcement Learning

Original post: 2018-06-29 00:29:35 4 1 reinforcement-learning/ policy-gradient-descent

I'm currently learning about Policy Gradient Descent in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and practice) and what would be a good reward function for the case below?"

Details: I want to implement a neural net that learns to play a simple board game using Policy Gradient Descent. I'll omit the details of the NN as they don't matter. The loss function for Policy Gradient Descent, as I understand it, is the negative log-likelihood weighted by the reward: loss = -avg(r * log(p)), where p is the probability the network assigned to the move that was played and r is the reward for that move.
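As a minimal sketch of the loss described above, assuming NumPy and a hypothetical helper name `policy_gradient_loss` (neither appears in the original post), the formula can be written out directly:

```python
import numpy as np

def policy_gradient_loss(probs, rewards):
    """loss = -avg(r * log(p)).

    probs:   probability the policy assigned to each move that was played
    rewards: reward attached to each of those moves
    """
    probs = np.asarray(probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    return -np.mean(rewards * np.log(probs))

# Three moves with equal probability: one from a won game,
# one from a drawn game, one from a lost game.
loss = policy_gradient_loss(probs=[0.5, 0.5, 0.5], rewards=[1.0, 0.0, -1.0])
```

In this illustrative batch the win and loss terms cancel and the draw term contributes nothing, so only the gradient of the loss, not its value, says anything about how the policy would change.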

My question now is how to define the reward r. Since the game has 3 possible outcomes (win, loss, or draw), it seems that rewarding 1 for a win, 0 for a draw, and -1 for a loss (and some discounted value of those for the actions leading to those outcomes) would be a natural choice.
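One possible reading of "some discounted value of those" is sketched below. The function name `discounted_rewards` and the discount factor `gamma=0.9` are assumptions made for illustration; the post does not specify either.

```python
def discounted_rewards(outcome, num_moves, gamma=0.9):
    """Return one reward per move of a finished game.

    The final move receives the full outcome (+1, 0 or -1); a move made
    k steps before the end receives outcome * gamma**k.
    """
    return [outcome * gamma ** (num_moves - 1 - t) for t in range(num_moves)]

rewards_win = discounted_rewards(outcome=1.0, num_moves=3)
rewards_draw = discounted_rewards(outcome=0.0, num_moves=3)
rewards_loss = discounted_rewards(outcome=-1.0, num_moves=3)
```

Under this scheme every move of a drawn game still gets a reward of exactly 0, which is the case the question goes on to examine.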

However, mathematically I have doubts:

Win Reward: 1 - This seems to make sense. This should push probabilities towards 1 for moves involved in wins with diminishing gradient the closer the probability gets to 1.

Draw Reward: 0 - This does not seem to make sense. This would just cancel out any probabilities in the equation and no learning should be possible (as the gradient should always be 0).

Loss Reward: -1 - This should kind of work. It should push probabilities towards 0 for moves involved in losses. However, I'm concerned about the asymmetry of the gradient compared to the win case. The closer to 0 the probability gets, the steeper the gradient gets. I'm concerned that this would create an extremely strong bias towards a policy that avoids losses - to the degree where the win signal doesn't matter much at all.
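The three cases above can be checked with a short derivation. It assumes the network ends in a softmax layer with logits z (the question does not say this), and considers a single move a with reward r:

```latex
\begin{align*}
\ell &= -r \log p_a,
&\frac{\partial \ell}{\partial p_a} &= -\frac{r}{p_a}, \\
p_a &= \frac{e^{z_a}}{\sum_b e^{z_b}}
&\Rightarrow\quad
\frac{\partial \ell}{\partial z_a} &= -r\,(1 - p_a),
\qquad
\frac{\partial \ell}{\partial z_b} = r\,p_b \quad (b \neq a).
\end{align*}
```

With respect to the probability itself, the gradient does grow like 1/p as p approaches 0. With respect to the logits, which are what the weights actually feed into under this assumption, the softmax Jacobian cancels that factor: for r = 1 the magnitude 1 - p_a shrinks as p_a approaches 1, for r = -1 it approaches 1 as p_a approaches 0 and never exceeds 1, and for r = 0 every component is zero.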

1 answer

You are on the right track. However, I believe you are confusing rewards with action probabilities. In the case of a draw, the network simply learns that the reward at the end of the episode is zero. In the case of a loss, the loss function is the discounted reward (which should be -1) times the log of the action probabilities. So training pushes the policy towards actions that end in a win and away from actions that end in a loss, with actions ending in a draw falling in the middle. Intuitively, it is very similar to supervised deep learning, only with an additional weighting parameter (the reward) attached to each example.
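The "win up, loss down, draw in the middle" behaviour can be illustrated with a tiny numerical sketch. It assumes a softmax policy over three hypothetical moves, a learning rate of 0.5, and helper names (`softmax`, `updated_prob`) chosen here for illustration; none of these come from the answer itself.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def updated_prob(logits, action, reward, lr=0.5):
    """Take one gradient-descent step on loss = -reward * log(p[action])
    and return the new probability of `action`."""
    p = softmax(logits)
    one_hot = np.eye(len(logits))[action]
    grad = -reward * (one_hot - p)  # d loss / d logits for a softmax policy
    return softmax(logits - lr * grad)[action]

logits = np.zeros(3)               # uniform policy over three moves
p_before = softmax(logits)[0]
p_win = updated_prob(logits, action=0, reward=1.0)
p_draw = updated_prob(logits, action=0, reward=0.0)
p_loss = updated_prob(logits, action=0, reward=-1.0)
```

Here the move's probability rises after a win, stays where it was after a draw, and falls after a loss, so in this sketch a reward of 0 gives no update for that move rather than breaking training.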

Additionally, I believe this paper from Google DeepMind would be useful for you: https://arxiv.org/abs/1712.01815. They actually talk about solving the chess problem using RL.


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 2018-06-29 18:05:01
