
Reinforcement Learning Policy Gradient: two different update methods with reward?

I was looking at two different examples of policy gradient and was wondering why there are two different methods for getting the advantage function. The first, at https://github.com/flyyufelix/VizDoom-Keras-RL/blob/master/reinforce.py, sets the advantage directly to the discounted reward:

advantages[i][self.actions[i]] = discounted_rewards[i]

and passes this straight to categorical cross-entropy.

The second, at https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/7_Policy_gradient_softmax/RL_brain.py, has

loss = tf.reduce_mean(neg_log_prob * self.tf_vt)

which it then minimizes (tf_vt is just the discounted reward, calculated in the same way as above).

Shouldn't the first method be

advantages[i][self.actions[i]] = probability_from_softmax * discounted_rewards[i]

if the two are to be the same? Or am I misunderstanding how the negative log-probabilities are being calculated manually here?

I think it's one and the same; they are just written in two different ways. The first snippet computes the advantage function, while the second computes the loss directly. The advantage function measures how much a particular reward is better than the average reward (the value); if you take the value to be zero, you get the first definition. The loss still has to be formed from that advantage, and categorical cross-entropy does exactly that: because the target vector is zero everywhere except at the taken action, the cross-entropy reduces to the log-probability of that action multiplied by the advantage. That is why the softmax probability should not be multiplied in again by hand. In the second snippet, this whole computation is simply written out manually.
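
To make the equivalence concrete, here is a minimal NumPy sketch of the math (not the repos' actual code), using made-up numbers for the softmax output, the sampled action, and the discounted return. It shows that a one-hot target scaled by the discounted reward, pushed through categorical cross-entropy, gives exactly the same number as neg_log_prob * tf_vt:

import numpy as np

# Hypothetical values for one time step: the policy's softmax output,
# the action that was actually sampled, and its discounted return.
probs = np.array([0.2, 0.5, 0.3])   # pi(a|s) after softmax
action = 1                          # index of the sampled action
discounted_reward = 4.0             # G_t (tf_vt in the second repo)

# Method 1 (reinforce.py style): target vector holds the discounted reward
# at the taken action's slot, then categorical cross-entropy
# CE(y, p) = -sum_a y[a] * log(p[a]) is applied.
advantages = np.zeros_like(probs)
advantages[action] = discounted_reward
loss_cross_entropy = -np.sum(advantages * np.log(probs))

# Method 2 (RL_brain.py style): -log pi(a|s) of the taken action,
# scaled by the discounted reward directly.
neg_log_prob = -np.log(probs[action])
loss_manual = neg_log_prob * discounted_reward

print(loss_cross_entropy, loss_manual)  # both print ~2.7726

The zeros in the target vector kill every term of the cross-entropy sum except the one for the taken action, so the two losses (and therefore their gradients with respect to the network parameters) are identical.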
