
How do we assess each reward in the return in Policy Gradient Methods?

Hi StackOverflow Community,

I have a problem with the policy gradient methods in reinforcement learning.

In policy gradient methods, we increase or decrease the log probability of an action based on the return (i.e., the sum of rewards) from that step onwards. So if the return is high, we increase the log probability. But I have a problem at this step.

Let's say that our return is made up of three rewards. Although the sum of these three rewards is high, the second reward is really bad.
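For concreteness, here is a minimal NumPy sketch of what I mean (the episode data, the log probabilities and the discount factor are just made up): the log probability of each action gets scaled by the return-to-go, so even the step with the bad reward is "reinforced" because the rewards after it are large.

import numpy as np

def returns_to_go(rewards, gamma=0.99):
    # G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode: log-probs of the actions taken and the per-step rewards.
log_probs = np.array([-0.7, -1.2, -0.3])
rewards   = np.array([ 5.0, -4.0,  6.0])   # second reward is really bad

G = returns_to_go(rewards, gamma=1.0)      # -> [7., 2., 6.]
# REINFORCE-style pseudo-loss: each log-prob is weighted by its return-to-go,
# so the sign of G_t, not of the individual reward r_t, decides the update.
pseudo_loss = -(log_probs * G).sum()
print(G, pseudo_loss)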

How do we deal with this problem? How do we assess each reward separately? Is there an alternative version of policy gradient methods for this?

This is a multi-objective problem, where the reward is not a scalar but a vector. By definition, there is no single optimal policy in the classical sense; instead, there is a set of Pareto-optimal policies, i.e., policies for which you cannot do better with respect to one objective (maximizing the sum of the first reward, for instance) without losing something on the other objectives (maximizing the sums of the other rewards). There are many ways to approach multi-objective problems, both in optimization (often with genetic algorithms) and in RL. Naively, you could just scalarize the rewards by a linear weighting, but that is really inefficient. More sophisticated approaches learn a manifold in policy-parameter space (e.g., this).
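To illustrate the naive linear scalarization mentioned above, here is a minimal sketch (NumPy only; the weights and the per-step reward vectors are made up): each vector reward is collapsed to a scalar, and those scalars then feed the usual policy-gradient return.

import numpy as np

def scalarize(reward_vectors, weights):
    # Collapse a vector reward r_t in R^k to the scalar w . r_t at every step.
    return np.asarray(reward_vectors) @ np.asarray(weights)

# Hypothetical episode with a 3-dimensional reward vector per step.
reward_vectors = [
    [ 1.0,  0.0, 2.0],
    [-4.0,  1.0, 0.5],
    [ 3.0, -0.5, 1.0],
]
weights = [0.5, 0.3, 0.2]   # one particular trade-off between the objectives

scalar_rewards = scalarize(reward_vectors, weights)
print(scalar_rewards)       # these scalars replace the rewards in the usual update

Each choice of weights commits you to one trade-off and thus recovers at most one Pareto-optimal policy, so covering the whole Pareto set this way requires many separate runs, which is part of why this approach is inefficient.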
