
Several dips in accumulated episodic rewards during training of a reinforcement learning agent

Hi, I am training reinforcement learning agents for a control problem using the PPO algorithm. I track the accumulated reward for each episode during training. Several times during training I see a sudden dip in the accumulated rewards, and I cannot figure out why this happens or how to avoid it. I have tried changing some of the hyper-parameters, such as the number of neurons in the neural network layers and the learning rate, but I still see this happening consistently. If I debug and inspect the actions taken during the dips, they are obviously very bad, which is what causes the drop in reward.
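For context, my setup looks roughly like the following (a minimal sketch in the style of Stable-Baselines3's PPO; the environment id and hyper-parameter values are illustrative, not my exact configuration):

```python
from stable_baselines3 import PPO

# Illustrative hyper-parameters -- these are the kinds of knobs I have been varying.
model = PPO(
    "MlpPolicy",
    "Pendulum-v1",                          # placeholder env; my real task is a custom control problem
    learning_rate=3e-4,                     # tried several values here
    n_steps=2048,
    batch_size=64,
    clip_range=0.2,                         # PPO's clipping of the policy-ratio update
    max_grad_norm=0.5,                      # gradient-norm clipping
    policy_kwargs=dict(net_arch=[64, 64]),  # tried different layer sizes here
)
model.learn(total_timesteps=500_000)
```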

Can someone help me understand why this is happening or how to avoid it?

Some plots from my training process:

[plots of accumulated episodic reward during training, showing the sudden dips]

I recently read this paper: https://arxiv.org/pdf/1805.07917.pdf I haven't used the method in particular, so I can't really vouch for its usefulness, but its explanation of this problem seemed convincing to me:

For instance, during the course of learning, the cheetah benefits from leaning forward to increase its speed which gives rise to a strong gradient in this direction. However, if the cheetah leans too much, it falls over. The gradient-based methods seem to often fall into this trap and then fail to recover as the gradient information from the new state has no guarantees of undoing the last gradient update.
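One pragmatic workaround, which is not the method from the paper but a common safeguard, is to periodically evaluate the policy, keep a checkpoint of the best one seen so far, and roll back to it if the return suddenly collapses. A minimal sketch, assuming a Stable-Baselines3-style PPO with an illustrative environment and threshold:

```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

ENV_ID = "Pendulum-v1"   # placeholder environment
CHUNK = 10_000           # timesteps of training between evaluations
DROP = 200.0             # illustrative threshold: how far below the best return counts as a collapse

model = PPO("MlpPolicy", ENV_ID, verbose=0)
best_return = -np.inf

for _ in range(50):
    # Train for a chunk, then measure the current policy's average episodic return.
    model.learn(total_timesteps=CHUNK, reset_num_timesteps=False)
    mean_return, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=5)

    if mean_return > best_return:
        best_return = mean_return
        model.save("best_ppo_checkpoint.zip")      # remember the best policy seen so far
    elif best_return - mean_return > DROP:
        # Sudden dip: restore the last good parameters instead of relying on
        # further gradient steps from the collapsed policy to undo the damage.
        model.set_parameters("best_ppo_checkpoint.zip")
```

This does not explain why the dips occur, but it keeps a single destructive update from throwing away the progress made so far.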
