
Has anyone managed to make Asynchronous Advantage Actor-Critic (A3C) work with MuJoCo experiments?

I'm using an open-source A3C implementation in TensorFlow that works reasonably well for Atari 2600 experiments. However, when I modify the network for MuJoCo, as outlined in the paper, the network refuses to learn anything meaningful. Has anyone managed to make any open-source implementation of A3C work with continuous-domain problems, for example MuJoCo?

I have implemented a continuous-action version for Pendulum and it works well.

First, build your neural network so that it outputs a mean (mu) and a standard deviation (sigma) for selecting an action.
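For example, something like this (a minimal TF 1.x sketch; the placeholder names, layer sizes, and the relu6/tanh/softplus activations are illustrative choices on my part, not part of the original code):

import tensorflow as tf

state_dim, action_dim = 3, 1                        # e.g. Pendulum-v0
s = tf.placeholder(tf.float32, [None, state_dim], name='state')         # observations
action = tf.placeholder(tf.float32, [None, action_dim], name='action')  # actions actually taken

hidden = tf.layers.dense(s, 200, tf.nn.relu6, name='actor_hidden')
mu = tf.layers.dense(hidden, action_dim, tf.nn.tanh, name='mu')            # mean; typically rescaled to the action range (e.g. *2 for Pendulum)
sigma = tf.layers.dense(hidden, action_dim, tf.nn.softplus, name='sigma') + 1e-4  # std, kept strictly positive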

The essential part of the continuous-action case is to include a normal distribution. I'm using TensorFlow, so the code looks like:

normal_dist = tf.contrib.distributions.Normal(mu, sigma)  # Gaussian policy pi(a|s)
log_prob = normal_dist.log_prob(action)                   # log-probability of the taken action
exp_v = log_prob * td_error                               # weight by the advantage estimate
entropy = normal_dist.entropy()                           # encourage exploration
exp_v = tf.reduce_sum(0.01 * entropy + exp_v)
actor_loss = -exp_v                                       # maximize exp_v by minimizing its negative
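For context, td_error above is the advantage estimate coming from a critic; a rough sketch of where it might come from (the names critic_hidden, v, and v_target are my assumptions, not part of the snippet above):

critic_hidden = tf.layers.dense(s, 100, tf.nn.relu6, name='critic_hidden')
v = tf.layers.dense(critic_hidden, 1, name='v')                       # state value V(s)
v_target = tf.placeholder(tf.float32, [None, 1], name='v_target')     # discounted return

td_error = v_target - v                                   # advantage estimate fed to the actor
critic_loss = tf.reduce_mean(tf.square(td_error))         # critic regression loss

When td_error feeds the actor loss, it is usually treated as a constant (e.g. via tf.stop_gradient, or by feeding it in as a placeholder) so the actor update does not push gradients into the critic.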

When you want to sample an action, use the sampling function TensorFlow provides:

sampled_action = normal_dist.sample(1)
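One practical detail: a raw sample from the Gaussian can fall outside the environment's action range, so it is common to squeeze and clip it to the bounds (action_low and action_high here stand for whatever your environment reports, e.g. -2 and 2 for Pendulum):

sampled_action = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0),
                                  action_low, action_high)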

The full Pendulum code can be found on my GitHub: https://github.com/MorvanZhou/tutorials/blob/master/Reinforcement_learning_TUT/10_A3C/A3C_continuous_action.py

I was hung up on this for a long time; hopefully this helps someone in the same position:

Advantage actor-critic in discrete action spaces is easy: if your actor does better than you expect, increase the probability of taking that action. If it does worse, decrease it.

In continuous spaces though, how do you do this? The entire vector your policy function outputs is your move -- if you are on-policy, and you do better than expected, there's no way of saying "let's output that action even more!" because you're already outputting exactly that vector.

That's where Morvan's answer comes into play. Instead of outputting just an action, you output a mean and a standard deviation for each action dimension. To choose an action, you pass your inputs in to produce a mean and standard deviation for each dimension, and then sample each dimension from that normal distribution.

If you do well, you adjust the weights of your policy network to change the mean/stddev to encourage this action. If you do poorly, you do the opposite.
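Concretely, "adjusting the weights" is just a gradient step on a loss like the actor_loss shown above; a minimal sketch (the RMSProp optimizer and learning rates are arbitrary choices here, and in full A3C each worker would apply these gradients to the shared global network rather than to its local copy):

actor_train_op = tf.train.RMSPropOptimizer(1e-4).minimize(actor_loss)    # policy update
critic_train_op = tf.train.RMSPropOptimizer(1e-3).minimize(critic_loss)  # value update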
