I'm trying to learn DQN using TensorFlow. My action space contains both valid and invalid actions for each state. I set up the q_target network as:
t1 = tf.layers.dense(s_, 20, tf.nn.relu, kernel_initializer=w, bias_initializer=b, name='t1')
q_next = tf.layers.dense(t1, n_actions, kernel_initializer=w, bias_initializer=b, name='t2')
How can I make this work in TensorFlow, such that only valid actions are considered in the max:
q_target = r + self.gamma * max(q_next(valid_actions))
For example:
q_target = [[1, 2, 3], [4, 5, 6]]
valid_actions = [[True, True, False], [False, True, False]]
output: max(q_next_valid) = [2, 5]
Thank you!
Based on your example scenario, you can implement this with the tf.math.reduce_max() method.
import tensorflow as tf  # TensorFlow 2.1.0

q_target = [[1, 2, 3], [4, 5, 6]]
valid_actions = [[True, True, False], [False, True, False]]

# Cast the Boolean mask to integers so it can be multiplied with q_target.
valid_actions = tf.cast(valid_actions, dtype=tf.int32)

# output: max(q_next_valid) = [2, 5]
tf.math.reduce_max(q_target * valid_actions, axis=1, keepdims=False)
# <tf.Tensor: shape=(2,), dtype=int32, numpy=array([2, 5], dtype=int32)>
tf.math.reduce_max(q_target * valid_actions, axis=1, keepdims=True)
# <tf.Tensor: shape=(2, 1), dtype=int32, numpy=array([[2], [5]], dtype=int32)>
I converted the Boolean mask into integers so it can be multiplied element-wise with q_target, which zeroes out the invalid entries. Setting keepdims=True retains the original rank of the tensor, while keepdims=False reduces the rank by 1. Note that this zeroing trick assumes the valid Q-values are non-negative: if the true maximum among valid actions were negative, a zeroed-out invalid entry could incorrectly win the max.
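Since the zeroing trick breaks down when Q-values can be negative, a more robust variant (a sketch, using example values I made up) is to replace invalid entries with -inf via tf.where before taking the max:

```python
import tensorflow as tf

q_next = tf.constant([[1.0, -2.0, 3.0], [-4.0, 5.0, -6.0]])
valid_actions = tf.constant([[True, True, False], [False, True, False]])

# Fill invalid slots with -inf so they can never win the max,
# even when all valid Q-values are negative.
neg_inf = tf.fill(tf.shape(q_next), float('-inf'))
masked = tf.where(valid_actions, q_next, neg_inf)
max_valid = tf.math.reduce_max(masked, axis=1)  # -> [1., 5.]
```

Here the max over the first row is 1.0 (not 3.0, which is invalid), and over the second row is 5.0, regardless of sign.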
You can read more about tf.math.reduce_max() in the official TensorFlow documentation.