Inconsistencies between tf.contrib.layers.fully_connected, tf.layers.dense, tf.contrib.slim.fully_connected, tf.keras.layers.Dense

I am trying to implement policy gradient for a contextual bandit problem ( https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-1-5-contextual-bandits-bff01d1aad9c ).

I am defining a model in tensorflow to solve this problem using a single fully-connected layer.

I am trying out different TensorFlow APIs, but want to avoid the contrib package since it is not officially supported. I am interested in the Keras API, since I am already familiar with its functional interface and it is now available as tf.keras. However, I can only get the agent to learn when using tf.contrib.slim.fully_connected or tf.contrib.layers.fully_connected (the former calls the latter).

The following two snippets work correctly (one_hot_encoded_state_input and num_actions both adhere to the expected tensor shapes for the layers).

import tensorflow.contrib.slim as slim
action_probability_distribution = slim.fully_connected(
    one_hot_encoded_state_input,
    num_actions,
    biases_initializer=None,
    activation_fn=tf.nn.sigmoid,
    weights_initializer=tf.ones_initializer())

and

from tensorflow.contrib.layers import fully_connected
action_probability_distribution = fully_connected(
    one_hot_encoded_state_input,
    num_actions,
    biases_initializer=None,
    activation_fn=tf.nn.sigmoid,
    weights_initializer=tf.ones_initializer())

On the other hand, neither of the following work:

action_probability_distribution = tf.layers.dense(
    one_hot_encoded_state_input,
    num_actions,
    activation=tf.nn.sigmoid,
    bias_initializer=None,
    kernel_initializer=tf.ones_initializer())

nor

action_probability_distribution = tf.keras.layers.Dense(
    num_actions,
    activation='sigmoid',
    bias_initializer=None,
    kernel_initializer='Ones')(one_hot_encoded_state_input)

The last two cases use TensorFlow's high-level layers and keras APIs. Ideally, I would like to know whether I am incorrectly translating the first two cases into the last two, or whether the latter two are simply not equivalent to the former two.

For completeness, here is the entire code needed to run this (Note: python 3.5.6 and tensorflow 1.12.0 were used).

import tensorflow as tf
import numpy as np
tf.reset_default_graph()

num_states = 3
num_actions = 4
learning_rate = 1e-3

state_input = tf.placeholder(shape=(None,),dtype=tf.int32, name='state_input')
one_hot_encoded_state_input = tf.one_hot(state_input, num_states)

# DOESN'T WORK
action_probability_distribution = tf.keras.layers.Dense(num_actions, activation='sigmoid', bias_initializer=None, kernel_initializer = 'Ones')(one_hot_encoded_state_input)

# WORKS
# import tensorflow.contrib.slim as slim
# action_probability_distribution = slim.fully_connected(one_hot_encoded_state_input,num_actions,\
#     biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())

# WORKS
# from tensorflow.contrib.layers import fully_connected
# action_probability_distribution = fully_connected(one_hot_encoded_state_input,num_actions,\
#     biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones_initializer())

# DOESN'T WORK
# action_probability_distribution = tf.layers.dense(one_hot_encoded_state_input,num_actions, activation=tf.nn.sigmoid, bias_initializer=None, kernel_initializer=tf.ones_initializer())

action_probability_distribution = tf.squeeze(action_probability_distribution)
action_chosen = tf.argmax(action_probability_distribution)

reward_input = tf.placeholder(shape=(None,), dtype=tf.float32, name='reward_input')
action_input = tf.placeholder(shape=(None,), dtype=tf.int32, name='action_input')
responsible_weight = tf.slice(action_probability_distribution, action_input, [1])
loss = -(tf.log(responsible_weight)*reward_input)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
update = optimizer.minimize(loss)


bandits = np.array([[0.2,0,-0.0,-5],
                    [0.1,-5,1,0.25],
                    [-5,5,5,5]])

assert bandits.shape == (num_states, num_actions)

def get_reward(state, action): # the lower the value of bandits[state][action], the higher the likelihood of reward
    if np.random.randn() > bandits[state][action]:
        return 1
    return -1

max_episodes = 10000
epsilon = 0.1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    rewards = np.zeros(num_states)
    for episode in range(max_episodes):
        state = np.random.randint(0,num_states)
        action = sess.run(action_chosen, feed_dict={state_input:[state]})
        if np.random.rand(1) < epsilon:
            action = np.random.randint(0, num_actions)

        reward = get_reward(state, action)
        sess.run([update, action_probability_distribution, loss], feed_dict = {reward_input: [reward], action_input: [action], state_input: [state]})

        rewards[state] += reward

        if episode%500 == 0:
            print(rewards)

When using the chunks commented # WORKS, the agent learns and maximizes reward across all three states. Those commented # DOESN'T WORK don't learn, and typically converge extremely quickly to choosing a single action. With working behaviour, the printed reward list contains positive, increasing numbers (good cumulative reward for each state); with non-working behaviour, the cumulative reward keeps increasing for only one entry while the others are sacrificed (negative cumulative reward).

For anyone who runs into this issue (especially since TensorFlow has several APIs for the same layer), the difference comes down to bias initialization defaults. For tf.contrib.layers and tf.contrib.slim, passing biases_initializer=None means that no bias is used at all. Replicating this with tf.layers and tf.keras requires use_bias=False; passing bias_initializer=None does not have the same effect.
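
For reference, a minimal sketch of the corrected non-contrib layer definitions, assuming the same one_hot_encoded_state_input and num_actions as above; use_bias=False takes the place of biases_initializer=None from the contrib/slim versions:

# tf.layers version: disable the bias explicitly with use_bias=False
action_probability_distribution = tf.layers.dense(
    one_hot_encoded_state_input,
    num_actions,
    activation=tf.nn.sigmoid,
    use_bias=False,
    kernel_initializer=tf.ones_initializer())

# tf.keras version: same idea with the Keras layer API
action_probability_distribution = tf.keras.layers.Dense(
    num_actions,
    activation='sigmoid',
    use_bias=False,
    kernel_initializer='Ones')(one_hot_encoded_state_input)

With either of these in place of the # DOESN'T WORK snippets, the layer matches the contrib/slim behaviour (no bias, all-ones kernel) and the agent learns as expected.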
