
Tensorflow Reinforcement Learning RNN returning NaN's after Optimization with GradientTape

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def create_example_model():
    # Use float64 throughout to match the observation arrays
    tf.keras.backend.set_floatx('float64')
    model = Sequential()
    # 60 timesteps, one feature per column of the training dataframe
    model.add(LSTM(128, input_shape=(60, len(df_train.columns))))
    model.add(Dense(64, activation='relu'))
    # Three raw logits, one per action; softmax is applied at sampling time
    model.add(Dense(3, activation=None))
    return model

def choose_action(model, observation):
    # Add a batch dimension: (60, n_features) -> (1, 60, n_features)
    observation = np.expand_dims(observation, axis=0)
    logits = model.predict(observation)
    # Convert logits to action probabilities and sample one of 3 actions
    prob_weights = tf.nn.softmax(logits).numpy()
    action = np.random.choice(3, size=1, p=prob_weights.flatten())[0]
    return action

def train_step(model, optimizer, observations, actions, discounted_rewards):
    with tf.GradientTape() as tape:
        # Forward pass and loss are recorded on the tape
        logits = model(observations)
        loss = compute_loss(logits, actions, discounted_rewards)
    # Backward pass and weight update happen outside the tape context
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

learning_rate = 1e-3
optimizer = tf.keras.optimizers.Adam(learning_rate)
env = TradingEnv(rnn_ready_array)

model = create_example_model()
memory = Memory()
info_list = []

for i_episode in range(10):
    observation = env.reset()
    memory.clear()

    while True:
        # Sample an action from the current policy and step the environment
        action = choose_action(model, observation)
        next_observation, reward, done, info = env.step(action)
        info_list.append(info)
        memory.add_to_memory(observation, action, reward)
        if done:
            # Train on the full episode once it terminates
            total_reward = sum(memory.rewards)
            train_step(model, optimizer,
                       observations=np.array(memory.observations),
                       actions=np.array(memory.actions),
                       discounted_rewards=discount_rewards(memory.rewards))
            memory.clear()
            break
        observation = next_observation
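
The helpers compute_loss, discount_rewards, Memory, and TradingEnv are not shown here. For context, a compute_loss in the style of the MIT course material this code is adapted from would look roughly like the sketch below (a reconstruction, not necessarily the exact code in use):

def compute_loss(logits, actions, rewards):
    # Policy-gradient loss: negative log-probability of the actions
    # actually taken, weighted by the discounted returns
    neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=actions)
    return tf.reduce_mean(neg_logprob * rewards)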

I am working on a reinforcement learning project with Tensorflow 2.0; the format of the code comes from an online MIT course, which I am attempting to adapt to my own project. I am new to Tensorflow 2.0 and I can't glean from the documentation why this problem is occurring. The issue is that when I run the reinforcement learning process:

  1. The first episode always completes successfully.
  2. A new observation is always generated from the model successfully.
  3. During the second episode, the network always outputs [NaN, NaN, NaN] (a quick check for this is shown below the list).
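
To confirm that the weights themselves have been corrupted by the optimization step (rather than one particular observation being bad), the model's weights can be inspected directly, e.g.:

# True as soon as any weight in the model has become NaN
print(any(np.isnan(w).any() for w in model.get_weights()))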

Some debugging info I have found that should be helpful: if I comment out the optimization lines 'grads = tape.gradient(...)' and 'optimizer.apply_gradients(...)', the script runs to completion without errors (though it is obviously not doing anything useful without optimization). This indicates to me that the optimization process is changing the model in a way that causes the problem. I have tried to include only the functions necessary for debugging; if any further information is needed, I'd be happy to add it in an edit.
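
One way to narrow this down is to check the loss and every gradient for NaN/Inf before the weights are updated, for example with tf.debugging.check_numerics (a debugging sketch, not part of the original code):

def train_step(model, optimizer, observations, actions, discounted_rewards):
    with tf.GradientTape() as tape:
        logits = model(observations)
        loss = compute_loss(logits, actions, discounted_rewards)
    grads = tape.gradient(loss, model.trainable_variables)
    # Fail fast, with a useful message, the moment anything goes non-finite
    tf.debugging.check_numerics(loss, 'loss is NaN/Inf')
    for g, v in zip(grads, model.trainable_variables):
        tf.debugging.check_numerics(g, 'bad gradient for ' + v.name)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

If these checks pass but the inputs to the next train_step are already NaN, the problem is upstream of the optimizer, in the data being fed to it.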

After hours of checking and rechecking various containers, I realized it was the discounted rewards function that was not working properly, returning NaN in this circumstance. Issue resolved :)
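
A common way this happens: if discount_rewards normalizes the returns by their standard deviation, an episode whose rewards are all identical has a standard deviation of zero, and the division produces NaN, which then corrupts the weights on the next apply_gradients call. A guarded sketch (the normalization step is an assumption about the helper, which is not shown above, and gamma is an arbitrary example value):

def discount_rewards(rewards, gamma=0.95, eps=1e-8):
    # Compute discounted returns, working backwards through the episode
    discounted = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    # The eps guard prevents division by zero (and thus NaN) when every
    # return in the episode is identical
    return (discounted - discounted.mean()) / (discounted.std() + eps)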
