
My First LSTM RNN Loss Is Not Reducing As Expected

I've been looking at RNN example documentation and trying to roll my own simple RNN for sequence-to-sequence prediction, using the tiny Shakespeare corpus with outputs shifted by one character. I'm using sherjilozair's fantastic utils.py to load the data ( https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/utils.py ), but my training run looks like this...

loading preprocessed files
('epoch ', 0, 'loss ', 930.27938270568848)
('epoch ', 1, 'loss ', 912.94828796386719)
('epoch ', 2, 'loss ', 902.99976110458374)
('epoch ', 3, 'loss ', 902.90720677375793)
('epoch ', 4, 'loss ', 902.87029957771301)
('epoch ', 5, 'loss ', 902.84992623329163)
('epoch ', 6, 'loss ', 902.83739829063416)
('epoch ', 7, 'loss ', 902.82908940315247)
('epoch ', 8, 'loss ', 902.82331037521362)
('epoch ', 9, 'loss ', 902.81916546821594)
('epoch ', 10, 'loss ', 902.81605243682861)
('epoch ', 11, 'loss ', 902.81366014480591)

I was expecting a much sharper drop-off, and even after 1000 epochs it's still around the same. I think there's something wrong with my code, but I can't see what. I've pasted the code below; if anyone could have a quick look over it and see whether anything stands out as odd, I'd be very grateful. Thank you.

#
# rays second predictor
#
# take basic example and convert to rnn
#

from tensorflow.examples.tutorials.mnist import input_data

import sys
import argparse
import pdb
import tensorflow as tf

from utils import TextLoader

def main(_):
    # break

    # number of hidden units
    lstm_size = 24

    # embedding of dimensionality 15 should be ok for characters, 300 for words
    embedding_dimension_size = 15

    # load data and get vocab size
    num_steps = FLAGS.seq_length
    data_loader = TextLoader(FLAGS.data_dir, FLAGS.batch_size, FLAGS.seq_length)
    FLAGS.vocab_size = data_loader.vocab_size

    # placeholder for batches of characters
    input_characters = tf.placeholder(tf.int32, [FLAGS.batch_size, FLAGS.seq_length])
    target_characters = tf.placeholder(tf.int32, [FLAGS.batch_size, FLAGS.seq_length])

    # create cell
    lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size, state_is_tuple=True)

    # initialize with zeros
    initial_state = state = lstm.zero_state(FLAGS.batch_size, tf.float32)

    # use embedding to convert ints to float array
    embedding = tf.get_variable("embedding", [FLAGS.vocab_size, embedding_dimension_size])
    inputs = tf.nn.embedding_lookup(embedding, input_characters)

    # flatten back to 2-d because rnn cells only deal with 2d
    inputs = tf.contrib.layers.flatten(inputs)

    # get output and (final) state
    outputs, final_state = lstm(inputs, state)

    # create softmax layer to classify outputs into characters
    softmax_w = tf.get_variable("softmax_w", [lstm_size, FLAGS.vocab_size])
    softmax_b = tf.get_variable("softmax_b", [FLAGS.vocab_size])
    logits = tf.nn.softmax(tf.matmul(outputs, softmax_w) + softmax_b)
    probs = tf.nn.softmax(logits)

    # expected labels will be 1-hot representation of last character of target_characters
    last_characters = target_characters[:,-1]
    last_one_hot = tf.one_hot(last_characters, FLAGS.vocab_size)

    # calculate loss
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=last_one_hot, logits=logits)

    # calculate total loss as mean across all batches
    batch_loss = tf.reduce_mean(cross_entropy)

    # train using adam optimizer
    train_step = tf.train.AdagradOptimizer(0.3).minimize(batch_loss)

    # start session
    sess = tf.InteractiveSession()

    # initialize variables
    sess.run(tf.global_variables_initializer())

    # train!
    num_epochs = 1000
    # loop through epochs
    for e in range(num_epochs):
        # loop through batches
        numpy_state = sess.run(initial_state)
        total_loss = 0.0
        data_loader.reset_batch_pointer()
        for i in range(data_loader.num_batches):
            this_batch = data_loader.next_batch()
            # Initialize the LSTM state from the previous iteration.
            numpy_state, current_loss, _ = sess.run(
                [final_state, batch_loss, train_step],
                feed_dict={initial_state: numpy_state,
                           input_characters: this_batch[0],
                           target_characters: this_batch[1]})
            total_loss += current_loss
        # output total loss
        print("epoch ", e, "loss ", total_loss)

    # break into debug
    pdb.set_trace()

    # calculate accuracy using training set

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare',
                      help='Directory for storing input data')
  parser.add_argument('--batch_size', type=int, default=100,
                      help='minibatch size')
  parser.add_argument('--seq_length', type=int, default=50,
                      help='RNN sequence length')
  FLAGS, unparsed = parser.parse_known_args()
  tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

Update July 20th.

Thank you for your replies. I updated the model to use the dynamic RNN call, which looks like this...

outputs, final_state = tf.nn.dynamic_rnn(initial_state=initial_state, cell=lstm, inputs=inputs, dtype=tf.float32)

This raises a few interesting questions... The batching works through the data set, picking blocks of 50 characters at a time and then moving forward 50 characters to get the next sequence in the batch. If that is used for training, and the loss is calculated only from the predicted final character of the sequence against final character + 1, then there are a whole 49 characters of prediction in each sequence that the loss is never tested against. That seems a little odd.
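
One thing I'm thinking of trying is applying the softmax layer at every timestep and computing the loss against the whole shifted target sequence, something like this (an untested sketch, reusing outputs from the dynamic_rnn call above and the same softmax_w/softmax_b):

# outputs from dynamic_rnn is [batch_size, seq_length, lstm_size], so the same
# softmax weights can be applied at every timestep
outputs_flat = tf.reshape(outputs, [-1, lstm_size])
logits = tf.matmul(outputs_flat, softmax_w) + softmax_b   # raw logits, no softmax here
logits = tf.reshape(logits, [FLAGS.batch_size, FLAGS.seq_length, FLAGS.vocab_size])

# the sparse version takes the integer character ids directly, no one-hot needed
per_char_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_characters, logits=logits)              # [batch_size, seq_length]
batch_loss = tf.reduce_mean(per_char_loss)                # mean over batch and time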

Also, when testing the output I feed it a single character, not 50, then get the prediction and feed that single character back in. Should I be adding to that seed every step? So the first seed is one character, then I add the predicted character so the next call is two characters in sequence, and so on, up to a maximum of my training sequence length? Or does that not matter if I'm passing the updated state back in? That is, does the updated state represent all the preceding characters too?
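
To make that concrete, this is roughly the sampling loop I have in mind. The sample_* tensors are hypothetical (a second copy of the graph built with batch_size=1 and seq_length=1), and I'm assuming TextLoader's vocab/chars attributes for the character/id mapping:

import numpy as np

# seed with a single character; the returned LSTM state should carry the history,
# so only one character needs to be fed per step
state = sess.run(lstm.zero_state(1, tf.float32))
char_id = data_loader.vocab['T']
generated = [char_id]

for _ in range(200):
    feed = {sample_input: [[char_id]], sample_initial_state: state}
    p, state = sess.run([sample_probs, sample_final_state], feed_dict=feed)
    p = p[0] / p[0].sum()                         # renormalise for float rounding
    char_id = int(np.random.choice(len(p), p=p))  # sample the next character id
    generated.append(char_id)

print(''.join(data_loader.chars[c] for c in generated))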

On another point, I found what I think is the main reason for it not reducing... I was calling the softmax twice by mistake...

logits = tf.nn.softmax(tf.matmul(final_output, softmax_w) + softmax_b)
probs = tf.nn.softmax(logits)
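
If I understand the fix correctly, logits should stay as the raw linear output, and only the loss (internally) or an explicit probs op for sampling should apply the softmax:

# raw logits; softmax_cross_entropy_with_logits applies the softmax internally
logits = tf.matmul(final_output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)   # only needed for sampling, not for the loss
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=last_one_hot, logits=logits)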

Your function lstm() is only one cell, not a sequence of cells. For a sequence, the cell needs to be applied at every timestep, with the whole sequence passed as input; concatenating the embedded inputs and passing them through a single cell won't work. Instead, use the dynamic_rnn method for a sequence.

Also, softmax is applied twice: once in the logits and again inside the cross-entropy, which needs to be fixed.
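
For example, something along these lines (a sketch of the unrolled version; the key point is that the embedded inputs keep their 3-D [batch, time, features] shape and are not flattened):

# keep the 3-D shape [batch_size, seq_length, embedding_dimension_size];
# dynamic_rnn unrolls the cell over the time dimension itself, so do NOT flatten
inputs = tf.nn.embedding_lookup(embedding, input_characters)

outputs, final_state = tf.nn.dynamic_rnn(cell=lstm,
                                         inputs=inputs,
                                         initial_state=initial_state,
                                         dtype=tf.float32)

# outputs is [batch_size, seq_length, lstm_size] -- one output per character,
# so the softmax layer (and the loss) can be applied at every timestep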
