Multicell LSTM RNN returns nan training error

Question

I'm trying to train this multi-cell RNN, which uses 4 LSTM cell layers (for training, you can ignore the m_t+1 -> m_t part).

The Encoder and Decoder are just fully connected layers. G_t and m_t are float vectors of size 6 and 69, respectively. P_t and m_t+1 have the same sizes. The RNN runs over 48 time steps. For some reason, my training doesn't work at all, and I'd like to know what's wrong with my code.

The cost function and the rest of my code look like this:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

n_steps = 48
n_neurons = 512
n_layers = 4
NUM_OF_INPUTS = 6 + 69
NUM_OF_OUTPUTS = 6 + 69
EPOCHS = 50
sample_size = 12494
batch_size = 128
total_batch = int(sample_size / batch_size)
global_step = tf.Variable(0, trainable=False)
prop_valid = 0.1
time_stamp = 48

def mini_batch(data, bs, i):
    return data[i*bs : i*bs+bs,:,:]

# Both X_data_np and Y_data_np are three-dimensional (samples, time steps, features),
# which is the input shape tf.nn.dynamic_rnn expects
X_data_np = np.load('X_data.npy')
Y_data_np = np.load('Y_data.npy')
data = np.concatenate([X_data_np, Y_data_np], axis=-1)
np.random.shuffle(data)
#standardize data
mean = np.mean(data)
data = data - mean
std = np.std(data)
data = data / std

train_size = int(sample_size * (1 - prop_valid))
valid_size = int((sample_size - train_size))

train_input = data[:train_size, :, :NUM_OF_INPUTS]
train_label = data[:train_size, :, NUM_OF_INPUTS:]
valid_input = data[train_size:train_size + valid_size, :,:NUM_OF_INPUTS]
valid_label = data[train_size:train_size + valid_size, :,NUM_OF_INPUTS:]

X = tf.placeholder(tf.float32, [None, n_steps, NUM_OF_INPUTS])
Y = tf.placeholder(tf.float32, [None, n_steps, NUM_OF_OUTPUTS])
encoded_inputs = tf.layers.dense(X, 256)
layers = [tf.contrib.rnn.LSTMCell(num_units = n_neurons, activation=tf.nn.tanh) for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
outputs, _ = tf.nn.dynamic_rnn(multi_layer_cell, encoded_inputs, dtype=tf.float32)
prediction = tf.layers.dense(outputs, NUM_OF_OUTPUTS) 

# Y (defined above) has shape (?, 48, 75)
distance = tf.norm(prediction[:,:,6:75] - Y[:,:,6:75], axis = 2)  # (?, 48)
distance_square = tf.square(distance)
#Add all the sum
reduced_distance = tf.math.reduce_sum(distance_square, axis= 1)  # (?, )
#Mean of all mini batch data
train_loss = tf.math.reduce_mean(reduced_distance, axis= 0) # ()

learning_rate = 0.001
trainOptimizer = tf.train.AdamOptimizer(learning_rate).minimize(train_loss, global_step=global_step)

sess = tf.Session()
tf.global_variables_initializer().run(session=sess)

for epoch in range(EPOCHS):
    for batch_idx in range(total_batch):
        train_batch_input = mini_batch(train_input, batch_size, batch_idx)
        train_batch_label = mini_batch(train_label, batch_size, batch_idx)
        _, loss= sess.run([trainOptimizer, train_loss], feed_dict={X:train_batch_input,Y:train_batch_label})
    if (epoch+1) % 10 == 0:
        prediction2 = sess.run(prediction, feed_dict={X:valid_input})
        valid_error = np.mean(np.sum(np.square(np.linalg.norm(prediction2[:,:,6:75] - valid_label[:,:,6:75], axis = 2)), axis = 1), axis = 0)
        print("Epoch: %05d tL: %.4f vE: %.4f" % (epoch+1, loss, valid_error))

The result is as follows

Epoch: 00010 tL: nan vE: 4.3044
Epoch: 00020 tL: nan vE: 4.3114
Epoch: 00030 tL: nan vE: 4.2962
Epoch: 00040 tL: nan vE: 4.3009
Epoch: 00050 tL: nan vE: 4.2899

The training loss is always nan, no matter how small the training data is, so I think the fundamental problem is in the code where I train the network. The validation error is NOT nan, so I assume the data itself contains no nan values. Is there a critical issue I'm not addressing in my code? Any help would be appreciated! Thanks in advance.

Answer 1

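The validation error showed a normal value while the training error did not because some of my mini-batches were sliced past the end of the training set. Those slices come back empty, and the mean loss over an empty batch is nan.

Apparently, combining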

sample_size = 12494
batch_size = 128
total_batch = int(sample_size / batch_size)
train_size = int(sample_size * (1 - prop_valid))

with

for batch_idx in range(total_batch):
    train_batch_input = mini_batch(train_input, batch_size, batch_idx)
    train_batch_label = mini_batch(train_label, batch_size, batch_idx)

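did not make sense: total_batch counted batches over the whole dataset, but the loop slices train_input, which holds only train_size samples. total_batch should have been int(train_size / batch_size).

It was really hard to find because NumPy does not raise an error when a slice goes out of bounds; it just returns a shorter or empty array.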
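As a quick check of the arithmetic, here is a minimal sketch using the constants from the question. The dummy array only stands in for train_input (its trailing dimensions are shrunk to 1 to keep it small), and the names wrong_total_batch, right_total_batch and empty_batches are introduced here purely for illustration.

```python
import numpy as np

sample_size = 12494
batch_size = 128
prop_valid = 0.1

def mini_batch(data, bs, i):
    return data[i * bs : i * bs + bs, :, :]

train_size = int(sample_size * (1 - prop_valid))    # 11244
wrong_total_batch = int(sample_size / batch_size)   # 97, counts the whole dataset
right_total_batch = train_size // batch_size        # 87, counts the training split only

# Dummy stand-in for train_input (assumption: only the first axis matters here)
train_input = np.zeros((train_size, 1, 1), dtype=np.float32)

# Batch indices that slice entirely past the end of the training set
empty_batches = [
    i for i in range(wrong_total_batch)
    if mini_batch(train_input, batch_size, i).shape[0] == 0
]
print(train_size, wrong_total_batch, right_total_batch)
print(empty_batches)
```

With these numbers, batch indices 88 through 96 are empty, and the last batch of every epoch is one of them, which is the loss value that gets printed. Note that floor division drops the final partial batch of 108 samples; rounding up instead would keep it.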
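To illustrate the silent behaviour, the sketch below slices a small toy array past its end and takes the mean of the result. The helper checked_mini_batch is hypothetical (it is not part of the original code); it shows one possible way to make such a mistake fail loudly.

```python
import warnings
import numpy as np

# Toy data: 10 samples, with trailing axes of size 1
data = np.arange(10, dtype=np.float32).reshape(10, 1, 1)

# Slicing past the end does not raise; it returns an empty array
out_of_range = data[12:16, :, :]
print(out_of_range.shape)

# The mean of an empty array is nan (NumPy only emits a RuntimeWarning)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    empty_mean = np.mean(out_of_range)
print(empty_mean)

def checked_mini_batch(data, bs, i):
    """Hypothetical variant of mini_batch that rejects empty batches."""
    batch = data[i * bs : i * bs + bs, :, :]
    if batch.shape[0] == 0:
        raise IndexError(
            "batch %d starts at row %d, but data has only %d rows"
            % (i, i * bs, data.shape[0])
        )
    return batch
```

Unlike slicing, indexing with a single out-of-range integer (for example data[12]) does raise an IndexError, which is part of why the slice version is easy to overlook.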

Anyway, hope it helps people with similar issues in the future!
