
Keras training crashes mid-epoch after multiple successful epochs

I am trying to create a CuDNNGRU-based model that predicts a sequence of 7 interrelated features. Here's my Keras model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
cu_dnngru_1 (CuDNNGRU)       (None, 49, 100)           32700
_________________________________________________________________
dropout_1 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_2 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_2 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_3 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_3 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_4 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_4 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_5 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_5 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_6 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_6 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
cu_dnngru_7 (CuDNNGRU)       (None, 49, 100)           60600
_________________________________________________________________
dropout_7 (Dropout)          (None, 49, 100)           0
_________________________________________________________________
flatten_1 (Flatten)          (None, 4900)              0
_________________________________________________________________
dense_1 (Dense)              (None, 7)                 34307
=================================================================
Total params: 430,607
Trainable params: 430,607
Non-trainable params: 0

I'm trying to run this model for a higher number of epochs. The first few epochs are fine, but then it errors out:

[Model] Model Compiled
Time taken: 0:00:02.314468
[Model] Training Started
[Model] 100 epochs, 1000 batch size, 20.0 batches per epoch
Epoch 1/100
20/20 [==============================] - 5s 240ms/step - loss: 0.1631 - acc: 0.2905
Epoch 2/100
20/20 [==============================] - 2s 81ms/step - loss: 0.1288 - acc: 0.2455
Epoch 3/100
20/20 [==============================] - 1s 73ms/step - loss: 0.0952 - acc: 0.5058
Epoch 4/100
20/20 [==============================] - 2s 76ms/step - loss: 0.1141 - acc: 0.3288
Epoch 5/100
20/20 [==============================] - 2s 75ms/step - loss: 0.1064 - acc: 0.3425
Epoch 6/100
20/20 [==============================] - 1s 75ms/step - loss: 0.0767 - acc: 0.4213
Epoch 7/100
20/20 [==============================] - 1s 75ms/step - loss: 0.0635 - acc: 0.4764
Epoch 8/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0555 - acc: 0.5274
Epoch 9/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0544 - acc: 0.5141
Epoch 10/100
...
Epoch 61/100
20/20 [==============================] - 1s 74ms/step - loss: 0.0506 - acc: 0.3925
Epoch 62/100
20/20 [==============================] - 1s 72ms/step - loss: 0.0495 - acc: 0.4323
Epoch 63/100
20/20 [==============================] - 1s 73ms/step - loss: 0.0495 - acc: 0.4118
Epoch 64/100
 2/20 [==>...........................] - ETA: 1s - loss: 0.0495 - acc: 0.4885Traceback (most recent call last):
  File "./run.py", line 118, in <module>
    main()
  File "./run.py", line 92, in main
    steps_per_epoch=steps_per_epoch)
  File "/home/sridhar/PE_CSV/alarmProj/rnn/lstm/core/model.py", line 149, in train_generator
    workers=70)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 1415, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training_generator.py", line 213, in fit_generator
    class_weight=class_weight)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 1209, in train_on_batch
    class_weight=class_weight)
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training.py", line 749, in _standardize_user_data
    exception_prefix='input')
  File "/home/sridhar/PE_CSV/malenv/local/lib/python2.7/site-packages/keras/engine/training_utils.py", line 127, in standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected cu_dnngru_1_input to have 3 dimensions, but got array with shape (380, 1)

If I reduce the number of epochs to less than the failing one (epoch 64 here), I don't have any issues, but increasing the number of epochs causes the above error at some point. The exact epoch at which it crashes seems to vary with any change to the configuration. The same issue occurs with vanilla GRU/LSTM layers.

This is keras-2.2.2, and fit_generator is called with workers=70 (as shown in the traceback).

Is there something I could do to avoid this issue?

Edit: Here's the relevant code, approximately as used:

session_conf = tf.ConfigProto(
    inter_op_parallelism_threads=multiprocessing.cpu_count(),
    intra_op_parallelism_threads=multiprocessing.cpu_count())
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)

self.model.add(CuDNNGRU(
               100,
               input_shape=(49, 7),
               kernel_initializer='orthogonal',
               return_sequences=True))
self.model.add(Dropout(0.4))

# the six remaining CuDNNGRU/Dropout blocks are identical; input_shape
# is only needed (and only honoured) on the first layer of a Sequential
for _ in range(6):
    self.model.add(CuDNNGRU(
                   100,
                   kernel_initializer='orthogonal',
                   return_sequences=True))
    self.model.add(Dropout(0.4))

self.model.add(Flatten())
self.model.add(Dense(7, activation='relu'))

sgd = SGD(lr=0.1, decay=1e-2, clipnorm=5.0)

self.model.compile(
            loss='mse',
            metrics=["accuracy"],
            optimizer=sgd)
===================

def train_generator(self, data_gen, epochs, batch_size, steps_per_epoch):
    timer = Timer()
    timer.start()
    print('[Model] Training Started')
    print('[Model] %s epochs, %s batch size, %s batches per epoch' %
          (epochs, batch_size, steps_per_epoch))

    save_fname = '%s/%s-e%s.h5' % (self.model_dir, dt.datetime.now()
                                   .strftime('%d%m%Y-%H%M%S'), str(epochs))
    callbacks = [
        ModelCheckpoint(
            filepath=save_fname, monitor='loss', save_best_only=True)
    ]
    try:
        self.model.fit_generator(
            data_gen,
            steps_per_epoch=steps_per_epoch,
            epochs=epochs,
            callbacks=callbacks,
            workers=70)  # 70 workers, as shown in the traceback
    except:
        pdb.set_trace()

    print('[Model] Training Completed. Model saved as %s' % save_fname)
    timer.stop()
=============
# invoked from the main function
model.train_generator(
    data_gen=data.generate_train_batch(
        seq_len=50,
        batch_size=1000,
        normalise=False),
    epochs=100,
    batch_size=1000,
    steps_per_epoch=steps_per_epoch)
=============

    def generate_train_batch(self, seq_len, batch_size, normalise):
        '''Yield a generator of training data from filename on given list of cols split for train/test'''
        i = 0
        while i < (self.len_train - seq_len):
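            # NOTE: this outer loop terminates once the data is consumed,
            # making the generator finite -- the root cause noted below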
            x_batch = []
            y_batch = []
            for b in range(batch_size):
                if i >= (self.len_train - seq_len):
                    # stop-condition: yield a smaller final batch if the
                    # data doesn't divide evenly
                    yield np.array(x_batch), np.array(y_batch)
                x, y = self._next_window(i, seq_len, normalise)
                x_batch.append(x)
                y_batch.append(y)
                i += 1

            yield np.array(x_batch), np.array(y_batch)
=======================

The generator was wrong: it is finite, whereas Keras's fit_generator expects a generator that yields batches indefinitely. Once the training data was exhausted (here, partway through epoch 64), the generator emitted a malformed partial batch with shape (380, 1) instead of (batch_size, 49, 7), producing the input-dimension error above. Because the point of exhaustion depends on the dataset size, sequence length, and batch size, the failing epoch shifts whenever the configuration changes, and since the bug is in the data pipeline rather than the layers, vanilla GRU/LSTM layers show it too.
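A minimal sketch of a fix, assuming the original self.len_train attribute and self._next_window helper: loop forever and wrap the window index back to zero, so the generator never exhausts and every yielded batch has the full (batch_size, seq_len, features) shape:

    def generate_train_batch(self, seq_len, batch_size, normalise):
        '''Yield full-sized training batches indefinitely, wrapping back
        to the start of the training data once it is exhausted.'''
        i = 0
        while True:  # fit_generator expects an endless stream of batches
            x_batch = []
            y_batch = []
            for b in range(batch_size):
                if i >= (self.len_train - seq_len):
                    i = 0  # wrap around instead of yielding a partial batch
                x, y = self._next_window(i, seq_len, normalise)
                x_batch.append(x)
                y_batch.append(y)
                i += 1
            yield np.array(x_batch), np.array(y_batch)

Alternatively, a keras.utils.Sequence subclass (available in Keras 2.2.2) can be passed to fit_generator instead: it is allowed to be finite, defines the number of batches via __len__, and is safer to use with multiple workers.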
