
Keras with Tensorflow backend --- MemoryError in model.fit() with checkpoint callbacks

I'm trying to train an autoencoder. It keeps getting a MemoryError from Keras in model.fit(), and it always happens when I add validation-related parameters to model.fit, such as validation_split.

Error:

Traceback (most recent call last):
  File "/root/abnormal-spatiotemporal-ae/start_train.py", line 53, in <module>
    train(dataset=dataset, job_folder=job_folder, logger=logger)
  File "/root/abnormal-spatiotemporal-ae/classifier.py", line 109, in train
    callbacks=[snapshot, earlystop, history_log]
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 990, in fit
    y, val_y = (slice_arrays(y, 0, split_at),
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/generic_utils.py", line 528, in slice_arrays
    return [None if x is None else x[start:stop] for x in arrays]
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/generic_utils.py", line 528, in <listcomp>
    return [None if x is None else x[start:stop] for x in arrays]
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/io_utils.py", line 110, in __getitem__
    return self.data[idx]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/h5py/_hl/dataset.py", line 485, in __getitem__
    arr = numpy.ndarray(mshape, new_dtype, order='C')
MemoryError
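As far as I can tell from the traceback, setting validation_split makes Keras slice x and y at the split point, and slicing an HDF5Matrix reads the whole slice into a NumPy array. A minimal sketch of that code path (the file name 'train.h5' is a placeholder):

from keras.utils.io_utils import HDF5Matrix

x = HDF5Matrix('train.h5', 'data')   # placeholder file name
# validation_split=0.15 on 15200 samples splits at index 12920;
# slice_arrays then evaluates x[0:12920], which materializes the slice
# as a full numpy.ndarray -- this is where the MemoryError is raised.
chunk = x[0:12920]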

Code:

import os

from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.utils.io_utils import HDF5Matrix

# Lazily-loaded HDF5 dataset of training clips
data = HDF5Matrix(os.path.join(video_root_path, '{0}/{0}_train_t{1}.h5'.format(dataset, time_length)),
                  'data')

# Save a snapshot of the model after every epoch, tagged with the validation loss
snapshot = ModelCheckpoint(os.path.join(job_folder,
           'model_snapshot_e{epoch:03d}_{val_loss:.6f}.h5'))
earlystop = EarlyStopping(patience=10)
history_log = LossHistory(job_folder=job_folder, logger=logger)  # custom callback from this project

logger.info("Initializing training...")

history = model.fit(
    data,              # x: the autoencoder reconstructs its own input
    data,              # y: identical to x
    batch_size=batch_size,
    epochs=nb_epoch,
    validation_split=0.15,
    shuffle='batch',
    callbacks=[snapshot, earlystop, history_log]
)

The code runs correctly when I remove validation_split=0.15 from model.fit and snapshot from the callbacks.
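One workaround I'm considering (untested sketch): carve the validation set out of the HDF5 file with HDF5Matrix's start/end arguments and pass it through validation_data, so Keras never has to slice the full array in memory. The split index and the train_data/val_data names below are mine, not from the actual script:

from keras.utils.io_utils import HDF5Matrix

h5_path = os.path.join(video_root_path,
                       '{0}/{0}_train_t{1}.h5'.format(dataset, time_length))
n_samples = 15200
split_at = int(n_samples * 0.85)   # same 85/15 split as validation_split=0.15

train_data = HDF5Matrix(h5_path, 'data', start=0, end=split_at)
val_data = HDF5Matrix(h5_path, 'data', start=split_at, end=n_samples)

history = model.fit(
    train_data, train_data,
    batch_size=batch_size,
    epochs=nb_epoch,
    validation_data=(val_data, val_data),
    shuffle='batch',
    callbacks=[snapshot, earlystop, history_log]
)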

The data variable contains all processed images from the training dataset; its shape is (15200, 8, 224, 224, 1) and its size is 6101401600 elements. This code runs on a computer with 64GB of RAM and a Tesla P100, so memory should not be a concern, and my Python is 64-bit.
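For reference, here is the rough arithmetic (assuming the HDF5 dataset is stored as float32; the dtype is an assumption on my part):

import numpy as np

shape = (15200, 8, 224, 224, 1)
n_elements = int(np.prod(shape))        # 6,101,401,600 elements, matching the size above
gib_float32 = n_elements * 4 / 1024**3  # ~22.7 GiB if the dataset is float32
print(n_elements, round(gib_float32, 1))
# Slicing 85% of it for the training split would allocate a ~19 GiB array,
# and the same again for y (x and y are the same HDF5Matrix), which can
# exhaust 64 GB of RAM once the validation slices, the HDF5 cache and
# TensorFlow itself are added on top.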

Model:

from keras.layers import (Input, TimeDistributed, Conv2D, Conv2DTranspose,
                          ConvLSTM2D, BatchNormalization, Activation)

input_tensor = Input(shape=(t, 224, 224, 1))

# Spatial encoder: two strided convolutions applied frame by frame
conv1 = TimeDistributed(Conv2D(128, kernel_size=(11, 11), padding='same', strides=(4, 4), name='conv1'),
                        input_shape=(t, 224, 224, 1))(input_tensor)
conv1 = TimeDistributed(BatchNormalization())(conv1)
conv1 = TimeDistributed(Activation('relu'))(conv1)

conv2 = TimeDistributed(Conv2D(64, kernel_size=(5, 5), padding='same', strides=(2, 2), name='conv2'))(conv1)
conv2 = TimeDistributed(BatchNormalization())(conv2)
conv2 = TimeDistributed(Activation('relu'))(conv2)

# Temporal bottleneck: stacked ConvLSTM layers
convlstm1 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm1')(conv2)
convlstm2 = ConvLSTM2D(32, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm2')(convlstm1)
convlstm3 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm3')(convlstm2)

# Spatial decoder: transposed convolutions back to the input resolution
deconv1 = TimeDistributed(Conv2DTranspose(128, kernel_size=(5, 5), padding='same', strides=(2, 2), name='deconv1'))(convlstm3)
deconv1 = TimeDistributed(BatchNormalization())(deconv1)
deconv1 = TimeDistributed(Activation('relu'))(deconv1)

decoded = TimeDistributed(Conv2DTranspose(1, kernel_size=(11, 11), padding='same', strides=(4, 4), name='deconv2'))(
    deconv1)

This question faced the same problem. The explanation there was that there were too many data points before the flattening layer, which caused the RAM to overflow. It was solved by adding an additional convolution layer.
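To check where the intermediate tensors are largest in my model, I can wrap the tensors above into a Model and print the per-layer output shapes (the compile settings below are placeholders, not the ones I actually train with):

from keras.models import Model

model = Model(inputs=input_tensor, outputs=decoded)
model.compile(optimizer='adam', loss='mse')   # placeholder optimizer/loss
model.summary()
# For shape=(t, 224, 224, 1): conv1 outputs (None, t, 56, 56, 128),
# conv2 and the ConvLSTM layers run at 28x28, and the decoder upsamples
# back to (None, t, 224, 224, 1).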
