
Keras with TensorFlow backend --- MemoryError in model.fit() with checkpoint callbacks

I am trying to train an autoencoder. Keras keeps raising a MemoryError from model.fit(), and it always happens as soon as I add any validation-related argument, such as validation_split, to model.fit.

Error:

Traceback (most recent call last):
  File "/root/abnormal-spatiotemporal-ae/start_train.py", line 53, in <module>
    train(dataset=dataset, job_folder=job_folder, logger=logger)
  File "/root/abnormal-spatiotemporal-ae/classifier.py", line 109, in train
    callbacks=[snapshot, earlystop, history_log]
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/engine/training.py", line 990, in fit
    y, val_y = (slice_arrays(y, 0, split_at),
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/generic_utils.py", line 528, in slice_arrays
    return [None if x is None else x[start:stop] for x in arrays]
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/generic_utils.py", line 528, in <listcomp>
    return [None if x is None else x[start:stop] for x in arrays]
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/keras/utils/io_utils.py", line 110, in __getitem__
    return self.data[idx]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/root/anaconda3/envs/py35/lib/python3.5/site-packages/h5py/_hl/dataset.py", line 485, in __getitem__
    arr = numpy.ndarray(mshape, new_dtype, order='C')
MemoryError
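The trace points at Keras's validation_split machinery: training.py computes a split index and slices x and y with slice_arrays, and for an HDF5Matrix that slice is forwarded to h5py, which allocates a brand-new in-memory NumPy array for the requested rows. A simplified sketch of the effect (my reading of the trace, not code from the project):

# What validation_split=0.15 effectively does to an HDF5Matrix input:
split_at = int(len(data) * (1.0 - 0.15))   # 12920 of 15200 samples
x_train = data[0:split_at]                 # materialized as one huge ndarray
x_val = data[split_at:len(data)]           # second in-memory copy
# The allocation in h5py's dataset.py (numpy.ndarray(mshape, ...)) is
# what raises the MemoryError.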

Code:

import os
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.utils.io_utils import HDF5Matrix

data = HDF5Matrix(os.path.join(video_root_path, '{0}/{0}_train_t{1}.h5'.format(dataset, time_length)),
                  'data')

snapshot = ModelCheckpoint(os.path.join(job_folder,
           'model_snapshot_e{epoch:03d}_{val_loss:.6f}.h5'))
earlystop = EarlyStopping(patience=10)
history_log = LossHistory(job_folder=job_folder, logger=logger)  # custom callback defined elsewhere

logger.info("Initializing training...")

history = model.fit(
    data,
    data,
    batch_size=batch_size,
    epochs=nb_epoch,
    validation_split=0.15,
    shuffle='batch',
    callbacks=[snapshot, earlystop, history_log]
)

When I remove validation_split=0.15 from model.fit and the snapshot checkpoint from the callbacks, the code runs correctly. A sketch of how validation might be kept without the in-memory slice follows.
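HDF5Matrix accepts start and end arguments, and validation_data is consumed batch by batch, so the split can be done on disk instead of in memory. A minimal sketch, assuming the same names as above (h5_path stands for the os.path.join(...) expression used earlier; the 85/15 split point is illustrative):

n_total = 15200
split = int(n_total * 0.85)
train_data = HDF5Matrix(h5_path, 'data', start=0, end=split)     # on-disk view
val_data = HDF5Matrix(h5_path, 'data', start=split, end=n_total) # on-disk view

history = model.fit(
    train_data,
    train_data,
    batch_size=batch_size,
    epochs=nb_epoch,
    validation_data=(val_data, val_data),
    shuffle='batch',
    callbacks=[snapshot, earlystop, history_log]
)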

The data variable holds all processed images of the training dataset; its shape is (15200, 8, 224, 224, 1), i.e. 6,101,401,600 elements. The code runs on a machine with 64 GB of RAM and a Tesla P100, so storage space is not a concern, and the Python build is 64-bit.
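For scale, a quick back-of-the-envelope check (the on-disk dtype is my assumption; the question does not state it):

import numpy as np

shape = (15200, 8, 224, 224, 1)
n_elements = int(np.prod(shape))    # 6,101,401,600, matching the size above
print(n_elements * 4 / 1024**3)     # ~22.7 GiB if the data is float32
print(n_elements * 8 / 1024**3)     # ~45.5 GiB if it is float64

Because the autoencoder passes data as both x and y, the validation_split slicing requests two such sets of copies, so even the float32 case asks for roughly 45 GiB of fresh allocations, and the float64 case cannot fit in 64 GB at all.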

Model:

from keras.layers import (Input, TimeDistributed, Conv2D, BatchNormalization,
                          Activation, ConvLSTM2D, Conv2DTranspose)

# t is the number of frames per sample (8, per the data shape above)
input_tensor = Input(shape=(t, 224, 224, 1))

conv1 = TimeDistributed(Conv2D(128, kernel_size=(11, 11), padding='same', strides=(4, 4), name='conv1'),
                        input_shape=(t, 224, 224, 1))(input_tensor)
conv1 = TimeDistributed(BatchNormalization())(conv1)
conv1 = TimeDistributed(Activation('relu'))(conv1)

conv2 = TimeDistributed(Conv2D(64, kernel_size=(5, 5), padding='same', strides=(2, 2), name='conv2'))(conv1)
conv2 = TimeDistributed(BatchNormalization())(conv2)
conv2 = TimeDistributed(Activation('relu'))(conv2)

convlstm1 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm1')(conv2)
convlstm2 = ConvLSTM2D(32, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm2')(convlstm1)
convlstm3 = ConvLSTM2D(64, kernel_size=(3, 3), padding='same', return_sequences=True, name='convlstm3')(convlstm2)

deconv1 = TimeDistributed(Conv2DTranspose(128, kernel_size=(5, 5), padding='same', strides=(2, 2), name='deconv1'))(convlstm3)
deconv1 = TimeDistributed(BatchNormalization())(deconv1)
deconv1 = TimeDistributed(Activation('relu'))(deconv1)

decoded = TimeDistributed(Conv2DTranspose(1, kernel_size=(11, 11), padding='same', strides=(4, 4), name='deconv2'))(deconv1)

This question ran into the same problem. The explanation given there is that there are too many data points before the flatten layer, which overflows RAM, and that adding further convolutional layers solves it; rough numbers for the encoder above are sketched below.
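To put numbers on that explanation, here are per-frame activation counts for the encoder above (my own arithmetic, not taken from the linked question):

# Per-frame activation counts (height * width * channels):
print(224 * 224 * 1)    # input frame:                      50,176 values
print(56 * 56 * 128)    # after conv1 (128 filters, /4):   401,408 values
print(28 * 28 * 64)     # after conv2 (64 filters, /2):     50,176 values

Each extra strided convolution divides the spatial grid by its stride, so the activation count shrinks by the stride squared (times any channel change), which is the memory-saving effect the linked explanation describes.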

