Memory error when using Keras ImageDataGenerator

I am attempting to predict features in imagery using Keras with a TensorFlow backend. Specifically, I am attempting to use a Keras ImageDataGenerator. The model is set to run for 4 epochs and runs fine until the 4th epoch, where it fails with a MemoryError.

I am running this model on an AWS g2.2xlarge instance running Ubuntu Server 16.04 LTS (HVM), SSD Volume Type.

The training images are 256x256 RGB pixel tiles (8-bit unsigned) and the training mask is 256x256 single-band (8-bit unsigned) tiled data, where 255 == a feature of interest and 0 == everything else.

The following 3 functions are the ones pertinent to this error.

How can I resolve this MemoryError?


def train_model():
        batch_size = 1
        training_imgs = np.lib.format.open_memmap(filename=os.path.join(data_path, 'data.npy'),mode='r+')
        training_masks = np.lib.format.open_memmap(filename=os.path.join(data_path, 'mask.npy'),mode='r+')
        dl_model = create_model()
        print(dl_model.summary())
        model_checkpoint = ModelCheckpoint(os.path.join(data_path,'mod_weight.hdf5'), monitor='loss',verbose=1, save_best_only=True)
        dl_model.fit_generator(generator(training_imgs, training_masks, batch_size), steps_per_epoch=(len(training_imgs)/batch_size), epochs=4,verbose=1,callbacks=[model_checkpoint])

def generator(train_imgs, train_masks=None, batch_size=None):

# Create empty arrays to contain batch of features and labels#

        if train_masks is not None:
                train_imgs_batch = np.zeros((batch_size,y_to_res,x_to_res,bands))
                train_masks_batch = np.zeros((batch_size,y_to_res,x_to_res,1))

                while True:
                        for i in range(batch_size):
                                # choose random index in features
                                index= random.choice(range(len(train_imgs)))
                                train_imgs_batch[i] = train_imgs[index]
                                train_masks_batch[i] = train_masks[index]
                        yield train_imgs_batch, train_masks_batch
        else:
                rec_imgs_batch = np.zeros((batch_size,y_to_res,x_to_res,bands))
                while True:
                        for i in range(batch_size):
                                # choose random index in features
                                index= random.choice(range(len(train_imgs)))
                                rec_imgs_batch[i] = train_imgs[index]
                        yield rec_imgs_batch

def train_generator(train_images,train_masks,batch_size):
        data_gen_args=dict(rotation_range=90.,horizontal_flip=True,vertical_flip=True,rescale=1./255)
        image_datagen = ImageDataGenerator()
        mask_datagen = ImageDataGenerator()
# # Provide the same seed and keyword arguments to the fit and flow methods
        seed = 1
        image_datagen.fit(train_images, augment=True, seed=seed)
        mask_datagen.fit(train_masks, augment=True, seed=seed)
        image_generator = image_datagen.flow(train_images,batch_size=batch_size)
        mask_generator = mask_datagen.flow(train_masks,batch_size=batch_size)
        return zip(image_generator, mask_generator)

The following is the output from the model, detailing the epochs and the error message:

Epoch 00001: loss improved from inf to 0.01683, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 2/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0049 - binary_crossentropy: 0.0027 - jaccard_coef_int: 0.9983  

Epoch 00002: loss improved from 0.01683 to 0.00492, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 3/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0049 - binary_crossentropy: 0.0026 - jaccard_coef_int: 0.9982  

Epoch 00003: loss improved from 0.00492 to 0.00488, saving model to /home/ubuntu/deep_learn/client_data/mod_weight.hdf5
Epoch 4/4
7569/7569 [==============================] - 3394s 448ms/step - loss: 0.0074 - binary_crossentropy: 0.0042 - jaccard_coef_int: 0.9975  

Epoch 00004: loss did not improve
Traceback (most recent call last):
  File "image_rec.py", line 291, in <module>
    train_model()
  File "image_rec.py", line 208, in train_model
    dl_model.fit_generator(train_generator(training_imgs,training_masks,batch_size),steps_per_epoch=1,epochs=1,workers=1)
  File "image_rec.py", line 274, in train_generator
    image_datagen.fit(train_images, augment=True, seed=seed)
  File "/home/ubuntu/pyvirt_test/local/lib/python2.7/site-packages/keras/preprocessing/image.py", line 753, in fit
    x = np.copy(x)
  File "/home/ubuntu/pyvirt_test/local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 1505, in copy
    return array(a, order=order, copy=True)
MemoryError

It seems your problem is that the data is too large. I can see two solutions. The first is to run your code on a distributed system by means of Spark; I guess you do not have that infrastructure, so let us move on to the other.

The second one is the one I think is viable: I would slice the data and try feeding the model incrementally. We can do this with Dask. This library can slice the data and save it in objects that you can later read back from disk, only the part you want.

If you have an image stored as a 100x100 matrix, we can retrieve each block without needing to load all 100 arrays into memory. We can load the blocks one by one (releasing the previous one), and each block becomes the input to your neural network.

To do this, you can transform your np.array into a Dask array and assign the partitions. For example:

>>> k = np.random.randn(10,10) # Matrix 10x10
>>> import dask.array as da
>>> k2 = da.from_array(k, chunks=3)
>>> k2
dask.array<array, shape=(10, 10), dtype=float64, chunksize=(3, 3)>
>>> k2.to_delayed()
array([[Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 0, 3))],
   [Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 1, 3))],
   [Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 2, 3))],
   [Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 0)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 1)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 2)),
    Delayed(('array-a08c1d25b900d497cdcd233a7c5aa108', 3, 3))]],
  dtype=object)

Here you can see how the data is stored as delayed objects, which you can then retrieve in parts to feed your model.

To implement this solution you must introduce a loop into your function that loads each partition and feeds it to the network, giving you incremental training, as sketched below.
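
For instance, a minimal sketch of such a loop. It assumes the question's training_imgs, training_masks and dl_model, and an arbitrary chunk size; the plain fit() call is only illustrative:

import dask.array as da

# Wrap the memmapped arrays; each chunk holds `chunk_imgs` tiles and is only
# materialised in RAM when .compute() is called.
chunk_imgs = 64  # assumed chunk size, tune to the available memory
imgs_da = da.from_array(training_imgs, chunks=(chunk_imgs, 256, 256, 3))
masks_da = da.from_array(training_masks, chunks=(chunk_imgs, 256, 256, 1))

for start in range(0, imgs_da.shape[0], chunk_imgs):
    # Load one chunk into memory, train on it, then let it be released.
    x_chunk = imgs_da[start:start + chunk_imgs].compute()
    y_chunk = masks_da[start:start + chunk_imgs].compute()
    dl_model.fit(x_chunk, y_chunk, batch_size=1, epochs=1, verbose=1)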

For more information, see the Dask documentation.

You provided quite confusing code (in my opinion); for example, no call to train_generator is visible. I am not sure this is a problem of insufficient memory due to big data, since you use memmap for that, but let's assume for now that it is.

  • If the data is quite big, and since you are loading the images from a directory anyway, it might be worth considering ImageDataGenerator's flow_from_directory method. It would require a slight change of design, though, which might not be what you want.

You can load it in the following manner:

train_datagen = ImageDataGenerator()
train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(256, 256),
        batch_size=batch_size,
        ...  # other configurations)

More on that in the Keras documentation.

  • Also note that if you are on 32-bit, the memmap does not allow more than 2 GB.

  • Do you use tensorflow-gpu, by any chance? Maybe your GPU is not sufficient; you could try this with the plain tensorflow package.

I would strongly suggest trying some memory profiling to see where the bigger memory allocations happen.
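
For example, a minimal sketch with the memory_profiler package (the package choice and the wrapper below are my suggestion, not part of the original code):

# pip install memory_profiler
from memory_profiler import profile

@profile  # prints a line-by-line memory report for each call of the function
def train_model():
    # ... body of the question's train_model() would go here ...
    pass

if __name__ == '__main__':
    train_model()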


If it is not a case of insufficient memory, it might be wrong handling of the data in your model; since your loss function is not improving at all, it could be miswired, for example.


Finally, one last note here: it is good practice to open the memmap of the training data as read-only, since you don't want to accidentally modify the data.
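
For example, a minimal sketch reusing the question's data_path and file names:

import os
import numpy as np

# mode='r' opens the arrays read-only, so the data on disk cannot be modified
# accidentally; mode='r+' (as in the question) allows writes.
training_imgs = np.lib.format.open_memmap(os.path.join(data_path, 'data.npy'), mode='r')
training_masks = np.lib.format.open_memmap(os.path.join(data_path, 'mask.npy'), mode='r')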

UPDATE: I can see that you have updated the post and provided the code for the train_generator method, but there is still no call to that method in the code you posted.

If I assume that you have a typo in the call, i.e. train_generator instead of the generator method in your dl_model.fit_generator call, it is possible that fit_generator is not working on a batch of data but actually on the whole training_imgs, and it copies the whole set in the np.copy(x) call.
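
If that is the case, a minimal sketch of one possible workaround (my assumption, not something from the original post) is to fit the generators on a small slice of the memmapped arrays, so that the np.copy(x) inside ImageDataGenerator.fit only duplicates that slice:

from keras.preprocessing.image import ImageDataGenerator

image_datagen = ImageDataGenerator()
mask_datagen = ImageDataGenerator()
seed = 1
sample_size = 500  # assumed sample size, tune to the available RAM

# Fit on a small slice of the question's memmapped train_images / train_masks
# so only that slice is copied into memory, not the whole training set.
image_datagen.fit(train_images[:sample_size], augment=True, seed=seed)
mask_datagen.fit(train_masks[:sample_size], augment=True, seed=seed)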

Also, as mentioned already, there are indeed a few reported issues with Keras memory leaks when using the fit and fit_generator methods (you can find some of them; here is an open one).

This is common when running 32-bit if the float precision is too high. Are you running 32-bit? You may also consider casting or rounding the array, as in the sketch below.
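
For example, a minimal sketch of such a cast (the shape names are the ones from the question's generator(); the float32 choice is only an illustration):

import numpy as np

batch_size, y_to_res, x_to_res, bands = 1, 256, 256, 3  # values taken from the question

# np.zeros() allocates float64 by default; requesting float32 halves the
# memory footprint of every batch handed to the network.
train_imgs_batch = np.zeros((batch_size, y_to_res, x_to_res, bands), dtype=np.float32)
train_masks_batch = np.zeros((batch_size, y_to_res, x_to_res, 1), dtype=np.float32)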

Generally Keras/TensorFlow is very good with resource usage, but there is a known memory leak that has caused problems in the past. To make sure that is not the one causing your problems, try adding these two lines of code to your training script:

# load the backend
from keras import backend as K

# prevent Tensorflow memory leakage
K.clear_session()

I met the same problem recently. Somehow the FCN-8 code ran successfully on my TensorFlow 1.2 + Keras 2.0.9 + 8 GB RAM + GTX 1060 machine, but hit a memory error when using ModelCheckpoint on my TensorFlow 1.4 + Keras 2.1.5 + 16 GB RAM + GTX 1080 Ti machine.
