How can I feed a large amount of data to a Keras model for training without running out of RAM?

Question

I'm trying to train a TensorFlow Keras model for a sequential image classification task. The model itself is a simple CNN-RNN that I've previously used to classify 1-D signals without any problems.

I am having trouble loading the data needed to train the model on my computer: the RAM fills up and the whole process crashes.

My data looks like this:

(batch, timesteps, height, width, channels) = (batch, 30, 300, 600, 3)

My data pipeline runs in this order:

1. Use glob.glob to collect all the files from one folder into a list.
2. Load all the data from one file, which gives an array of roughly (50, 30, 300, 600, 3).
3. Add the array from each file to a continuously growing list using list.append.
4. Once the data from every file has been appended, call np.vstack to create the final array for training/validation.

The process above worked, but given the size of the data I don't think list.append followed by np.vstack is a good approach for images.
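
For a rough sense of scale, the footprint of one file can be estimated from the shape given above. The sketch below is only back-of-the-envelope arithmetic: the dtypes and `n_files` are assumptions (the question states neither), and the factor of two reflects that np.vstack allocates a new array while the list of per-file arrays is still alive. One file comes to about 0.75 GiB as uint8 and about 3 GiB as float32.

```python
import numpy as np


def array_nbytes(shape, dtype):
    """Bytes needed to hold an array of the given shape and dtype in RAM."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize


per_file_shape = (50, 30, 300, 600, 3)  # one file, as described in the question
n_files = 10  # hypothetical file count, for illustration only

for dtype in (np.uint8, np.float32, np.float64):
    per_file = array_nbytes(per_file_shape, dtype)
    # The list of per-file arrays and the np.vstack copy coexist in memory,
    # so the peak is roughly twice the size of the full dataset.
    peak = 2 * n_files * per_file
    print(f"{np.dtype(dtype).name:>8}: {per_file / 2**30:5.2f} GiB per file, "
          f"about {peak / 2**30:6.1f} GiB peak for {n_files} files")
```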

Is there a way to save the data as TFRecord files to reduce the overall size? Or is there a way to set up a data input pipeline so that the data is loaded in smaller chunks?

Any help is much appreciated. Thank you in advance.

Answer 1

What you need is called a DataGenerator.

Right now your code probably looks like this:

import numpy as np
from keras.models import Sequential

# Load entire dataset
X, y = np.load('some_training_set_with_labels.npy')

# Design model
model = Sequential()
[...] # Your architecture
model.compile()

# Train model on your dataset
model.fit(x=X, y=y)

Your data generator will look something like this:

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, batch_size=32, dim=(32,32,32), n_channels=1,
                 n_classes=10, shuffle=True):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('data/' + ID + '.npy')

            # Store class
            y[i] = self.labels[ID]

        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
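
One detail of the class above is worth spelling out: `__len__` rounds down, so samples that do not fill a complete batch are skipped in each epoch (a different set each time when shuffling is on). The sketch below reproduces only that index arithmetic with NumPy; `batch_indexes` is a hypothetical helper, not part of Keras.

```python
import numpy as np


def batch_indexes(n_samples, batch_size, shuffle=True, seed=0):
    """Mimic the index logic of DataGenerator for a single epoch."""
    rng = np.random.default_rng(seed)
    indexes = np.arange(n_samples)
    if shuffle:
        rng.shuffle(indexes)  # what on_epoch_end does
    n_batches = n_samples // batch_size  # same value as int(np.floor(n / batch_size))
    # Same slicing as __getitem__: batch i covers [i * batch_size, (i + 1) * batch_size)
    return [indexes[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]


batches = batch_indexes(n_samples=10, batch_size=4)
used = np.concatenate(batches)
print(len(batches), "full batches;", 10 - len(used), "samples skipped this epoch")
```

Rounding up instead would produce a final short batch, in which case X and y in `__data_generation` would need to be sized by `len(list_IDs_temp)` rather than `self.batch_size`, since np.empty leaves unused rows uninitialized.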

We then have to modify our Keras script so that it accepts the generator we just created.

import numpy as np

from keras.models import Sequential
from my_classes import DataGenerator

# Parameters
params = {'dim': (32,32,32),
          'batch_size': 64,
          'n_classes': 6,
          'n_channels': 1,
          'shuffle': True}

# Datasets
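partition = ...  # IDs: dict of sample ID lists, e.g. {'train': [...], 'validation': [...]}
labels = ...     # Labels: dict mapping each sample ID to its integer class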

# Generators
training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)

# Design model
model = Sequential()
[...] # Architecture
model.compile()

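# Train model on dataset
# (model.fit accepts a Sequence directly; fit_generator is deprecated since TensorFlow 2.1.
#  In Keras 3, workers and use_multiprocessing are passed to the Sequence constructor instead.)
model.fit(training_generator,
          validation_data=validation_generator,
          use_multiprocessing=True,
          workers=4)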

See the Stanford University website for more details, although it is a bit dated. The pyimagesearch tutorial covers more recent material.
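
Applied to the layout in the question, where each file holds many sequences, the same idea could be sketched as follows. This assumes the files are stored as .npy (the question does not say), so that np.load with mmap_mode="r" can read single rows without loading the whole file. `FileBatches`, the path-to-labels dict, and the float32 cast are all hypothetical choices for illustration. To pass it to model.fit, the class would derive from keras.utils.Sequence instead of object.

```python
import os
import tempfile

import numpy as np


class FileBatches:
    """Serve batches of sequences from several .npy files, one row at a time."""

    def __init__(self, paths, labels, batch_size):
        self.paths = list(paths)
        self.labels = labels  # hypothetical: dict mapping path -> 1-D label array
        self.batch_size = batch_size
        # Build a flat (file number, row) index; only the file headers are read here.
        self.index = []
        for file_no, path in enumerate(self.paths):
            n_rows = np.load(path, mmap_mode="r").shape[0]
            self.index.extend((file_no, row) for row in range(n_rows))

    def __len__(self):
        return len(self.index) // self.batch_size

    def __getitem__(self, batch_no):
        pairs = self.index[batch_no * self.batch_size:(batch_no + 1) * self.batch_size]
        # Each memory-mapped read touches only the requested row on disk.
        X = np.stack([np.load(self.paths[f], mmap_mode="r")[r] for f, r in pairs])
        y = np.array([self.labels[self.paths[f]][r] for f, r in pairs])
        return X.astype(np.float32), y


# Tiny stand-in files so the sketch runs; real files would be (50, 30, 300, 600, 3).
tmp_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
paths, labels = [], {}
for i in range(2):
    path = os.path.join(tmp_dir, f"clip_{i}.npy")
    np.save(path, rng.integers(0, 256, size=(5, 3, 4, 6, 3), dtype=np.uint8))
    paths.append(path)
    labels[path] = rng.integers(0, 2, size=5)

file_batches = FileBatches(paths, labels, batch_size=4)
X, y = file_batches[0]
print(len(file_batches), X.shape, X.dtype, y.shape)
```

With this layout only one batch is held in RAM at a time, at the cost of re-reading rows from disk every epoch.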
