Loading batches of images in Keras from pandas dataframe

I have a pandas dataframe with two columns: one holds paths to images and the other holds string class labels.

I have also written the following function, which loads the images listed in the dataframe, renormalizes them and converts the class labels to one-hot vectors.

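import numpy as np
from keras.utils import to_categorical
# imread is not defined here; any image reader that returns an array will do.
# label2int is a mapping from string class label to integer class index.

def prepare_data(df):
    # Column 0 holds the image paths, column 1 the string labels
    data_X, data_y = df.values[:,0], df.values[:,1]

    # Load images
    data_X = np.array([np.array(imread(fname)) for fname in data_X])

    # Normalize input from [0, 255] to [-0.5, 0.5]
    data_X = data_X / 255 - 0.5

    # Prepare labels: string -> integer -> one-hot vector
    data_y = np.array([label2int[label] for label in data_y])
    data_y = to_categorical(data_y)

    return data_X, data_y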

I want to feed this dataframe to a Keras CNN, but the whole dataset is too big to be loaded into memory at once.

Other answers on this site tell me that I should use a Keras ImageDataGenerator for this, but honestly I do not understand from the documentation how to do it.

What is the easiest way of feeding the data to the model in lazily loaded batches?

If it is an ImageDataGenerator, how do I create one that takes the dataframe on initialization and passes the batches through my function to create the appropriate numpy arrays? And how do I fit the model using the ImageDataGenerator?

ImageDataGenerator is a high-level class that yields data from multiple sources (numpy arrays, directories, ...) and includes utility functions for image augmentation and the like.

UPDATE

As of keras-preprocessing 1.0.4, ImageDataGenerator comes with a flow_from_dataframe method, which addresses your case. It requires dataframe and directory arguments defined as follows:

dataframe: Pandas dataframe containing the filenames of the
           images in a column and classes in another or column/s
           that can be fed as raw target data.
directory: string, path to the target directory that contains all
           the images mapped in the dataframe.

So there is no longer any need to implement it yourself.


Original answer below

In your case, with the dataframe as you describe it, you could also write your own custom generator that reuses the logic in your prepare_data function, as a more minimalistic solution. It is good practice to use Keras' Sequence object for this, since it supports multiprocessing (which helps to avoid bottlenecking your GPU, if you are using one).

You can check out the docs on the Sequence object; they contain an implementation example. In the end, your code would look something like this (this is boilerplate code; you will have to add specifics such as your label2int mapping and the image preprocessing logic):

import math
import random

import numpy as np
from keras.utils import Sequence
# imread is not defined in this snippet; use the same image reader as in
# prepare_data (any function that returns the image as an array).

class DataSequence(Sequence):
    """
    Keras Sequence object to train a model on larger-than-memory data.
    """
    def __init__(self, df, batch_size, mode='train'):
        self.df = df # your pandas dataframe
        self.bsz = batch_size # batch size
        self.mode = mode # shuffle when in train mode

        # Keep the labels and the list of image locations in memory
        self.labels = self.df['label'].values
        self.im_list = self.df['image_name'].tolist()

        # Build the index order used by the first epoch
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller)
        return int(math.ceil(len(self.df) / float(self.bsz)))

    def on_epoch_end(self):
        # Reshuffle the sample indexes after each epoch if in training mode
        self.indexes = list(range(len(self.im_list)))
        if self.mode == 'train':
            random.shuffle(self.indexes)

    def batch_indexes(self, idx):
        # Sample indexes that belong to batch number idx
        return self.indexes[idx * self.bsz: (idx + 1) * self.bsz]

    def get_batch_labels(self, idx):
        # Fetch a batch of labels (still raw labels, not one-hot vectors)
        return self.labels[self.batch_indexes(idx)]

    def get_batch_features(self, idx):
        # Fetch a batch of inputs, reading only these images from disk
        return np.array([imread(self.im_list[i]) for i in self.batch_indexes(idx)])

    def __getitem__(self, idx):
        batch_x = self.get_batch_features(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_x, batch_y
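
The boilerplate above returns the labels as stored in the dataframe, whereas the model in the question expects one-hot vectors. A minimal, dependency-free sketch of that conversion for a single batch might look like the following. LABEL2INT is a hypothetical mapping standing in for the asker's label2int, and one_hot_batch is an illustrative helper, not part of Keras. In real code, to_categorical with an explicit num_classes does the same job on numpy arrays.

```python
# Hypothetical label mapping; replace it with your own label2int.
LABEL2INT = {"cat": 0, "dog": 1, "bird": 2}


def one_hot_batch(labels, label2int=LABEL2INT):
    """Convert a batch of string labels to one-hot rows (lists of floats)."""
    # Fix the width from the full mapping, so a batch that happens to
    # lack some class still produces vectors of the right length.
    num_classes = len(label2int)
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label2int[label]] = 1.0
        rows.append(row)
    return rows


batch = one_hot_batch(["dog", "cat", "dog"])
```

The point of fixing the width up front is that a single batch rarely contains every class, so inferring the number of classes per batch would give vectors of varying length.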

You can pass this object to train your model, just like a custom generator:

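sequence = DataSequence(dataframe, batch_size)
# In TensorFlow 2.1 and later, model.fit() accepts a Sequence directly
# and fit_generator() is deprecated.
model.fit_generator(sequence, epochs=1, use_multiprocessing=True)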

As noted below, it is not required to implement the shuffling logic. It suffices to set the shuffle argument to True in the fit_generator() call. From the docs:

shuffle: Boolean. Whether to shuffle the order of the batches at the beginning of each epoch. Only used with instances of Sequence (keras.utils.Sequence). Has no effect when steps_per_epoch is not None.
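
To make the quoted behaviour concrete, here is a small sketch with no Keras dependency. It splits sample indexes into batches using the same ceil-and-slice arithmetic as DataSequence, then shuffles only the order in which the batches are visited, which is what the docs describe. make_batches and epoch_order are hypothetical names used for illustration only.

```python
import math
import random


def make_batches(n_samples, batch_size):
    """Split range(n_samples) into consecutive batches of indexes."""
    n_batches = math.ceil(n_samples / batch_size)
    return [list(range(i * batch_size, min((i + 1) * batch_size, n_samples)))
            for i in range(n_batches)]


def epoch_order(n_batches, shuffle=True, rng=random):
    """Order in which the batches are visited during one epoch."""
    order = list(range(n_batches))
    if shuffle:
        rng.shuffle(order)
    return order


batches = make_batches(n_samples=10, batch_size=4)  # last batch has 2 items
order = epoch_order(len(batches), rng=random.Random(0))
visited = [batches[i] for i in order]
```

Each batch keeps the same members from epoch to epoch; only the visiting order changes. Mixing samples between batches needs extra logic in the Sequence itself, such as an on_epoch_end method.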

I am new to Keras, so take my advice with a grain of salt. I think you should be using a Keras ImageDataGenerator, in particular the flow_from_dataframe option, since you said you have a pandas dataframe. flow_from_dataframe reads columns of the dataframe to get your filenames and your labels.

Below is an example snippet. Tutorials online cover the details.

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(horizontal_flip=True,
                                   vertical_flip=False,
                                   rescale=1/255.0)

train_generator = train_datagen.flow_from_dataframe(
    dataframe=trainDataframe,
    directory=imageDir,
    x_col="file", # name of col in data frame that contains file names
    y_col=y_col_list, # name of col with labels
    has_ext=True, # only accepted by early keras-preprocessing releases
    batch_size=batch_size,
    shuffle=True,
    save_to_dir=saveDir,
    target_size=(img_width,img_height),
    color_mode='grayscale',
    class_mode='categorical', # for classification task
    interpolation='bilinear')
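
Note that rescale=1/255.0 maps pixels to [0, 1], while the prepare_data function in the question maps them to [-0.5, 0.5]. One possible way to keep the original normalization, sketched here on the assumption that your version of ImageDataGenerator accepts a preprocessing_function argument, is to pass a small function instead of rescale. center_pixels is a hypothetical name.

```python
def center_pixels(x):
    """Map pixel values from [0, 255] to [-0.5, 0.5], as prepare_data does."""
    return x / 255.0 - 0.5


# Assumed usage (needs Keras, so it is left commented out here):
# train_datagen = ImageDataGenerator(horizontal_flip=True,
#                                    preprocessing_function=center_pixels)

# The function works on plain floats as well as on numpy arrays.
lowest, highest = center_pixels(0.0), center_pixels(255.0)
```

When using this, leave rescale unset, otherwise the images would be scaled twice.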
