
Multi-processing on Large Image Dataset in Python

I have a very large image dataset (>50 GB, single images in a folder) for training. To make loading more efficient, I first load part of the images into RAM and then send small batches to the GPU for training.

I want to further speed up the data preparation process before feeding the images to the GPU, and I was thinking about multi-processing. But I'm not sure how I should do it. Any ideas?
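One common pattern is to decode images in a pool of worker processes while the main process trains. A minimal sketch using the standard library's `concurrent.futures`; the `load_image` body is a placeholder (real code would decode and preprocess with e.g. PIL), and the function names are mine, not from any particular framework:

```python
from concurrent.futures import ProcessPoolExecutor

def load_image(path):
    # Placeholder loader: real code would decode and preprocess the image,
    # e.g. PIL.Image.open(path). Here we just read the raw bytes.
    with open(path, "rb") as f:
        return f.read()

def load_batch(paths, workers=4):
    # Decode a batch of images in parallel worker processes,
    # preserving the input order of `paths`.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_image, paths))
```

Because each worker process decodes independently, the CPU-bound loading no longer blocks the training loop; keeping a queue of pre-loaded batches between this loader and the GPU step keeps the GPU fed.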

For speed I would advise using HDF5 or LMDB:

I have successfully used ml-pyxis for creating deep learning datasets using LMDBs.

It lets you create binary blobs (LMDB), and they can be read quite fast. The link above comes with some simple examples of how to create and read the data, including Python generators/iterators.

For multi-processing:

I personally work with Keras, and by using a Python generator it is possible to train with multi-processing of the data via the fit_generator method.

fit_generator(self, generator, samples_per_epoch,
              nb_epoch, verbose=1, callbacks=[],
              validation_data=None, nb_val_samples=None,
              class_weight={}, max_q_size=10, nb_worker=1,
              pickle_safe=False)

Fits the model on data generated batch-by-batch by a Python generator. The generator runs in parallel to the model for efficiency. For instance, this allows you to do real-time data augmentation on images on the CPU in parallel with training your model on the GPU. You can find the source code here, and the documentation here.
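A minimal generator of the shape fit_generator expects might look like this (the zero-filled 64x64 arrays stand in for real decoded images, and the fit_generator call is shown only as a hypothetical usage comment against the old Keras 1.x signature above):

```python
import numpy as np

def batch_generator(image_paths, labels, batch_size=32):
    # Yields (X, y) batches forever, reshuffling each epoch,
    # as fit_generator expects.
    n = len(image_paths)
    while True:
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Placeholder decode step: real code would load image_paths[i].
            X = np.zeros((len(idx), 64, 64, 3), dtype=np.float32)
            y = np.asarray([labels[i] for i in idx])
            yield X, y

# Hypothetical usage with the signature above (nb_worker/pickle_safe
# enable multi-process data loading in old Keras versions):
# model.fit_generator(batch_generator(paths, labels, 32),
#                     samples_per_epoch=len(paths), nb_epoch=10,
#                     nb_worker=4, pickle_safe=True)
```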

I don't know whether you prefer TensorFlow/Keras/Torch/Caffe.

Multiprocessing here simply means using multiple GPUs.

Basically, you are trying to leverage more hardware by delegating or spawning one child process for every GPU and letting them do their magic. The example above is for logistic regression.

Of course, you would be more keen on looking into ConvNets. This LSU material (pages 48-52, slides 11-14) builds some intuition.

Keras has yet to officially provide support, but you can "proceed at your own risk".

For multiprocessing, TensorFlow is a better way to go about this (in my opinion); in fact, they have some good documentation on it too.

