Efficiently Create HDF5 Image Dataset for Neural Network Training with Memory Limitations
I have large image datasets to train CNNs on. Since I cannot load all the images into RAM, I plan to dump them into an HDF5 file (with h5py) and then iterate over the set batchwise, as suggested in
Most efficient way to use a large data set for PyTorch?
I tried creating a separate dataset for every picture, all located in the same group, which is very fast. But I could not figure out how to iterate over all the datasets in the group, other than accessing each one by its name.
As an alternative, I tried putting all the images iteratively into one dataset by extending its shape, following
How to append data to one specific dataset in a hdf5 file with h5py and
incremental writes to hdf5 with h5py
but this is very slow. Is there a faster way to create an HDF5 dataset to iterate over?
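For reference, the append-by-resize approach described above looks roughly like this (a minimal sketch; the file name, image shape, and random placeholder data are assumptions, not taken from the question):

```python
import h5py
import numpy as np

IMAGE_SHAPE = (64, 64, 3)  # placeholder size; real images must share one shape

# Append-by-resize: a resizable dataset extended one image at a time.
# This works, but is slow for large collections because every image
# triggers a resize of the dataset.
with h5py.File("images.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(0,) + IMAGE_SHAPE,
        maxshape=(None,) + IMAGE_SHAPE,  # unlimited along the first axis
        dtype="uint8",
    )
    for _ in range(10):  # stands in for a loop over image files on disk
        img = np.random.randint(0, 256, size=IMAGE_SHAPE, dtype=np.uint8)
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = img
```

Note that passing maxshape automatically enables chunked storage, which is what makes the dataset resizable in the first place.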
I realize this is an old question, but I found a very helpful resource on this subject that I wanted to share:
https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html
Basically, HDF5 (with chunking enabled) is like a little filesystem: it stores data in chunks scattered throughout the file. So, like a filesystem, it benefits from locality. If the chunks are the same shape as the array sections you're trying to access, reading and writing will be fast. If the data you're looking for is scattered across multiple chunks, access will be slow.
So in the case of training a NN on images, you're probably going to have to make the images a standard size. Set
chunks=(1,) + image_shape
, or even better, chunks=(batch_size,) + image_shape
when creating the dataset, and reading/writing will be a lot faster.
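Putting that advice together, a minimal sketch (file name, batch size, image shape, and random placeholder data are all assumptions) of creating a batch-aligned dataset and iterating over it batchwise:

```python
import h5py
import numpy as np

BATCH_SIZE = 4            # placeholder values; tune to your training setup
IMAGE_SHAPE = (64, 64, 3)
N_IMAGES = 20

# Preallocate one fixed-size dataset with chunks aligned to the batch shape,
# so every batch-sized read or write touches exactly one chunk.
with h5py.File("train.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(N_IMAGES,) + IMAGE_SHAPE,
        chunks=(BATCH_SIZE,) + IMAGE_SHAPE,
        dtype="uint8",
    )
    for start in range(0, N_IMAGES, BATCH_SIZE):
        # Write a whole batch at once instead of one image at a time.
        batch = np.random.randint(
            0, 256, size=(BATCH_SIZE,) + IMAGE_SHAPE, dtype=np.uint8
        )
        dset[start:start + BATCH_SIZE] = batch

# Training loop: iterate over the set batchwise; each slice maps to one chunk.
with h5py.File("train.h5", "r") as f:
    dset = f["images"]
    n_batches = 0
    for start in range(0, dset.shape[0], BATCH_SIZE):
        batch = dset[start:start + BATCH_SIZE]
        n_batches += 1
```

Preallocating the full shape up front also avoids the repeated resizes that made the append-one-image-at-a-time approach slow.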