
Efficiently Create HDF5 Image Dataset for Neural Network Training with Memory Limitations

I have large image datasets to train CNNs on. Since I cannot load all the images into RAM, I plan to dump them into an HDF5 file (with h5py) and then iterate over the set batchwise, as suggested in

Most efficient way to use a large data set for PyTorch?

I tried creating a separate dataset for every picture, all located in the same group, which is very fast. But I could not figure out how to iterate over all the datasets in the group, other than accessing each one by its name. As an alternative, I tried putting all the images iteratively into one dataset by extending its shape, according to

How to append data to one specific dataset in a hdf5 file with h5py and

incremental writes to hdf5 with h5py

but this is very slow. Is there a faster way to create an HDF5 dataset to iterate over? A simplified sketch of the incremental approach I tried is shown below.
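(The image shape, file name, and dummy data below are placeholders for illustration; the relevant part is the per-image resize/write pattern.)

```python
import numpy as np
import h5py

image_shape = (256, 256, 3)  # placeholder: all images resized to a fixed shape
num_images = 100

with h5py.File("images.h5", "w") as f:
    # Resizable dataset: start empty and grow along the first axis.
    dset = f.create_dataset(
        "images",
        shape=(0,) + image_shape,
        maxshape=(None,) + image_shape,
        dtype="uint8",
    )
    for _ in range(num_images):
        img = np.random.randint(0, 256, size=image_shape, dtype=np.uint8)
        # Grow by one image and write it -- this one-image-at-a-time
        # resize/write loop is what turns out to be very slow.
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = img
```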

I realize this is an old question, but I found a very helpful resource on this subject that I wanted to share:

https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html

Basically, an hdf5 file (with chunks enabled) is like a little filesystem. It stores data in chunks scattered throughout the file. So, like a filesystem, it benefits from locality: if the chunks are the same shape as the array sections you're trying to access, reading/writing will be fast; if the data you're looking for is scattered across multiple chunks, access will be slow.

So in the case of training an NN on images, you're probably going to have to make the images a standard size. Set chunks=(1,) + image_shape, or even better, chunks=(batch_size,) + image_shape when creating the dataset, and reading/writing will be a lot faster.
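For example, a minimal sketch of creating and reading such a chunked dataset (the file/dataset names, image shape, batch size, and dummy data here are just assumptions for illustration):

```python
import numpy as np
import h5py

image_shape = (256, 256, 3)  # placeholder fixed image size
batch_size = 32
num_images = 1000

with h5py.File("train.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(num_images,) + image_shape,
        dtype="uint8",
        # Chunk layout aligned with the access pattern: one batch per chunk.
        chunks=(batch_size,) + image_shape,
    )
    # Write whole batches at once instead of single images.
    for start in range(0, num_images, batch_size):
        stop = min(start + batch_size, num_images)
        dset[start:stop] = np.random.randint(
            0, 256, size=(stop - start,) + image_shape, dtype=np.uint8
        )

# Reading back batch by batch lines up with the chunk boundaries,
# so each batch read touches as few chunks as possible.
with h5py.File("train.h5", "r") as f:
    dset = f["images"]
    for start in range(0, dset.shape[0], batch_size):
        batch = dset[start:start + batch_size]
        # ... feed `batch` to the training loop
```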
