Efficiently Create HDF5 Image Dataset for Neural Network Training with Memory Limitations
I have large image datasets to train CNNs on. Since I cannot load all the images into RAM, I plan to dump them into an HDF5 file (with h5py) and then iterate over the set batchwise, as suggested in
Most efficient way to use a large data set for PyTorch?
I tried creating a separate dataset for every picture, all located in the same group, which is very fast. But I could not figure out how to iterate over all the datasets in the group, other than accessing each one by its name.
As an alternative, I tried putting all the images iteratively into one dataset by extending its shape, following
How to append data to one specific dataset in a hdf5 file with h5py and
incremental writes to hdf5 with h5py
but this is very slow. Is there a faster way to create an HDF5 dataset to iterate over?
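For reference, the append-by-resize approach described above looks roughly like this (a minimal sketch; the file name, image shape, and random placeholder data are assumptions, not taken from the question):

```python
import h5py
import numpy as np

IMAGE_SHAPE = (64, 64, 3)  # placeholder size; real images must share one shape

# Append-by-resize: a resizable dataset extended one image at a time.
# This works, but is slow for large collections because every image
# triggers a resize of the dataset.
with h5py.File("images.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(0,) + IMAGE_SHAPE,
        maxshape=(None,) + IMAGE_SHAPE,  # unlimited along the first axis
        dtype="uint8",
    )
    for _ in range(10):  # stands in for a loop over image files on disk
        img = np.random.randint(0, 256, size=IMAGE_SHAPE, dtype=np.uint8)
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1] = img
```

Note that passing maxshape automatically enables chunked storage, which is what makes the dataset resizable in the first place.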
I realize this is an old question, but I found a very helpful resource on this subject that I wanted to share:
https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/ch04.html
Basically, HDF5 (with chunking enabled) is like a little filesystem: it stores data in chunks scattered throughout the file. So, like a filesystem, it benefits from locality. If the chunks are the same shape as the array sections you're trying to access, reading and writing will be fast. If the data you're looking for is scattered across multiple chunks, access will be slow.
So in the case of training a NN on images, you're probably going to have to make the images a standard size. Set
chunks=(1,) + image_shape
, or even better, chunks=(batch_size,) + image_shape
when creating the dataset, and reading/writing will be a lot faster.
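Putting that advice together, a minimal sketch (file name, batch size, image shape, and random placeholder data are all assumptions) of creating a batch-aligned dataset and iterating over it batchwise:

```python
import h5py
import numpy as np

BATCH_SIZE = 4            # placeholder values; tune to your training setup
IMAGE_SHAPE = (64, 64, 3)
N_IMAGES = 20

# Preallocate one fixed-size dataset with chunks aligned to the batch shape,
# so every batch-sized read or write touches exactly one chunk.
with h5py.File("train.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(N_IMAGES,) + IMAGE_SHAPE,
        chunks=(BATCH_SIZE,) + IMAGE_SHAPE,
        dtype="uint8",
    )
    for start in range(0, N_IMAGES, BATCH_SIZE):
        # Write a whole batch at once instead of one image at a time.
        batch = np.random.randint(
            0, 256, size=(BATCH_SIZE,) + IMAGE_SHAPE, dtype=np.uint8
        )
        dset[start:start + BATCH_SIZE] = batch

# Training loop: iterate over the set batchwise; each slice maps to one chunk.
with h5py.File("train.h5", "r") as f:
    dset = f["images"]
    n_batches = 0
    for start in range(0, dset.shape[0], BATCH_SIZE):
        batch = dset[start:start + BATCH_SIZE]
        n_batches += 1
```

Preallocating the full shape up front also avoids the repeated resizes that made the append-one-image-at-a-time approach slow.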