
Efficient way to make h5py file with memory constraint

Let's say I have an image directory like below:

root
|___dog
|    |___img1.jpg
|    |___img2.jpg
|    |___...
|    
|___cat
|___...

I want to convert these image files into an h5py file.

First, I tried to read all the image files into memory and then write them to an h5 file:

import os
import numpy as np
import h5py
import PIL.Image as Image

data_path = 'data.h5'    # placeholder output path
data_x, data_y = [], []  # all images and labels accumulated in memory

label_list = os.listdir('root')
for i, label in enumerate(label_list):
    files = os.listdir(os.path.join('root', label))
    for filename in files:
        img = Image.open(os.path.join('root', label, filename))
        ow, oh = 128, 128
        img = img.resize((ow, oh), Image.BILINEAR)
        data_x.append(np.array(img))
        data_y.append(i)

datafile = h5py.File(data_path, 'w')
datafile.create_dataset("data_image", dtype='uint8', data=data_x)
datafile.create_dataset("data_label", dtype='int64', data=data_y)
datafile.close()

But I can't do it this way because of memory constraints (each folder has more than 200,000 images of size 224x224, so accumulating them all in memory takes tens of gigabytes).

So, what is the best way to write these images to an h5 file?

The HDF5/h5py dataset objects have a much smaller memory footprint than the same size NumPy array. (That's one advantage to using HDF5.) You can create the HDF5 file and allocate the datasets BEFORE you start looping on the image files. Then you can operate on the images one at a time (read, resize, and write image 0, then image 1, etc.).

The code below creates the necessary datasets presized for 200,000 images, and the logic is rearranged to work as described: the img_cnt variable positions each new image in the existing datasets. (Note: I think this works as written. However, without the data I couldn't test it, so it may need minor tweaking.) If you want to adjust the dataset sizes in the future, you can add the maxshape=() parameter to the create_dataset() function, as sketched after the code below.

import os
import numpy as np
import h5py
import PIL.Image as Image

data_path = 'data.h5'  # placeholder output path

# Open HDF5 and create the datasets in advance, sized for 200,000 images.
# The dataset shape must match the resized images: 128x128 with 3 color channels.
datafile = h5py.File(data_path, 'w')
datafile.create_dataset("data_image", (200000, 128, 128, 3), dtype='uint8')
datafile.create_dataset("data_label", (200000,), dtype='int64')

label_list = os.listdir('root')
img_cnt = 0  # position of the next image in the datasets
for i, label in enumerate(label_list):
    files = os.listdir(os.path.join('root', label))
    for filename in files:
        # convert('RGB') guarantees 3 channels even for grayscale inputs
        img = Image.open(os.path.join('root', label, filename)).convert('RGB')
        ow, oh = 128, 128
        img = img.resize((ow, oh), Image.BILINEAR)
        # Write this image directly into the preallocated dataset
        datafile["data_image"][img_cnt] = np.array(img)
        datafile["data_label"][img_cnt] = i
        img_cnt += 1

datafile.close()
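
As noted above, if the datasets should be resizable (for example, when the total image count isn't known in advance), you can pass maxshape= to create_dataset() and grow the datasets as you go. Here is a minimal sketch, assuming the same dataset names and 128x128 RGB images as above; 'data.h5' and img_arr are placeholders:

import h5py
import numpy as np

with h5py.File('data.h5', 'w') as f:
    # maxshape=(None, ...) makes the first axis unlimited; h5py enables
    # chunked storage automatically when maxshape is given.
    d_img = f.create_dataset("data_image", (0, 128, 128, 3), dtype='uint8',
                             maxshape=(None, 128, 128, 3))
    d_lbl = f.create_dataset("data_label", (0,), dtype='int64',
                             maxshape=(None,))

    # Grow by one row per image, then write into the new last row.
    img_arr = np.zeros((128, 128, 3), dtype='uint8')  # stand-in for a real image
    d_img.resize(d_img.shape[0] + 1, axis=0)
    d_lbl.resize(d_lbl.shape[0] + 1, axis=0)
    d_img[-1] = img_arr
    d_lbl[-1] = 0

Resizing once per image works but adds overhead; growing in larger blocks (e.g. a few thousand rows at a time) and trimming to the final count at the end is usually faster.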
