
hdf to ndarray in numpy - fast way

I am looking for a fast way to get my collection of hdf files into a numpy array where each row is a flattened version of an image. What I mean exactly:

My hdf files store, among other information, images per frame. Each file holds 51 frames of 512x424 images. Now I have 300+ hdf files, and I want the image pixels to be stored as one single vector per frame, where all frames of all files are stored in one numpy ndarray. The following picture should help to understand:

[Image: visualization of converting many hdf files into one numpy array]

What I have so far is a very slow method, and I actually have no idea how I can make it faster. The problem, as far as I can tell, is that my final array is appended to too often: I observe that the first files are loaded into the array very fast, but the speed then drops quickly (observed by printing the number of the current hdf file).

My current code:

import os
import glob
import h5py
import numpy as np

os.chdir(os.getcwd() + "\\datasets")

# predefine first row to use vstack later
numpy_data = np.ndarray((1, 217088))

# search for all .hdf5 files
for idx, file in enumerate(glob.glob("*.hdf5")):
    f = h5py.File(file, 'r')
    # load all img data to imgs (=ndarray, but not flattened)
    imgs = f['img']['data'][:]

    # iterate over all frames (51 per file)
    for frame in range(0, imgs.shape[0]):
        print("processing {}/{} (file/frame)".format(idx + 1, frame + 1))
        data = np.array(imgs[frame].flatten())
        numpy_data = np.vstack((numpy_data, data))

        # drop the predefined dummy row once the first real row is stored
        if idx == 0 and frame == 0:
            numpy_data = np.delete(numpy_data, 0, 0)

    f.close()

For further information: I need this for learning a decision tree. Since my hdf files together are bigger than my RAM, I think converting them into a numpy array saves memory and is therefore better suited.

Thanks for any input.

I don't think you need to iterate over

imgs = f['img']['data'][:]

and reshape each 2d array. Just reshape the whole thing. If I understand your description right, imgs is a 3d array: (51, 512, 424)

imgs.reshape(51, 512*424)

should be the 2d equivalent.

If you must loop, don't use vstack (or some variant) to build a bigger array incrementally. For one, it is slow, because it copies the whole array on every call, so the cost grows quadratically; for another, it's a pain to clean up the initial 'dummy' entry. Use list appends, and do the stacking once, at the end:

alist = []
for frame in range(imgs.shape[0]):
    alist.append(imgs[frame].flatten())
data_array = np.vstack(alist)

vstack (and family) takes a list of arrays as input, so it can stack many at once. Appending to a list inside the loop is much faster than growing an array iteratively.
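Put together, a minimal sketch of the whole loop rewritten this way might look like the following (assuming the same 'img'/'data' dataset layout as in the question):

import glob
import h5py
import numpy as np

alist = []
for fname in glob.glob("*.hdf5"):
    with h5py.File(fname, 'r') as f:
        imgs = f['img']['data'][:]            # (51, 512, 424) per file
    # reshape once per file instead of flattening frame by frame
    alist.append(imgs.reshape(imgs.shape[0], -1))

# one concatenation at the end instead of repeated vstack calls
numpy_data = np.vstack(alist)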

I question whether putting everything into one array will help. I don't know exactly how the size of an hdf5 file relates to the size of the loaded array, but I expect they are in the same order of magnitude. So trying to load all 300 files into memory might not work. That's what, 3G of pixels? (300 × 51 × 512 × 424 ≈ 3.3 × 10⁹ values, i.e. roughly 3.3 GB as uint8 and about 13 GB as float32.)

For an individual file, h5py has provisions for loading chunks of an array that is too large to fit in memory. That indicates that the problem often goes the other way: the file holds more than fits.

Is it possible to load large data directly into numpy int8 array using h5py?
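As a small illustration of that point, slicing an h5py dataset reads only the selected part from disk, so the whole array never has to be resident in memory (the file name here is just a placeholder):

import h5py

with h5py.File('example.hdf5', 'r') as f:     # placeholder file name
    dset = f['img']['data']                   # dataset path as in the question
    first_ten = dset[:10]                     # only frames 0..9 are read from disk
    one_frame = dset[25].reshape(-1)          # a single flattened frame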

Do you really want to load all images into RAM, rather than using a single HDF5 file instead? Accessing an HDF5 file can be quite fast if you don't make any mistakes (unnecessary fancy indexing, improper chunk-cache-size). If you want the numpy way, this would be a possibility:

os.chdir(os.getcwd() + "\\datasets")
img_per_file = 51

# get all HDF5 files
files = []
for idx, file in enumerate(glob.glob("*.hdf5")):
    files.append(file)

# allocate memory for your final array (change the datatype if your images have some other type)
numpy_data = np.empty((len(files) * img_per_file, 217088), dtype=np.uint8)

# now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    numpy_data[ii:ii + img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

Writing your data to a single HDF5 file would be quite similar:

f_out = h5py.File(File_Name_HDF5_out, 'w')
# create the dataset (change the datatype if your images have some other type)
dset_out = f_out.create_dataset(Dataset_Name_out, (len(files) * img_per_file, 217088),
                                chunks=(1, 217088), dtype='uint8')

# now read all the data
ii = 0
for i in range(0, len(files)):
    f = h5py.File(files[i], 'r')
    imgs = f['img']['data'][:]
    f.close()
    dset_out[ii:ii + img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii = ii + img_per_file

f_out.close()

If you only want to access whole images afterwards, this chunk size should be okay. If not, you have to change it to fit your needs.
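For illustration, a minimal sketch of such whole-image reads, assuming the combined file was written as above (the file name 'combined.hdf5' and dataset name 'imgs' are placeholders):

import h5py

# with chunks of one full image per row, reading a row touches exactly one chunk
with h5py.File('combined.hdf5', 'r') as f_in:      # placeholder file name
    dset = f_in['imgs']                            # placeholder dataset name
    img_vec = dset[7, :]        # one flattened 512x424 image
    batch = dset[0:51, :]       # all 51 frames of the first file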

What you should do when accessing an HDF5 file:

  • Use a chunk size which fits your needs.

  • Set a proper chunk-cache-size. This can be done with the h5py low-level api or h5py_cache (https://pypi.python.org/pypi/h5py-cache/1.0); a sketch of this follows after the list.

  • Avoid any kind of fancy indexing. If your dataset has n dimensions, access it in a way that the returned array also has n dimensions:

    # chunk size is [50, 50] and we iterate over the first dimension
    numpyArray = h5_dset[i, :]                  # slow
    numpyArray = np.squeeze(h5_dset[i:i+1, :])  # does the same but is much faster
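Regarding the chunk-cache-size bullet above, here is only a sketch of how the cache could be enlarged through the h5py low-level API (the file name is a placeholder; recent h5py versions, 2.9+, also accept an rdcc_nbytes argument to h5py.File directly):

import h5py

# open a file with a ~500 MB raw-data chunk cache via the low-level API
propfaid = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
settings = list(propfaid.get_cache())
settings[2] = 500 * 1024**2                 # rdcc_nbytes: chunk cache size in bytes
propfaid.set_cache(*settings)
fid = h5py.h5f.open(b'combined.hdf5', h5py.h5f.ACC_RDONLY, fapl=propfaid)  # placeholder name
f = h5py.File(fid)
dset = f['img']['data']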

EDIT: This shows how to read your data into a memmapped numpy array. I think your method expects data in np.float32 format. https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html#numpy.memmap

numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+',
                       shape=(len(files) * img_per_file, 217088))

Everything else could be kept the same. If it works, I would also recommend using an SSD instead of a hard disk.
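As a sketch of what "everything else" means here, the filling loop stays the same and only the target changes (the .npy file name is just a placeholder):

import glob
import h5py
import numpy as np

files = glob.glob("*.hdf5")       # same file list as above
img_per_file = 51

numpy_data = np.memmap('Your_Data.npy', dtype=np.float32, mode='w+',
                       shape=(len(files) * img_per_file, 217088))

ii = 0
for fname in files:
    with h5py.File(fname, 'r') as f:
        imgs = f['img']['data'][:]
    # the assignment casts the image data to float32 on the fly
    numpy_data[ii:ii + img_per_file, :] = imgs.reshape((img_per_file, 217088))
    ii += img_per_file

numpy_data.flush()                # make sure everything is written to disk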
