
How to split a big HDF5 file into multiple small HDF5 datasets

I have a big HDF5 file containing images and their corresponding ground-truth density maps. I want to feed them into the CRSNet network, which requires the images as separate files. How can I achieve that? Thank you very much.

-- Basic info: I have an HDF5 file with two keys, "images" and "density_maps". Their shapes are (300, 380, 676, 1): 300 is the number of images, and 380 and 676 are the height and width respectively.

-- What I need to feed into the CRSNet network are the images (jpg) together with their corresponding HDF5 files. Their shape would be (572, 945).

Thanks a lot for any comments and discussion!

For starters, a quick clarification on h5py and HDF5. h5py is a Python package to read HDF5 files. You can also read HDF5 files with the PyTables package (and with other languages: C, C++, FORTRAN).

I'm not entirely sure what you mean by "the images (jpg) with their corresponding h5py (HDF5) files". As I understand it, all of your data is in 1 HDF5 file. Also, I don't understand what you mean by "The shape of them would be (572, 945)." That is different from the image data's shape, right? Please update your post to clarify these items.
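While you sort that out, a quick way to see exactly what is in the file is to open it and list each dataset's shape and dtype. This is a minimal sketch; the filename 'yourfile.h5' is an assumption, and a tiny synthetic file is created here only so the snippet runs on its own:

```python
import h5py
import numpy as np

# Build a tiny synthetic file so the snippet is self-contained;
# replace this with your real 'yourfile.h5'.
with h5py.File('yourfile.h5', 'w') as h5f:
    h5f.create_dataset('images', data=np.zeros((3, 380, 676, 1), dtype=np.uint8))
    h5f.create_dataset('density_maps', data=np.zeros((3, 380, 676, 1), dtype=np.float32))

# Record and print each top-level dataset's shape
with h5py.File('yourfile.h5', 'r') as h5f:
    info = {name: dset.shape for name, dset in h5f.items()}
    for name, shape in info.items():
        print(name, shape)
```

Running this against your real file will confirm the keys and the (300, 380, 676, 1) shapes you described.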

It's relatively easy to extract data from a dataset. This is how you can get the "images" as NumPy arrays and use cv2 to write them as individual jpg files. See the code below:

import h5py
import cv2

with h5py.File('yourfile.h5', 'r') as h5f:
    for i in range(h5f['images'].shape[0]):
        image_arr = h5f['images'][i,:]   # slice notation gets [i,:,:,:]
        cv2.imwrite(f'test_img_{i:03}.jpg', image_arr)
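Since the question title asks for multiple small HDF5 files, the same loop can instead write each image/density-map pair to its own small .h5 file, with no cv2 needed. This is a sketch under the assumptions already stated (keys "images" and "density_maps", source file 'yourfile.h5'); the output filenames and keys are illustrative, and a small synthetic source file is created here only so the example runs on its own:

```python
import h5py
import numpy as np

# Synthetic source file for demonstration; use your real 'yourfile.h5'.
with h5py.File('yourfile.h5', 'w') as h5f:
    h5f.create_dataset('images', data=np.zeros((3, 380, 676, 1), dtype=np.uint8))
    h5f.create_dataset('density_maps', data=np.zeros((3, 380, 676, 1), dtype=np.float32))

# Split: one small HDF5 file per image/density-map pair
with h5py.File('yourfile.h5', 'r') as h5f:
    for i in range(h5f['images'].shape[0]):
        with h5py.File(f'sample_{i:03}.h5', 'w') as out:
            out.create_dataset('image', data=h5f['images'][i])
            out.create_dataset('density_map', data=h5f['density_maps'][i])

# Read one back to confirm the per-sample shape
with h5py.File('sample_000.h5', 'r') as f:
    sample_shape = f['image'].shape
print(sample_shape)
```

Each sample_NNN.h5 then holds one (380, 676, 1) image and its matching density map, which is the "separate files" layout the question describes.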

Before you start coding, are you sure you need the images as individual image files, or as individual image data (usually NumPy arrays)? I ask because the first step in most CNN pipelines is reading the images and converting them to arrays for downstream processing. You already have the arrays in the HDF5 file. All you may need to do is read each array and save it to the appropriate data structure for CRSNet to process. For example, here is the code to create a list of arrays (used by TensorFlow and Keras):

import h5py

image_list = []
with h5py.File('yourfile.h5', 'r') as h5f:
    for i in range(h5f['images'].shape[0]):
        image_list.append( h5f['images'][i,:] )  # gets slice [i,:,:,:]
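If the framework wants a single batch array rather than a Python list, the list can be stacked with NumPy. A minimal self-contained sketch (again building a small synthetic 'yourfile.h5' purely for demonstration):

```python
import h5py
import numpy as np

# Synthetic source file for demonstration; use your real 'yourfile.h5'.
with h5py.File('yourfile.h5', 'w') as h5f:
    h5f.create_dataset('images', data=np.zeros((3, 380, 676, 1), dtype=np.uint8))

image_list = []
with h5py.File('yourfile.h5', 'r') as h5f:
    for i in range(h5f['images'].shape[0]):
        image_list.append(h5f['images'][i, :])

# Stack into one (num_images, height, width, channels) array,
# the layout Keras/TensorFlow models typically consume.
batch = np.stack(image_list)
print(batch.shape)
```

With your real data this yields a (300, 380, 676, 1) array ready to pass to the model.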
        
