
What is the most efficient way to find all paths to particular values in an HDF5 file with Python?

I am looking for negative values in a .hdf5 file that has the following architecture:

- Incidence_0
   - Wavelength_0
      - (Table of size m * n)
   - Wavelength_1
      - (Table of size m * n)
   ...
- Incidence_1
   ...
...

My objective is to find every negative value, and to get back its exact position in the file (i.e., the incidence number, the wavelength number, and its position in the associated table).

I am sorry that I cannot give a minimal reproducible example because I cannot share the file that I'm using, but here is the idea.

import h5py

file = h5py.File('testFile.hdf5', 'r')

result = []

# nbIncidence, nbWavelength, nbTheta, nbPhi are assumed to be known beforehand
for incidence in range(nbIncidence):
    for wavelength in range(nbWavelength):
        for theta in range(nbTheta):
            for phi in range(nbPhi):
                value = file['Incidence_' + str(incidence)]['Wavelength_' + str(wavelength)][theta, phi]

                if (value < 0):
                    result.append([value, incidence, wavelength, theta, phi])

This works perfectly, but using four loops is time-consuming, especially if I have to work on huge files, which may well happen... I don't know the h5py library well enough, but I am pretty sure there is a way to do this much faster than that.

First, the bad news: h5py doesn't have a function to interrogate your data in the way you described. The good news: you can accomplish your task by extracting each Incidence/Wavelength dataset to a NumPy array, then combining 2 NumPy methods to operate on the extracted array. [Note: This assumes you have sufficient memory to load each dataset.]

Some observations on working with this data (to help you follow my example).

  1. The HDF5 file schema is self-describing, so you don't need to iterate over integer counters. Instead, you can get group and dataset names with the .keys() method. (Or, you can get (name, object) tuples with the .items() method.)
  2. I highly recommend using Python's file context manager to be sure you don't leave the file open when you exit.
  3. Read data from a dataset into a NumPy array with standard NumPy slicing notation. An empty tuple retrieves all data. Use the following with your schema: arr = file['Incidence_#']['Wavelength_#'][()]
  4. Create a new boolean array based on your specified criteria. Using arr < 0 will return True for all negative values (and False for the others).
  5. Use np.argwhere to find the indices of non-zero array elements. Use it on the boolean array (remembering that True is non-zero and False is zero).
  6. From there, you can loop over the indices to extract the data you want. :-)
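Steps 4 and 5 can be sketched on a small in-memory array (a stand-in for one Wavelength dataset) before applying them to the real file:

```python
import numpy as np

# A tiny stand-in for one Wavelength dataset.
arr = np.array([[ 1.0, -2.0],
                [-3.0,  4.0]])

mask = arr < 0               # boolean array: True where the value is negative
neg_idx = np.argwhere(mask)  # (row, col) index pairs of the True entries

print(neg_idx.tolist())      # [[0, 1], [1, 0]]
print(arr[0, 1], arr[1, 0])  # -2.0 -3.0
```

Each row of neg_idx is one (theta, phi) position, which is exactly what gets unpacked in the loop below.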

I created a simple example that mimics your schema to demonstrate the process. For completeness, that code is at the end. It's a small file that won't bury you in output.

The code below reads the data, finds negative values, and adds them to a list. It has several print statements so you can see how each step works. They aren't needed once you are confident in the procedure.

import h5py
import numpy as np

with h5py.File('testFile.hdf5', 'r') as h5fr:
    result = []
    
    for i_grp in h5fr.keys():
        for wave_ds in h5fr[i_grp].keys():
            wave_arr = h5fr[i_grp][wave_ds][()]
            neg_idx = np.argwhere(wave_arr < 0.0)
            wave_res = []
            for n in neg_idx:
                i, j = n[0], n[1]
                result.append([wave_arr[i,j], i_grp, wave_ds, i, j])
                wave_res.append([wave_arr[i,j], i_grp, wave_ds, i, j])
            print(f'\nResults for {i_grp}; {wave_ds}:')    
            print(wave_res)    
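As a variation on the loop above, the .items() method from observation 1 yields (name, object) pairs directly, which avoids the repeated h5fr[i_grp] lookups. A minimal self-contained sketch (it builds its own tiny example file first, so the group/dataset counts and sizes here are arbitrary):

```python
import h5py
import numpy as np

# Build a tiny example file so the sketch is self-contained.
rng = np.random.default_rng(0)
with h5py.File('testFile.hdf5', 'w') as h5fw:
    for i_cnt in range(2):
        grp = h5fw.create_group(f'Incidence_{i_cnt}')
        for w_cnt in range(3):
            grp.create_dataset(f'Wavelength_{w_cnt}',
                               data=rng.uniform(-1.0, 10.0, size=(4, 4)))

result = []
with h5py.File('testFile.hdf5', 'r') as h5fr:
    for grp_name, grp in h5fr.items():    # (name, group) pairs, e.g. 'Incidence_0'
        for ds_name, ds in grp.items():   # (name, dataset) pairs, e.g. 'Wavelength_0'
            wave_arr = ds[()]             # read the whole dataset as a NumPy array
            for i, j in np.argwhere(wave_arr < 0.0):
                result.append([wave_arr[i, j], grp_name, ds_name, i, j])

print(len(result), 'negative values found')
```

Either traversal produces the same result list; using .items() simply carries the group/dataset names along with the objects.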

Code to create the example file used above:

import h5py
import numpy as np

nbIncidence = 4
nbWavelength = 6
m, n = 10, 10  # m, n correspond to nbTheta, nbPhi

with h5py.File('testFile.hdf5', 'w') as h5fw:
    for i_cnt in range(nbIncidence):
        grp = h5fw.create_group('Incidence_' + str(i_cnt))
        for w_cnt in range(nbWavelength):
            arr = np.random.uniform(low=-1.0, high=10.0, size=(m, n))
            grp.create_dataset('Wavelength_' + str(w_cnt), data=arr)  
