
Reading nested .h5 group into numpy array

I received this .h5 file from a friend and I need to use the data in it for some work. All the data is numerical. This is the first time I have worked with this kind of file. I found many questions and answers here about reading these files, but I couldn't find a way to get to the lower levels of the groups or folders the file contains. The file contains two main folders, i.e. X and Y. X contains a folder named 0, which contains two folders named A and B. Y contains ten folders named 1-10. The data I want to read is in A, B, 1, 2, ..., 10. For instance, I start with

import h5py

f = h5py.File(filename, 'r')
f.keys()

Now f.keys() returns [u'X', u'Y'], the two main folders.

Then I try to read X and Y using read_direct, but I get the error

AttributeError: 'Group' object has no attribute 'read_direct'

I try to create an object for X and Y as follows

obj1 = f['X']

obj2 = f['Y']

Then if I use commands like

obj1.shape
obj1.dtype 

I get an error

AttributeError: 'Group' object has no attribute 'shape'

I can see that these commands don't work because I use them on X and Y, which are folders that contain no data, only other folders.

So my question is how to get down to the folders named A, B, and 1-10 to read the data.

I couldn't find a way to do that even in the documentation: http://docs.h5py.org/en/latest/quick.html

You need to traverse down your HDF5 hierarchy until you reach a dataset. Groups do not have a shape or type; datasets do.
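If the hierarchy is already known, as it is in the question, a dataset can also be reached directly by indexing with a slash-separated path. The sketch below is only an illustration: it builds a small in-memory file that mimics the described layout (the file name demo_nested.h5, the helper build_demo_file and all array contents are made up), and it assumes that A, B and 1-10 are datasets rather than further groups.

```python
import h5py
import numpy as np


def build_demo_file():
    """Create an in-memory HDF5 file with the layout from the question (made-up data)."""
    # driver='core' with backing_store=False keeps the file in RAM only
    f = h5py.File("demo_nested.h5", "w", driver="core", backing_store=False)
    # intermediate groups X and X/0 are created automatically
    f.create_dataset("X/0/A", data=np.arange(6).reshape(2, 3))
    f.create_dataset("X/0/B", data=np.ones(4))
    for i in range(1, 11):
        f.create_dataset(f"Y/{i}", data=np.full(3, i))
    return f


f = build_demo_file()

dset = f["X/0/A"]           # full path in one step, gives an h5py.Dataset
a = dset[:]                 # slicing copies the data into a numpy array
b = f["X"]["0"]["B"][:]     # equivalent: step through the groups one by one

# read every dataset stored directly under Y into a dict of numpy arrays
y = {name: d[:] for name, d in f["Y"].items()}

f.close()
```

With a real file, the call to build_demo_file would be replaced by h5py.File(filename, 'r'), and the paths would have to match whatever f.keys() and f['X'].keys() actually report.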

Assuming you do not know your hierarchy structure in advance, you can use a recursive algorithm to yield, via an iterator, full paths to all available datasets in the form group1/group2/.../dataset. Below is an example.

import h5py

def traverse_datasets(hdf_file):

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = f'{prefix}/{key}'
            if isinstance(item, h5py.Dataset): # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group): # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    for path, _ in h5py_dataset_iterator(hdf_file):
        yield path

You can, for example, iterate over all dataset paths and output the attributes that interest you:

with h5py.File(filename, 'r') as f:
    for dset in traverse_datasets(f):
        print('Path:', dset)
        print('Shape:', f[dset].shape)
        print('Data type:', f[dset].dtype)
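As an alternative to a hand-written recursion, h5py groups also provide a visititems method that walks the whole hierarchy and calls a function with each member's relative path and object. The following is a minimal sketch of that approach; the function name list_datasets, the file name demo_visit.h5 and the sample data are invented for the example.

```python
import h5py
import numpy as np


def list_datasets(h5file):
    """Return {path: (shape, dtype)} for every dataset below h5file."""
    found = {}

    def collect(name, obj):
        # name is the path relative to h5file, without a leading slash
        if isinstance(obj, h5py.Dataset):
            found[name] = (obj.shape, obj.dtype)
        # returning None tells visititems to keep walking

    h5file.visititems(collect)
    return found


# small in-memory file with made-up contents, just to exercise the function
with h5py.File("demo_visit.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("X/0/A", data=np.zeros((2, 3)))
    f.create_dataset("Y/1", data=np.arange(5))
    summary = list_datasets(f)

for path, (shape, dtype) in summary.items():
    print(path, shape, dtype)
```

The callback sees groups as well as datasets, so the isinstance check is what filters the result down to datasets only.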

Remember that, by default, arrays in HDF5 are not read entirely into memory. You can read a dataset into memory via arr = f[dset][:], where dset is the full path.
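To make the in-memory step concrete, here is a hedged sketch of three ways to pull data out of a dataset: a full read, a partial read, and read_direct, which exists on datasets (not on groups, hence the error in the question). The file name demo_read.h5 and the array contents are assumptions made for the example.

```python
import h5py
import numpy as np

# in-memory file with made-up data standing in for the real one
with h5py.File("demo_read.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("X/0/A", data=np.arange(12, dtype="f8").reshape(3, 4))
    dset = f["X/0/A"]

    full = dset[()]          # whole dataset as a numpy array (also works for scalars)
    first_rows = dset[:2]    # only the first two rows are read

    # read_direct fills an existing array instead of allocating a new one
    buf = np.empty(dset.shape, dtype=dset.dtype)
    dset.read_direct(buf)
```

The arrays full, first_rows and buf are ordinary numpy arrays and remain usable after the file is closed, whereas dset itself does not.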
