
How to read a v7.3 mat file via h5py?

I have a struct array created by MATLAB and stored in a v7.3-format mat file:

struArray = struct('name', {'one', 'two', 'three'}, ...
                   'id', {1, 2, 3}, ...
                   'data', {[1:10], [3:9], [0]});
save('test.mat', 'struArray', '-v7.3')

Now I want to read this file from Python using h5py:

import h5py

data = h5py.File('test.mat', 'r')  # open read-only
struArray = data['/struArray']

I have no idea how to get the struct data one by one from struArray:

for index in range(<the size of struArray>):
    elem = <the index th struct in struArray>
    name = <the name of elem>
    id = <the id of elem>
    data = <the data of elem>

The MATLAB v7.3 file format is not especially easy to work with in h5py. It relies on HDF5 references; cf. the h5py documentation on references.

>>> import h5py
>>> f = h5py.File('test.mat', 'r')
>>> list(f.keys())
['#refs#', 'struArray']
>>> struArray = f['struArray']
>>> struArray['name'][0, 0]  # this is the HDF5 reference
<HDF5 object reference>
>>> f[struArray['name'][0, 0]][()]  # this is the actual data (.value was removed in h5py 3.0)
array([[111],
       [110],
       [101]], dtype=uint16)

To read struArray(i).id:

>>> f[struArray['id'][0, 0]][0, 0]
1.0
>>> f[struArray['id'][1, 0]][0, 0]
2.0
>>> f[struArray['id'][2, 0]][0, 0]
3.0

Notice that MATLAB stores a number as an array of size (1, 1), hence the final [0, 0] to get the number.

To read struArray(i).data:

>>> f[struArray['data'][0, 0]][()]
array([[  1.],
       [  2.],
       [  3.],
       [  4.],
       [  5.],
       [  6.],
       [  7.],
       [  8.],
       [  9.],
       [ 10.]])

To read struArray(i).name, it is necessary to convert the array of integers to a string:

>>> f[struArray['name'][0, 0]][()].tobytes()[::2].decode()
'one'
>>> f[struArray['name'][1, 0]][()].tobytes()[::2].decode()
'two'
>>> f[struArray['name'][2, 0]][()].tobytes()[::2].decode()
'three'
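
Putting the three field reads together, the loop from the question could be sketched roughly as follows. This is only a sketch: iter_struct_array and decode_matlab_chars are hypothetical helper names (not part of h5py), and the code assumes the layout shown in the transcripts above, i.e. every field of the struct array is a 2-D dataset holding one HDF5 object reference per element. The decoder treats the uint16 values as little-endian UTF-16 code units, which is an assumption that also covers the plain-ASCII names used here.

```python
import struct


def decode_matlab_chars(codes):
    """Decode MATLAB char data (uint16 code units) into a Python str."""
    # NumPy arrays are flattened first; plain sequences of ints are used as-is.
    flat = codes.ravel().tolist() if hasattr(codes, "ravel") else list(codes)
    # Assumption: the code units are UTF-16, packed here as little-endian.
    return struct.pack("<%dH" % len(flat), *flat).decode("utf-16-le")


def iter_struct_array(f, name="struArray"):
    """Yield one dict (field name -> dereferenced data) per struct element.

    `f` is an open h5py.File; each field dataset is assumed to be 2-D
    and to hold one object reference per struct element.
    """
    group = f[name]
    fields = list(group.keys())
    rows, cols = group[fields[0]].shape
    for r in range(rows):
        for c in range(cols):
            # f[ref] dereferences the HDF5 reference, [()] reads the whole dataset.
            yield {field: f[group[field][r, c]][()] for field in fields}


def main(path="test.mat"):
    import h5py  # third-party; imported here so the helpers above stay dependency-free

    with h5py.File(path, "r") as f:
        for elem in iter_struct_array(f):
            name = decode_matlab_chars(elem["name"])
            ident = elem["id"][0, 0]      # numbers are stored as (1, 1) arrays
            data = elem["data"].ravel()   # flatten the (N, 1) column
            print(name, ident, data)
```

Calling main('test.mat') would print one line per struct element. Empty arrays and nested structs or cells are not handled by this sketch.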

visit or visititems is a quick way of seeing the overall structure of an h5py file:

fs['struArray'].visititems(lambda n,o:print(n, o))

When I run this on a file produced by Octave's save -hdf5, I get:

type <HDF5 dataset "type": shape (), type "|S7">
value <HDF5 group "/struArray/value" (3 members)>
value/data <HDF5 group "/struArray/value/data" (2 members)>
value/data/type <HDF5 dataset "type": shape (), type "|S5">
value/data/value <HDF5 group "/struArray/value/data/value" (4 members)>
value/data/value/_0 <HDF5 group "/struArray/value/data/value/_0" (2 members)>
value/data/value/_0/type <HDF5 dataset "type": shape (), type "|S7">
value/data/value/_0/value <HDF5 dataset "value": shape (10, 1), type "<f8">
value/data/value/_1 <HDF5 group "/struArray/value/data/value/_1" (2 members)>
...
value/data/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">
value/id <HDF5 group "/struArray/value/id" (2 members)>
value/id/type <HDF5 dataset "type": shape (), type "|S5">
value/id/value <HDF5 group "/struArray/value/id/value" (4 members)>
value/id/value/_0 <HDF5 group "/struArray/value/id/value/_0" (2 members)>
...
value/id/value/_2/value <HDF5 dataset "value": shape (), type "<f8">
value/id/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">
value/name <HDF5 group "/struArray/value/name" (2 members)>
...
value/name/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">

This may not be the same as what MATLAB 7.3 produces, but it gives an idea of a structure's complexity.

A more refined callback can display values, and could be the starting point for recreating a Python object (dictionary, lists, etc.).

import h5py

def callback(name, obj):
    if name.endswith('type'):
        print('type:', obj[()])  # obj[()] reads the dataset (.value was removed in h5py 3.0)
    elif name.endswith('value'):
        if isinstance(obj, h5py.Dataset):
            print(obj[()].T)  # http://stackoverflow.com/questions/21624653
    elif name.endswith('dims'):
        print('dims:', obj[()])
    else:
        print('name:', name)

fs.visititems(callback)  # fs is the open h5py.File

produces:

name: struArray
type: b'struct'
name: struArray/value/data
type: b'cell'
name: struArray/value/data/value/_0
type: b'matrix'
[[  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.]]
name: struArray/value/data/value/_1
type: b'matrix'
[[ 3.  4.  5.  6.  7.  8.  9.]]
name: struArray/value/data/value/_2
type: b'scalar'
0.0
dims: [3 1]
name: struArray/value/id
type: b'cell'
name: struArray/value/id/value/_0
type: b'scalar'
1.0
...
dims: [3 1]
name: struArray/value/name
type: b'cell'
name: struArray/value/name/value/_0
type: b'sq_string'
[[111 110 101]]
...
dims: [3 1]

I would start by firing up the interpreter and running help on struArray. It should give you enough information to get you started. Failing that, you can dump the attributes of most Python objects by printing the __dict__ attribute.

I'm sorry, but I think it will be quite challenging to get the contents of cells/structures from outside MATLAB. If you view the produced files (e.g. with HDFView) you will see there are lots of cross-references and no obvious way to proceed.

If you stick to simple numerical arrays it works fine. If you have small cell arrays containing numerical arrays, you can convert them to separate variables (i.e. cellcontents1, cellcontents2, etc.), which usually takes just a few lines and allows them to be saved and loaded directly. So in your example I would save a file with the variables name1, name2, name3, id1, id2, id3, and so on.

EDIT: You specified h5py in the question, so that's what I answered, but it is worth mentioning that with scipy.io.loadmat you should be able to get the original variables converted to numpy equivalents (e.g. object arrays). Note that loadmat only reads files saved in v7 format or earlier; for a v7.3 file it raises NotImplementedError, so this route requires re-saving the file without -v7.3.
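
One way to decide between scipy.io.loadmat and an HDF5 reader is to look at the text header at the start of the file. The sketch below assumes the usual header strings ("MATLAB 7.3 MAT-file" for v7.3, "MATLAB 5.0 MAT-file" for the v5 to v7 family); classify_mat_header and mat_file_flavor are hypothetical helper names.

```python
def classify_mat_header(header):
    """Classify a MAT-file from its first bytes.

    Returns 'v7.3' for HDF5-based files, 'v5-v7' for the older binary
    format, and 'unknown' for anything else (e.g. v4 files, which have
    no text header).
    """
    if header.startswith(b"MATLAB 7.3 MAT-file"):
        return "v7.3"
    if header.startswith(b"MATLAB 5.0 MAT-file"):
        return "v5-v7"
    return "unknown"


def mat_file_flavor(path):
    """Read the start of `path` and classify it with classify_mat_header."""
    with open(path, "rb") as fh:
        return classify_mat_header(fh.read(128))
```

A file reported as 'v7.3' would go to h5py (or a wrapper around it), while a 'v5-v7' file can be passed to scipy.io.loadmat.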

I know of two solutions (one of which I wrote myself, and which works better if the *.mat file is very large or very deep) that abstract away your direct interactions with the h5py library.

  • the hdf5storage package, which is well maintained and meant to help load v7.3 matfiles into Python
  • my own matfile loader, which I wrote to overcome certain problems that even the latest version (0.2.0) of hdf5storage has when loading large (~500 MB) and/or deep arrays (I'm actually not sure which of the two causes the issue)

Assuming you've downloaded both packages to a place where you can load them into Python, you can see that they produce similar outputs for your example 'test.mat':

In [1]: pyInMine = LoadMatFile('test.mat')
In [2]: pyInHdf5 = hdf5.loadmat('test.mat')
In [3]: pyInMine.keys()
Out[3]: dict_keys(['struArray'])
In [4]: pyInMine['struArray'].keys()
Out[4]: dict_keys(['data', 'id', 'name'])
In [5]: pyInHdf5.keys()
Out[5]: dict_keys(['struArray'])
In [6]: pyInHdf5['struArray'].dtype
Out[6]: dtype([('name', 'O'), ('id', '<f8', (1, 1)), ('data', 'O')])
In [7]: pyInHdf5['struArray']['data']
Out[7]:
array([[array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]]),
        array([[3., 4., 5., 6., 7., 8., 9.]]), array([[0.]])]],
      dtype=object)
In [8]: pyInMine['struArray']['data']
Out[8]:
array([[array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]]),
        array([[3., 4., 5., 6., 7., 8., 9.]]), array([[0.]])]],
      dtype=object)

The big difference is that my library converts MATLAB structure arrays into Python dictionaries whose keys are the structure's fields, whereas hdf5storage converts them into numpy structured arrays whose dtype stores the fields.

I also note that the indexing behavior of the array differs from what you would expect in MATLAB. Specifically, in MATLAB, in order to get the name field of the second structure, you would index the structure first:

[Matlab] >> struArray(2).name
[Matlab] >> 'two'

In my package, you have to first grab the field and then index:

In [9]: pyInMine['struArray'].shape                                                                                                                              
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-64-a2f85945642b> in <module>
----> 1 pyInMine['struArray'].shape

AttributeError: 'dict' object has no attribute 'shape'
In [10]: pyInMine['struArray']['name'].shape
Out[10]: (1, 3)
In [11]: pyInMine['struArray']['name'][0,1]
Out[11]: 'two'
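
If the MATLAB order (index the element first, then take the field) is preferred, a dictionary of fields can be turned into a list of per-element dictionaries. The following is a minimal sketch in plain Python; fields_to_records is a hypothetical helper, not part of either package, and it assumes every field holds the same number of elements.

```python
def fields_to_records(struct_fields):
    """Turn {'name': [...], 'id': [...]} into [{'name': ..., 'id': ...}, ...]."""
    columns = {}
    for field, values in struct_fields.items():
        # NumPy arrays (e.g. shape (1, 3)) are flattened; plain sequences are used as-is.
        flat = values.ravel() if hasattr(values, "ravel") else values
        columns[field] = list(flat)
    lengths = {len(col) for col in columns.values()}
    if len(lengths) > 1:
        raise ValueError("fields have different lengths: %r" % sorted(lengths))
    count = lengths.pop() if lengths else 0
    return [{field: col[i] for field, col in columns.items()} for i in range(count)]


records = fields_to_records({
    "name": ["one", "two", "three"],
    "id": [1.0, 2.0, 3.0],
})
second = records[1]  # Python is 0-based, so this corresponds to struArray(2)
print(second["name"], second["id"])
```

The same helper could be applied to a dictionary such as pyInMine['struArray'], since arrays are flattened before the records are built.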

The hdf5storage package is a little bit nicer and lets you either index the structure and then grab the field, or vice versa, because of how numpy structured arrays work:

In [12]: pyInHdf5['struArray'].shape
Out[12]: (1, 3)
In [13]: pyInHdf5['struArray'][0,1]['name']
Out[13]: array([['two']], dtype='<U3')
In [14]: pyInHdf5['struArray']['name'].shape
Out[14]: (1, 3)
In [15]: pyInHdf5['struArray']['name'][0,1]
Out[15]: array([['two']], dtype='<U3')

Again, the two packages treat the final output a little differently, but in general both are quite good at reading v7.3 matfiles. A final thought: with files of ~500 MB or more, I've found that the hdf5storage package hangs while loading, while my package does not (though it still takes ~1.5 minutes to complete the load).

This really is a problem with MATLAB 7.3 and h5py. My trick is to convert the h5py._hl.dataset.Dataset object to a numpy array. For example,

np.array(data['data'])

will solve your problem with the 'data' field.
