MATLAB v7.3 files written from Python with hdf5storage are bloated and very slow to create

Question

I am trying to write numpy data to a .mat file using hdf5storage.

import hdf5storage
from numpy import array

# For example
numpy_array = [
    array([(b'<detect>', 192, 1)], dtype=[('packet_sync', 'S8'), ('n_bytes', '<u4'), ('n_detect', '<u4')]),
    array([(b'<detect>', 192, 2)], dtype=[('packet_sync', 'S8'), ('n_bytes', '<u4'), ('n_detect', '<u4')]),
]

# The actual array is 192 bytes, and the binary file I am trying to create a .mat file for contains thousands of these packets.

data = {"data": numpy_array}

hdf5storage.savemat(file_name="data.mat", mdict=data, format="7.3")

That uses the convenience function. Equivalently:

hdf5storage.write(data, '.', 'data.mat', matlab_compatible=True)

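The resulting file is more than 10 times the size of the binary file. The data is a Python list of numpy structured arrays whose fields are basic C types (<u4, <f4, S8, ...). Processing a 70 MB file also takes more than an hour. That seems wrong, but I don't have much experience with the HDF5 format, so it may be expected.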

When I tested saving a similar variable from MATLAB

save("test.mat", 'variable', '-v7.3')

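the file was still much larger than the binary data. So, as @hpaulj pointed out, HDF5 is not a compact format. But the time it takes to save from Python is also unacceptable. In MATLAB the file is saved within a few seconds, while saving the same data with the hdf5storage library takes about an hour. Perhaps the library just performs poorly?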

Watching the disk while it runs, iotop reports a write rate of 2-3 M/s, yet the file only grows by about 0.5 MB/s.

I would like to avoid writing separate .mat files.

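With scipy's savemat I can save files up to the MATLAB v5 limit of 2 GB, but we generate more data than that and would like to use the MATLAB v7.3 format. So the problem seems to lie with the hdf5storage library, since scipy still works.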

Does the MATLAB v7.3 format have some numpy dtype restrictions?

Why do these files bloat? Am I missing an option in hdf5storage? I have read the documentation and looked through part of the code, to no avail.

Alternatively, I could try writing a plain HDF5 file and loading that into MATLAB

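import h5py
import numpy as np

# h5py cannot store a dict or a list of arrays directly, so stack the packets into one structured array first
packets = np.concatenate(numpy_array)
with h5py.File("test.h5", "w") as hf:
    hf.create_dataset("data", data=packets)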

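Edit: I found that my trouble was probably caused by non-homogeneous data shapes. My packets can be of variable size. HDF5 apparently does not handle this well, so it is important to structure the data so that it is homogeneous.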
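One way to read the edit above is to stop passing a Python list of one-record arrays and instead hand the writer one contiguous structured array per packet layout. The sketch below is only an illustration under that assumption: `stack_packets` is a hypothetical helper, not part of hdf5storage or h5py, and the dtype is copied from the example in the question.

```python
import numpy as np

# Packet layout copied from the question's example; real packets may differ.
packet_dtype = np.dtype(
    [("packet_sync", "S8"), ("n_bytes", "<u4"), ("n_detect", "<u4")]
)


def stack_packets(packets):
    """Group one-record arrays by dtype and concatenate each group.

    Hypothetical helper: returns a dict mapping each dtype to a single
    1-D structured array holding every packet with that layout.
    """
    groups = {}
    for packet in packets:
        groups.setdefault(packet.dtype, []).append(packet)
    return {dtype: np.concatenate(chunks) for dtype, chunks in groups.items()}


packets = [
    np.array([(b"<detect>", 192, 1)], dtype=packet_dtype),
    np.array([(b"<detect>", 192, 2)], dtype=packet_dtype),
]

stacked = stack_packets(packets)
table = stacked[packet_dtype]
print(table.shape, table.nbytes)
```

Each value in `stacked` is a regular, fixed-record array, which is the kind of homogeneous data the edit refers to. Packets with a different layout end up under their own dtype key rather than being mixed into the same array.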

Answer 1

I don't have hdf5storage installed.

In [21]: numpy_array = np.array(
    ...:     [(b"<detect>", 192, 1)],
    ...:     dtype=[("packet_sync", "S8"), ("n_bytes", "<u4"), ("n_detect", "<u4")],
    ...: ) 
In [22]: numpy_array
Out[22]: 
array([(b'<detect>', 192, 1)],
      dtype=[('packet_sync', 'S8'), ('n_bytes', '<u4'), ('n_detect', '<u4')])
In [23]: numpy_array.nbytes
Out[23]: 16
In [24]: data = {"data": numpy_array}

But using the pre-7.3 format:

In [25]: from scipy import io
In [26]: io.savemat("test712.mat", data)
In [27]: io.loadmat("test712.mat")
Out[27]: 
{'__header__': b'MATLAB 5.0 MAT-file Platform: posix, Created on: Tue Feb 22 18:02:29 2022',
 '__version__': '1.0',
 '__globals__': [],
 'data': array([[(array(['<detect>'], dtype='<U8'), array([[192]], dtype=uint32), array([[1]], dtype=uint32))]],
       dtype=[('packet_sync', 'O'), ('n_bytes', 'O'), ('n_detect', 'O')])}
In [28]: ll test712.mat
...
-rw-rw-r-- 1 paul 408 Feb 22 18:02 test712.mat

You give few details about the "bloat", but note that here a 16-byte array is saved to a 408-byte file.

Saving with native numpy gives a somewhat smaller file. Most of it is a header block that specifies the shape and dtype:

In [29]: np.save("test712.npy", numpy_array)
In [30]: ll test712.npy
-rw-rw-r-- 1 paul 208 Feb 22 18:05 test712.npy

And saving with the more basic h5py:

In [32]: f = h5py.File("test712.h5", "w")
In [33]: f.create_dataset("array", data=numpy_array)
Out[33]: <HDF5 dataset "array": shape (1,), type "|V16">
In [34]: f.close()
In [35]: %ll test712.h5
-rw-rw-r-- 1 paul 2064 Feb 22 18:08 test712.h5

In [37]: f = h5py.File("test712.h5", "r")
In [40]: f["array"][:]
Out[40]: 
array([(b'<detect>', 192, 1)],
      dtype=[('packet_sync', 'S8'), ('n_bytes', '<u4'), ('n_detect', '<u4')])

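The HDF5 file format is not a compact one, so I don't think it makes sense to talk about bloat. In this case I suspect that most of the size comes from layout and headers rather than from the data itself. Saving a 4-element array (64 bytes instead of 16) would probably make little or no difference to the file size, for any of these formats.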
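The fixed-overhead point can be checked for the .npy case without writing to disk. This is a minimal sketch that only uses numpy and an in-memory buffer; `npy_size` is a helper name made up for the illustration, and the same experiment could be repeated with h5py or scipy if they are installed.

```python
import io

import numpy as np

# Same record layout as in the session above (16 bytes per record).
packet_dtype = np.dtype(
    [("packet_sync", "S8"), ("n_bytes", "<u4"), ("n_detect", "<u4")]
)


def npy_size(array):
    """Return the number of bytes np.save writes for this array."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    return buffer.getbuffer().nbytes


one = np.zeros(1, dtype=packet_dtype)
four = np.zeros(4, dtype=packet_dtype)

# Overhead = everything in the file that is not raw array data.
overhead_one = npy_size(one) - one.nbytes
overhead_four = npy_size(four) - four.nbytes
print(one.nbytes, four.nbytes, overhead_one, overhead_four)
```

The two overhead values come out equal, so for tiny arrays the header dominates the file and the ratio of file size to data size says little about how a large file will behave.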
