How to extract data from HDF5 in Python?

Question

I have the following HDF5 file. I can extract a list, ['model_cints'], from data, but I don't know how to show the data within that list.

https://drive.google.com/drive/folders/1p0J7X4n7A39lHZpCAvv_cw3u-JUZ4WFU?usp=sharing

I tried converting it to a NumPy array with this code, but I get these messages:

npa = np.asarray(data, dtype=np.float32)

 
ValueError: could not convert string to float: 'model_cints'


npa = np.asarray(data)

npa
Out[54]: array(['model_cints'], dtype='<U11')

This is the code:

import h5py

filename = "example.hdf5"

with h5py.File(filename, "r") as f:
    # List all groups
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]

    # Get the data
    data = list(f[a_group_key])

The data is inside ['model_cints'].

Answer 1

If you are new to HDF5, I suggest a crawl, walk, run approach to understand the HDF5 data model, your specific data schema, and how to use the various APIs (including h5py and PyTables). HDF5 is designed to be self-describing. In other words, you can figure out the schema by inspection. Understanding the schema is key to working with your data. Coding before you understand the schema is incredibly frustrating (been there, done that).

I suggest new users start with HDFView from The HDF Group. This is a utility to view the data in a GUI without writing code. And, when you start writing code, it's also helpful for visually verifying that you read the data correctly.

Next, learn how to traverse the data structure. In h5py, you can do this with the visititems() method. I recently wrote an SO answer with an example. See this answer: SO 65793692: visititems() method to recursively walk nodes
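As a rough illustration of that traversal, here is a minimal sketch. It does not use the asker's file: it writes a small stand-in file to a temporary path, and the file name, the describe callback, and the two-field dtype are all assumptions made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file: one group 'data' holding one dataset 'model_cints'.
tmp_path = os.path.join(tempfile.mkdtemp(), "sketch.hdf5")
dt = np.dtype([("fs_date", "<f8"), ("prob", "i1")])
with h5py.File(tmp_path, "w") as h5f:
    # Intermediate groups in the path are created automatically.
    h5f.create_dataset("data/model_cints", data=np.zeros(3, dtype=dt))

found = []

def describe(name, obj):
    """Callback for visititems(): record each node's path and kind."""
    kind = "Dataset" if isinstance(obj, h5py.Dataset) else "Group"
    found.append((name, kind))
    print(name, kind)
    # Returning None tells h5py to keep walking.

with h5py.File(tmp_path, "r") as h5f:
    h5f.visititems(describe)
```

For a Dataset node, the callback can also print obj.dtype and obj.shape, which is usually enough to work out the schema without a GUI.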

In your case, it sounds like you only need to read the data in a dataset defined by this path: ['data/model_cints'] or ['data']['model_cints']. Both are valid path definitions. ('data' is a Group and 'model_cints' is a Dataset. Groups are similar to folders/directories and Datasets are like files.)
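To show that the two path forms reach the same dataset, here is a small sketch. It runs against a throwaway file created on the spot rather than the asker's file; the temporary file name and the float values are assumptions.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file with the same group/dataset names as the question.
tmp_path = os.path.join(tempfile.mkdtemp(), "paths_sketch.hdf5")
with h5py.File(tmp_path, "w") as h5f:
    h5f.create_dataset("data/model_cints", data=np.arange(4.0))

with h5py.File(tmp_path, "r") as h5f:
    ds_a = h5f["data/model_cints"]       # single path string
    ds_b = h5f["data"]["model_cints"]    # group first, then dataset
    full_name = ds_a.name                # absolute path inside the file
    same_name = ds_a.name == ds_b.name
    values_match = bool(np.array_equal(ds_a[:], ds_b[:]))
    print(full_name, same_name, values_match)
```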

Once you have a dataset path, you need to get the data type (like a NumPy dtype). You get this (and the shape attribute) with h5py the same way you do with NumPy. This is what I get for your dtype:
[('fs_date', '<f8'), ('date', '<f8'), ('prob', 'i1'), ('ymin', '<f8'), ('ymax', '<f8'), ('type', 'O'), ('name', 'O')]

What you have is an array of mixed type: 4 floats, 1 int, and 2 strings. This is extracted as a NumPy record array. This is different from a typical ndarray, where all elements are the same type (all ints, or floats, or strings). You access the data with row indices (integers) and field names (although you can also use column indices).
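The access patterns can be tried with NumPy alone. The sketch below builds a small array with three of the field names from the dtype above; the row values are invented for illustration and are not taken from the file.

```python
import numpy as np

# Hypothetical rows; only the field names come from the dtype shown above.
dt = np.dtype([("fs_date", "<f8"), ("prob", "i1"), ("name", "O")])
rec_arr = np.array([(1.0, 5, "a"), (2.0, 50, "b"), (3.0, 95, "c")], dtype=dt)

first_row = rec_arr[0]            # one record, all fields
prob_col = rec_arr["prob"]        # one field across all rows (int8 ndarray)
one_value = rec_arr[1]["name"]    # row index, then field name
same_value = rec_arr[1][2]        # row index, then column position
high_prob = rec_arr[rec_arr["prob"] > 40]   # rows selected by a boolean mask

print(first_row, prob_col, one_value, same_value, len(high_prob))
```

Selecting a field by name returns an ordinary single-type ndarray, so it can be passed to any function that expects one.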

I pulled all of this together in the code below. It shows different methods to access the data. (Hopefully the multiple methods don't confuse this explanation.) Each is useful depending on how you want to read the data.

Note: This data looks like results from several tests combined into a single file. If you want to query it to get particular test values, you should investigate PyTables. It has some powerful search capabilities not available in h5py that simplify that task. Good luck.

import h5py

with h5py.File("example.hdf5", "r") as h5f:
    # Get an h5py dataset object (no data is read yet)
    data_ds = h5f['data']['model_cints']
    print('data_ds dtype:', data_ds.dtype, '\nshape:', data_ds.shape)

    # Get an array with the fs_date field only
    fs_date_arr = data_ds[:]['fs_date']
    print('fs_date_arr dtype:', fs_date_arr.dtype, '\nshape:', fs_date_arr.shape)

    # Get the entire dataset as one NumPy record array
    data_arr_all = h5f['data']['model_cints'][:]
    # this also works:
    data_arr_all = data_ds[:]
    print('data_arr_all dtype:', data_arr_all.dtype, '\nshape:', data_arr_all.shape)

    # Get the first 6 rows as one NumPy record array
    data_arr6 = h5f['data']['model_cints'][0:6]
    # this also works:
    data_arr6 = data_ds[0:6]
    print('data_arr6 dtype:', data_arr6.dtype, '\nshape:', data_arr6.shape)

Answer 2

f['data'] is a Group object, which means it has keys. When you make an iterable out of it, e.g., list(f['data']), or you iterate over it, for something in f['data']:, you're going to get its keys, of which it has one. This explains

>>> np.array(f['data'])
array(['model_cints'], dtype='<U11')

What you want instead is

data = np.array(f['data']['model_cints'])
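The difference between iterating a Group and reading a Dataset can be checked with a short sketch. It uses a stand-in file written to a temporary path; the file name and the numeric-only dtype are assumptions, not the asker's actual schema.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file with the same group/dataset names as the question.
tmp_path = os.path.join(tempfile.mkdtemp(), "group_sketch.hdf5")
dt = np.dtype([("fs_date", "<f8"), ("prob", "i1")])
with h5py.File(tmp_path, "w") as h5f:
    h5f.create_dataset("data/model_cints", data=np.ones(3, dtype=dt))

with h5py.File(tmp_path, "r") as h5f:
    # Iterating a Group yields the names of its members, not their contents.
    group_keys = list(h5f["data"])
    # Indexing the Dataset reads its contents into a NumPy array.
    data = h5f["data"]["model_cints"][()]

print(group_keys)
print(data.dtype.names, data.shape)
```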


Answer 1
1 2021-01-24 19:32:10

Answer 2
0 2021-01-24 01:15:51
