
How to extract data from HDF5 in Python?

I have the following HDF5 file. I can extract a list, ['model_cints'], from inside data, but I don't know how to show the data that this list refers to.

https://drive.google.com/drive/folders/1p0J7X4n7A39lHZpCAvv_cw3u-JUZ4WFU?usp=sharing

I tried converting it to a NumPy array with the code below, but I get these messages:

npa = np.asarray(data, dtype=np.float32)

 
ValueError: could not convert string to float: 'model_cints'


npa = np.asarray(data)

npa
Out[54]: array(['model_cints'], dtype='<U11')

This is the code:

import h5py
import numpy as np

filename = "example.hdf5"

with h5py.File(filename, "r") as f:
    # List all groups
    print("Keys: %s" % f.keys())
    a_group_key = list(f.keys())[0]

    # Get the data
    data = list(f[a_group_key])

The data is inside ['model_cints'].

If you are new to HDF5, I suggest a crawl, walk, run approach to understand the HDF5 data model, your specific data schema, and how to use the various APIs (including h5py and PyTables). HDF5 is designed to be self-describing. In other words, you can figure out the schema by inspection. Understanding the schema is key to working with your data. Coding before you understand the schema is incredibly frustrating (been there, done that).

I suggest new users start with HDFView from The HDF Group. This is a utility to view the data in a GUI without writing code. Once you start writing code, it also helps you visually verify that you read the data correctly.

Next, learn how to traverse the data structure. In h5py, you can do this with the visititems() method. I recently wrote an SO answer with an example. See this answer: SO 65793692: visititems() method to recursively walk nodes.
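As a rough illustration of the traversal step, here is a minimal sketch of visititems(). It does not use the file from the question: it writes a small throwaway file with the same 'data' / 'model_cints' layout, and the file name visit_demo.hdf5, the callback name record_node, and the dataset contents are all made up for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small throwaway file that mimics the layout discussed here:
# a group named 'data' holding a dataset named 'model_cints'.
# The contents are placeholder values, not the real data.
tmp_path = os.path.join(tempfile.mkdtemp(), "visit_demo.hdf5")
with h5py.File(tmp_path, "w") as h5f:
    grp = h5f.create_group("data")
    grp.create_dataset("model_cints", data=np.arange(6.0))

found = []  # (path, kind) pairs collected during the walk


def record_node(name, obj):
    """Callback for visititems(): called once per group or dataset."""
    if isinstance(obj, h5py.Dataset):
        found.append((name, "Dataset"))
        print(f"Dataset: {name}  shape={obj.shape}  dtype={obj.dtype}")
    elif isinstance(obj, h5py.Group):
        found.append((name, "Group"))
        print(f"Group:   {name}")
    # Returning None tells visititems() to keep walking.


with h5py.File(tmp_path, "r") as h5f:
    h5f.visititems(record_node)
```

Pointing the same callback at your own file should list every group and dataset path, which is usually enough to work out the schema.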

In your case, it sounds like you only need to read the data in the dataset at this path: f['data/model_cints'] or f['data']['model_cints']. Both are valid ways to reference it. ('data' is a Group and 'model_cints' is a Dataset. Groups are similar to folders/directories and Datasets are like files.)
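To make the Group/Dataset distinction concrete, the sketch below checks both path styles against a stand-in file. The file name path_demo.hdf5 and the three float values are assumptions for illustration only. The real dataset has a compound dtype, but the addressing works the same way.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in file with the same group/dataset names as in the question.
tmp_path = os.path.join(tempfile.mkdtemp(), "path_demo.hdf5")
with h5py.File(tmp_path, "w") as h5f:
    # Intermediate groups in the path ('data') are created automatically.
    h5f.create_dataset("data/model_cints", data=np.array([1.0, 2.0, 3.0]))

with h5py.File(tmp_path, "r") as h5f:
    # 'data' is a Group (like a folder); 'model_cints' is a Dataset (like a file).
    is_group = isinstance(h5f["data"], h5py.Group)
    is_dataset = isinstance(h5f["data/model_cints"], h5py.Dataset)

    # Single path string versus step-by-step indexing.
    by_path = h5f["data/model_cints"][:]
    by_steps = h5f["data"]["model_cints"][:]
    same = np.array_equal(by_path, by_steps)

print("group:", is_group, "dataset:", is_dataset, "same data:", same)
```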

Once you have a dataset path, you need to get the data type (like a NumPy dtype). You get this (and the shape attribute) with h5py the same way you do with NumPy. This is what I get for your dtype:
[('fs_date', '<f8'), ('date', '<f8'), ('prob', 'i1'), ('ymin', '<f8'), ('ymax', '<f8'), ('type', 'O'), ('name', 'O')]

What you have is an array of mixed type: 4 floats, 1 int, and 2 strings. It is extracted as a NumPy record (structured) array. This is different from a typical ndarray, where all elements are the same type (all ints, or floats, or strings). You access the data with row indices (integers) and field names (although you can also use column indices).
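The access patterns for a mixed-type array can be tried with NumPy alone. The sketch below reuses the dtype reported above, but the two rows are invented placeholder values, not data from the file.

```python
import numpy as np

# Same field layout as the dtype reported for 'model_cints'.
dt = np.dtype([('fs_date', '<f8'), ('date', '<f8'), ('prob', 'i1'),
               ('ymin', '<f8'), ('ymax', '<f8'), ('type', 'O'), ('name', 'O')])

# Two made-up rows, purely for illustration.
rows = np.array([
    (2020.0, 2020.5, 50, 1.0, 2.0, "cint", "test_a"),
    (2020.0, 2021.0, 90, 0.5, 2.5, "cint", "test_b"),
], dtype=dt)

first_row = rows[0]            # one record, selected by row index
prob_col = rows['prob']        # one field across all rows, selected by name
name_of_row1 = rows[1]['name'] # row index, then field name
prob_of_row0 = rows[0][2]      # row index, then column position ('prob' is field 2)

print(first_row)
print(prob_col, prob_col.dtype)
print(name_of_row1, prob_of_row0)
```

Selecting a field by name returns an ordinary ndarray with a single dtype, which is usually what downstream numeric code expects.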

I pulled all of this together in the code below. It shows different methods to access the data. (Hopefully the multiple methods don't confuse this explanation.) Each is useful depending on how you want to read the data.

Note: This data looks like results from several tests combined into a single file. If you want to query the data to get particular test values, you should investigate PyTables. It has some powerful search capabilities not available in h5py that simplify that task. Good luck.

import h5py

with h5py.File("example.hdf5", "r") as h5f:
    # Get an h5py dataset object (no data is read yet)
    data_ds = h5f['data']['model_cints']
    print('data_ds dtype:', data_ds.dtype, '\nshape:', data_ds.shape)

    # Get an array with the fs_date field only
    fs_date_arr = data_ds[:]['fs_date']
    print('fs_date_arr dtype:', fs_date_arr.dtype, '\nshape:', fs_date_arr.shape)

    # Get the entire dataset as one NumPy record array
    data_arr_all = h5f['data']['model_cints'][:]
    # This also works:
    data_arr_all = data_ds[:]
    print('data_arr_all dtype:', data_arr_all.dtype, '\nshape:', data_arr_all.shape)

    # Get the first 6 rows as one NumPy record array
    data_arr6 = h5f['data']['model_cints'][0:6]
    # This also works:
    data_arr6 = data_ds[0:6]
    print('data_arr6 dtype:', data_arr6.dtype, '\nshape:', data_arr6.shape)

f['data'] is a Group object, which means it has keys. When you make an iterable out of it, e.g., list(f['data']), or you iterate over it with for something in f['data']:, you get its keys, of which it has one. This explains

>>> np.array(f['data'])
array(['model_cints'], dtype='<U11')

What you want instead is

data = np.array(f['data']['model_cints'])
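The difference between the two calls can be reproduced on a stand-in file. In this sketch, the file name keys_demo.hdf5 and the float32 values are assumptions. The real 'model_cints' dataset has a compound dtype, so its array would be a record array rather than plain floats.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in file: group 'data' containing a dataset 'model_cints'.
tmp_path = os.path.join(tempfile.mkdtemp(), "keys_demo.hdf5")
with h5py.File(tmp_path, "w") as f:
    f.create_dataset("data/model_cints",
                     data=np.array([0.1, 0.2, 0.3], dtype=np.float32))

with h5py.File(tmp_path, "r") as f:
    # Iterating a Group yields the names of its members, not their contents.
    keys = list(f["data"])
    as_keys = np.asarray(keys)

    # Indexing down to the Dataset and converting it yields the stored values.
    values = np.array(f["data"]["model_cints"])

print(keys)
print(as_keys.dtype)
print(values.dtype, values.shape)
```

The string array is the same kind of result as the '<U11' output in the question, and it is why requesting dtype=np.float32 raised the ValueError there.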


