How to deal with hdf5 files in R?

I have a file in hdf5 format. I know that it is supposed to be a matrix, but I want to read that matrix in R so that I can study it. I see that there is an h5r package that is supposed to help with this, but I do not see any tutorial that is simple to read and understand. Is such a tutorial available online? Specifically, how do you read an hdf5 object with this package, and how do you actually extract the matrix?

UPDATE

I found a package, rhdf5, which is not part of CRAN but is part of Bioconductor. The interface is relatively easy to understand, and the documentation and example code are quite clear. I could use it without problems. My problem, it seems, was the input file. The matrix that I wanted to read was actually stored in the hdf5 file as a Python pickle. So every time I tried to open it and access it through R, I got a segmentation fault. I did figure out how to save the matrix from within Python as a tsv file, and now that problem is solved.
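Once the matrix has been exported as a tsv file, base R is enough to load it. The following is a minimal sketch, assuming a tab-separated file with no header row; the file name and contents are hypothetical stand-ins created on the fly so the example runs on its own.

```r
# Hypothetical stand-in for the tsv file exported from Python
# (tab-separated, no header row).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("1\t2\t3", "4\t5\t6"), tsv_path)

# Read the table and convert the resulting data frame to a matrix.
mat <- as.matrix(read.table(tsv_path, sep = "\t", header = FALSE))

# Drop the default V1, V2, ... column names added by read.table().
dimnames(mat) <- NULL

str(mat)
```

If the exported file does have a header row, `header = TRUE` would be the corresponding setting.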

The rhdf5 package works really well, although it is not on CRAN. Install it from Bioconductor:

# as of 2020-09-08, these are the updated instructions per
# https://bioconductor.org/install/

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# install rhdf5 from the Bioconductor release that matches your R version
BiocManager::install("rhdf5")

And to use it:

library(rhdf5)

List the objects within the file to find the data group you want to read:

h5ls("path/to/file.h5")

Read the HDF5 data:

mydata <- h5read("path/to/file.h5", "/mygroup/mydata")

And inspect the structure:

str(mydata)

(Note that multidimensional arrays may appear transposed.) You can also read groups, which become named lists in R.
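The steps above can be tried end to end on a throwaway file. This is only a sketch: the group and dataset names mirror the placeholders used above, the file is a temporary one created for illustration, and `aperm()` is shown as one possible way to undo a transposition when the file was written by a row-major tool such as Python or C.

```r
library(rhdf5)

# Temporary file so the example does not touch real data (names are illustrative).
path <- tempfile(fileext = ".h5")
h5createFile(path)
h5createGroup(path, "mygroup")

# Write a small 2 x 3 matrix as dataset "mydata" inside "mygroup".
m <- matrix(1:6, nrow = 2)
h5write(m, path, "mygroup/mydata")

# List the contents, then read the whole dataset back.
h5ls(path)
mydata <- h5read(path, "/mygroup/mydata")
str(mydata)

# Read only part of the dataset: row 1, all columns.
part <- h5read(path, "/mygroup/mydata", index = list(1, NULL))

# Reading a group returns a named list with one element per dataset.
grp <- h5read(path, "/mygroup")
names(grp)

# If the dimensions arrive reversed, aperm() (or t() for a matrix) swaps them.
swapped <- aperm(mydata)

h5closeAll()
```

A matrix written and read by rhdf5 itself comes back with its original dimensions, so the `aperm()` step only matters for files produced elsewhere.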

You could also use h5, a package which I recently published on CRAN. Compared to rhdf5 it has the following features:

  1. S4 object model to directly interact with HDF5 objects like files, groups, datasets and attributes.
  2. Simpler syntax, with R-like subsetting operators for datasets, supporting commands like readdata <- dataset[1:3, 1:3] and dataset[1:3, 1:3] <- matrix(1:9, nrow = 3).
  3. Support for NA values for all data types.
  4. 200+ test cases with a code coverage of 80%+.

To save a matrix you could use:

library(h5)
testmat <- matrix(rnorm(120), ncol = 3)
# Create HDF5 File
file <- h5file("test.h5")
# Save matrix to file in group 'testgroup' and datasetname 'testmat'
file["testgroup", "testmat"] <- testmat
# Close file
h5close(file)

... and read the entire matrix back into R:

file <- h5file("test.h5")
testmat_in <- file["testgroup", "testmat"][]
h5close(file)

See also h5 on CRAN.

I used the rgdal package to read HDF5 files. Be aware that the binary version of rgdal probably does not support hdf5. In that case, you need to build gdal from source with HDF5 support before building rgdal from source.

Alternatively, try to convert the files from hdf5 to netcdf. Once they are in netcdf, you can use the excellent ncdf package to access the data. I think the conversion could be done with the cdo tool.

The ncdf4 package, an interface to netCDF-4, can also be used to read hdf5 files (netCDF-4 is compatible with netCDF-3, but it uses hdf5 as the storage layer).

In the developer's words:

NetCDF-4 combines the netCDF-3 and HDF5 data models, taking the desirable characteristics of each, while taking advantage of their separate strengths.

The netCDF-4 format implements and expands the netCDF-3 data model by using an enhanced version of HDF5 as the storage layer.

In practice, ncdf4 provides a simple interface, and migrating code from the older hdf5 and ncdf packages to a single ncdf4 package has made our code less buggy and easier to write (some of my trials and workarounds are documented in my previous answer).
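To give an idea of what the ncdf4 interface looks like, here is a small round trip through a netCDF-4 (HDF5-backed) file. It is a sketch under assumptions: the dimension and variable names ("x", "y", "testvar") and the units strings are made up for illustration, and the file is a temporary one. Whether an arbitrary hdf5 file produced by another tool opens cleanly this way depends on how it was written.

```r
library(ncdf4)

path <- tempfile(fileext = ".nc")

# Define a 2 x 3 variable; all names and units here are illustrative.
xdim <- ncdim_def("x", units = "index", vals = 1:2)
ydim <- ncdim_def("y", units = "index", vals = 1:3)
var  <- ncvar_def("testvar", units = "unitless",
                  dim = list(xdim, ydim), prec = "double")

# force_v4 = TRUE asks for the netCDF-4 format, which uses HDF5 as storage.
nc <- nc_create(path, var, force_v4 = TRUE)
ncvar_put(nc, var, matrix(as.numeric(1:6), nrow = 2))
nc_close(nc)

# Reopen the file, list its variables, and read the matrix back.
nc <- nc_open(path)
names(nc$var)
vals <- ncvar_get(nc, "testvar")
nc_close(nc)

str(vals)
```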
