How to deal with hdf5 files in R?
I have a file in hdf5 format. I know that it is supposed to be a matrix, but I want to read that matrix in R so that I can study it. I see that there is an h5r package that is supposed to help with this, but I do not see any simple, easy-to-understand tutorial. Is such a tutorial available online? Specifically, how do you read an hdf5 object with this package, and how do you actually extract the matrix?
UPDATE
I found a package, rhdf5, which is not part of CRAN but is part of Bioconductor. The interface is relatively easy to understand, and the documentation and example code are quite clear. I could use it without problems. My problem, it seems, was the input file. The matrix that I wanted to read was actually stored in the hdf5 file as a Python pickle, so every time I tried to open it and access it through R I got a segmentation fault. I did figure out how to save the matrix from within Python as a tsv file, and now that problem is solved.
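Once the matrix has been exported as a tsv file, base R is enough to load it. The following is a minimal sketch, assuming a tab-separated numeric file without a header row; the temporary file and its contents are stand-ins, since the question does not show how the tsv was written.

```r
# Stand-in for the tsv exported from Python (hypothetical contents).
tsv_path <- tempfile(fileext = ".tsv")
write.table(matrix(1:6, nrow = 2), tsv_path, sep = "\t",
            row.names = FALSE, col.names = FALSE)

# Read the tab-separated values, then convert the data frame to a matrix.
mat <- as.matrix(read.table(tsv_path, sep = "\t", header = FALSE))
dimnames(mat) <- NULL  # drop the default V1, V2, ... column names

str(mat)
```

If the real file has a header row or row labels, the header and row.names arguments of read.table would need adjusting.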
The rhdf5 package works really well, although it is not in CRAN. Install it from Bioconductor:
# as of 2020-09-08, these are the updated instructions per
# https://bioconductor.org/install/
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.11")
# install the rhdf5 package itself
BiocManager::install("rhdf5")
And to use it:
library(rhdf5)
List the objects within the file to find the data group you want to read:
h5ls("path/to/file.h5")
Read the HDF5 data:
mydata <- h5read("path/to/file.h5", "/mygroup/mydata")
And inspect the structure:
str(mydata)
(Note that multidimensional arrays may appear transposed.) You can also read groups, which are returned as named lists in R.
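The steps above can be tried end to end without an existing file. This is a minimal sketch, assuming rhdf5 is installed; the temporary file and the names mygroup and mydata are placeholders chosen to match the paths used above.

```r
library(rhdf5)

# Create a small HDF5 file to experiment with (hypothetical names).
path <- tempfile(fileext = ".h5")
h5createFile(path)
h5createGroup(path, "mygroup")
m <- matrix(1:6, nrow = 2)
h5write(m, path, "mygroup/mydata")

# List the contents, then read a single dataset and a whole group.
h5ls(path)
mydata <- h5read(path, "/mygroup/mydata")
mygroup <- h5read(path, "/mygroup")  # a named list with one element per dataset

str(mydata)
names(mygroup)

# Release any HDF5 handles that are still open.
h5closeAll()
```

A matrix written and read back by rhdf5 keeps its orientation. The transposition mentioned above is usually seen with files written by row-major tools such as Python or C, in which case t() or aperm() restores the expected layout.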
You could also use h5, a package which I recently published on CRAN. Compared to rhdf5, its features include R-like subsetting operators for datasets, which allow commands such as readdata <- dataset[1:3, 1:3] and dataset[1:3, 1:3] <- matrix(1:9, nrow = 3).
To save a matrix you could use:
library(h5)
testmat <- matrix(rnorm(120), ncol = 3)
# Create HDF5 File
file <- h5file("test.h5")
# Save matrix to file in group 'testgroup' and datasetname 'testmat'
file["testgroup", "testmat"] <- testmat
# Close file
h5close(file)
... and read the entire matrix back into R:
file <- h5file("test.h5")
testmat_in <- file["testgroup", "testmat"][]
h5close(file)
See also h5 on
I used the rgdal package to read HDF5 files. You do need to take care, because the binary version of rgdal probably does not support hdf5. In that case, you need to build gdal from source with HDF5 support before building rgdal from source.
Alternatively, try to convert the files from hdf5 to netcdf. Once they are in netcdf, you can use the excellent ncdf package to access the data. I think the conversion could be done with the cdo tool.
The ncdf4 package, an interface to netCDF-4, can also be used to read hdf5 files (netCDF-4 is compatible with netCDF-3, but it uses hdf5 as the storage layer).
In the developer's words:
NetCDF-4 combines the netCDF-3 and HDF5 data models, taking the desirable characteristics of each, while taking advantage of their separate strengths.
The netCDF-4 format implements and expands the netCDF-3 data model by using an enhanced version of HDF5 as the storage layer.
In practice, ncdf4 provides a simple interface, and migrating code from using the older hdf5 and ncdf packages to a single ncdf4 package has made our code less buggy and easier to write (some of my trials and workarounds are documented in my previous answer).
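For reference, here is a minimal sketch of reading a matrix with ncdf4, assuming the package is installed against a netCDF library built with HDF5 support. The file is created here only so the example is self-contained; the dimension names x and y and the variable name mydata are placeholders, and not every HDF5 file is guaranteed to open this way.

```r
library(ncdf4)

# Write a small netCDF-4 (HDF5-backed) file to read back (hypothetical names).
path <- tempfile(fileext = ".nc")
xdim <- ncdim_def("x", units = "", vals = 1:2)
ydim <- ncdim_def("y", units = "", vals = 1:3)
var  <- ncvar_def("mydata", units = "", dim = list(xdim, ydim), prec = "double")
nc   <- nc_create(path, var, force_v4 = TRUE)
ncvar_put(nc, var, matrix(as.numeric(1:6), nrow = 2))
nc_close(nc)

# Open the file, list the available variables, and extract one as a matrix.
nc <- nc_open(path)
names(nc$var)
mat <- ncvar_get(nc, "mydata")
nc_close(nc)

str(mat)
```

For a file from another source, printing the object returned by nc_open() shows the variable names and dimensions to pass to ncvar_get().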