
Appending to h5 files

I have an h5 file which contains a dataset like this:

col1       col2       col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2

I have another h5 file with the same columns:

col1       col2       col3
 6           1          9
 8           2          7

and I would like to concatenate these two to get the following h5 file:

col1       col2       col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2
 6           1          9
 8           2          7

What is the most efficient way to do this if the files are huge or if we have many of these merges?

I'm not familiar with pandas, so I can't help there. This can be done with h5py or pytables. As @hpaulj mentioned, the process reads the dataset into a numpy array, then writes it to an HDF5 dataset with h5py. The exact process depends on the dataset's maxshape attribute, which controls whether the dataset can be resized.
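
To make the role of maxshape concrete, here is a minimal sketch that checks whether a dataset can grow along its first axis. The helper name can_grow_rows, the file name maxshape_demo.h5, and the dataset names are assumptions made for illustration; they are not part of the example files used in this answer.

```python
import os
import tempfile

import h5py
import numpy as np


def can_grow_rows(dset):
    """Return True if axis 0 of the dataset can be resized past its current length."""
    max_rows = dset.maxshape[0]
    # None means "unlimited"; otherwise the limit must exceed the current size.
    return max_rows is None or max_rows > dset.shape[0]


# Hypothetical demo file, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "maxshape_demo.h5")
data = np.arange(15).reshape(5, 3)

with h5py.File(demo_path, "w") as h5f:
    # Without maxshape=, the dataset is fixed at its creation shape.
    fixed = h5f.create_dataset("fixed", data=data)
    # With maxshape=(None, 3), the first axis is unlimited.
    growable = h5f.create_dataset("growable", data=data, maxshape=(None, 3))
    fixed_ok = can_grow_rows(fixed)
    growable_ok = can_grow_rows(growable)
    print(fixed.maxshape, fixed_ok)
    print(growable.maxshape, growable_ok)
```

If the check returns False, the data has to be copied into a new dataset; if it returns True, the dataset can be resized and extended in place.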

I created examples to show both methods (fixed-size or resizable dataset). The first method creates a new file3 that combines the values from file1 and file2. The second method adds the values from file2 to file1e (which is resizable). Note: the code to create the files used in the examples is at the end.

I have a longer answer on SO that shows all the ways to copy data.
See this answer: How can I combine multiple .h5 file?

Method 1: Combine datasets into a new file
Required when the datasets were not created with the maxshape= parameter

import h5py

with h5py.File('file1.h5', 'r') as h5f1, \
     h5py.File('file2.h5', 'r') as h5f2, \
     h5py.File('file3.h5', 'w') as h5f3:

    ds1 = h5f1['ds_1']
    ds2 = h5f2['ds_2']
    print(ds1.shape, ds1.maxshape)
    print(ds2.shape, ds2.maxshape)

    # Row counts of the two inputs and of the combined dataset
    arr1_a0 = ds1.shape[0]
    arr2_a0 = ds2.shape[0]
    arr3_a0 = arr1_a0 + arr2_a0

    # Create the output with an unlimited first axis so it can grow later
    ds3 = h5f3.create_dataset('ds_3', dtype=ds1.dtype,
                              shape=(arr3_a0, 3), maxshape=(None, 3))

    # Read each input into memory and write it to its row range
    ds3[0:arr1_a0, :] = ds1[:]
    ds3[arr1_a0:arr3_a0, :] = ds2[:]

    print(ds3.shape, ds3.maxshape)
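
Method 1 reads each input dataset into memory in one piece. For huge files, one possible variation is to copy a block of rows at a time. The following is a sketch under assumptions, not a benchmarked solution: the helper copy_rows_in_blocks, the block_rows parameter, and the temporary file names a.h5, b.h5, and merged.h5 are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def copy_rows_in_blocks(src, dst, dst_start, block_rows=1024):
    """Copy every row of src into dst, starting at row dst_start.

    Only block_rows rows are held in memory at a time.
    Returns the index of the next free row in dst.
    """
    n_rows = src.shape[0]
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        dst[dst_start + start:dst_start + stop, ...] = src[start:stop, ...]
    return dst_start + n_rows


# Hypothetical inputs that reuse the example values from this answer.
workdir = tempfile.mkdtemp()
arr1 = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
arr2 = np.array([[6, 1, 9], [8, 2, 7]])
paths = [os.path.join(workdir, name) for name in ("a.h5", "b.h5", "merged.h5")]

with h5py.File(paths[0], "w") as h5f:
    h5f.create_dataset("ds_1", data=arr1)
with h5py.File(paths[1], "w") as h5f:
    h5f.create_dataset("ds_2", data=arr2)

with h5py.File(paths[0], "r") as h5f1, \
     h5py.File(paths[1], "r") as h5f2, \
     h5py.File(paths[2], "w") as h5f3:
    ds1, ds2 = h5f1["ds_1"], h5f2["ds_2"]
    total_rows = ds1.shape[0] + ds2.shape[0]
    ds3 = h5f3.create_dataset("ds_3", dtype=ds1.dtype,
                              shape=(total_rows,) + ds1.shape[1:],
                              maxshape=(None,) + ds1.shape[1:])
    # A tiny block size is used here only to exercise the loop.
    next_row = copy_rows_in_blocks(ds1, ds3, 0, block_rows=2)
    next_row = copy_rows_in_blocks(ds2, ds3, next_row, block_rows=2)
    merged = ds3[:]

print(merged)
```

A suitable block size depends on the row width and the available memory, so it is left as a parameter here.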

Method 2: Append the file2 dataset to the file1e dataset
The dataset in file1e must be created with the maxshape= parameter

import h5py

with h5py.File('file1e.h5', 'r+') as h5f1, \
     h5py.File('file2.h5', 'r') as h5f2:

    ds1e = h5f1['ds_1e']
    ds2 = h5f2['ds_2']
    print(ds1e.shape, ds1e.maxshape)
    print(ds2.shape, ds2.maxshape)

    # Current row count, rows to append, and the new total
    arr1_a0 = ds1e.shape[0]
    arr2_a0 = ds2.shape[0]
    arr3_a0 = arr1_a0 + arr2_a0

    # Grow the first axis, then write the new rows into the added space
    ds1e.resize(arr3_a0, axis=0)
    ds1e[arr1_a0:arr3_a0, :] = ds2[:]

    print(ds1e.shape, ds1e.maxshape)
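
When there are many merges, Method 2 can be repeated for each input file. The sketch below is one possible way to do that: it counts the incoming rows first so the target is resized only once. The function append_many and the file and dataset names in the demo (target.h5, part0.h5, part1.h5, ds) are assumptions for illustration, and the code assumes 2-D datasets with matching column counts.

```python
import os
import tempfile

import h5py
import numpy as np


def append_many(target_path, target_name, sources):
    """Append each (path, dataset_name) in sources to a resizable target dataset."""
    # First pass: count incoming rows so the target is resized only once.
    extra_rows = 0
    for path, name in sources:
        with h5py.File(path, "r") as h5s:
            extra_rows += h5s[name].shape[0]

    with h5py.File(target_path, "r+") as h5t:
        target = h5t[target_name]
        write_at = target.shape[0]
        target.resize(write_at + extra_rows, axis=0)
        # Second pass: write each source into its own row range.
        for path, name in sources:
            with h5py.File(path, "r") as h5s:
                n_rows = h5s[name].shape[0]
                target[write_at:write_at + n_rows, :] = h5s[name][:]
                write_at += n_rows
        return target.shape


# Hypothetical demo data: one resizable target and two files to append.
workdir = tempfile.mkdtemp()
base = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
parts = [np.array([[6, 1, 9], [8, 2, 7]]), np.array([[4, 4, 4], [3, 3, 3]])]

target_path = os.path.join(workdir, "target.h5")
with h5py.File(target_path, "w") as h5f:
    h5f.create_dataset("ds_1e", data=base, maxshape=(None, 3))

sources = []
for i, part in enumerate(parts):
    part_path = os.path.join(workdir, f"part{i}.h5")
    with h5py.File(part_path, "w") as h5f:
        h5f.create_dataset("ds", data=part)
    sources.append((part_path, "ds"))

final_shape = append_many(target_path, "ds_1e", sources)
with h5py.File(target_path, "r") as h5f:
    combined = h5f["ds_1e"][:]
print(final_shape)
```

Each source is still read into memory whole, so for very large inputs the write step could be combined with a block-wise copy.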

Code to create the example files used above:

import h5py
import numpy as np

arr1 = np.array([[1, 3, 5],
                 [5, 4, 9],
                 [6, 8, 0],
                 [7, 2, 5],
                 [2, 1, 2]])

# file1: fixed-size dataset (maxshape equals shape)
with h5py.File('file1.h5', 'w') as h5f:
    h5f.create_dataset('ds_1', data=arr1)
    print(h5f['ds_1'].maxshape)

# file1e: same data, but resizable along the first axis
with h5py.File('file1e.h5', 'w') as h5f:
    h5f.create_dataset('ds_1e', data=arr1, shape=(5, 3), maxshape=(None, 3))
    print(h5f['ds_1e'].maxshape)

arr2 = np.array([[6, 1, 9],
                 [8, 2, 7]])

# file2: the rows to be appended
with h5py.File('file2.h5', 'w') as h5f:
    h5f.create_dataset('ds_2', data=arr2)
