
Xarray to merge two HDF5 files with different dimension lengths

I have some instrumental data saved in HDF5 format as multiple 2-D arrays along with the measuring time. As the attached figures below show, d1 and d2 are two independent files that the instrument recorded at different times. They have the same data variables; the only difference is the length of phony_dim_0 , which represents the total number of data points and varies with the measurement time.

[screenshot: structure of dataset d1]

[screenshot: structure of dataset d2]

These files need to be loaded into specific software provided by the instrument manufacturer to obtain meaningful results. I want to merge multiple files with Python xarray while keeping their original format, and then load the single merged file into the software.

Here is my attempt:

import os
import numpy as np
import xarray

files = [os.path.join("DATA_PATH", f) for f in os.listdir("DATA_PATH")]
d1 = xarray.open_dataset(files[0])
d2 = xarray.open_dataset(files[1])

## Copy a new one to save the merged data arrays.
## (Note: this binds d0 to the same object as d1; it does not copy the data.)
d0 = d1

vars_ = [c for c in d1]
for var in vars_:
    d0[var].values = np.vstack([d1[var], d2[var]])

The error shows like this:

replacement data must match the Variable's shape. replacement data has shape (761, 200); Variable has shape (441, 200)

I thought of two solutions to this problem:

  1. expanding the dimension length to the total length of all merged files.
  2. creating a new empty dataset in the same format as d1 and d2.

However, I still could not figure out which functions would achieve that. Any comments or suggestions would be appreciated.
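For reference, xarray itself can do this kind of merge with xarray.concat, which stacks datasets along a named dimension (here phony_dim_0) instead of overwriting values in place, so the shape-mismatch error never arises. A minimal sketch with toy datasets (the variable name "signal" and the sizes 441/761 are made up for illustration); whether the written file round-trips through the vendor software is something you would need to verify:

```python
import numpy as np
import xarray

# Two toy datasets that mimic d1 and d2: same data variables,
# different lengths along phony_dim_0.
d1 = xarray.Dataset(
    {"signal": (("phony_dim_0", "phony_dim_1"), np.zeros((441, 200)))})
d2 = xarray.Dataset(
    {"signal": (("phony_dim_0", "phony_dim_1"), np.ones((761, 200)))})

# Stack along phony_dim_0; every shared data variable is concatenated.
d0 = xarray.concat([d1, d2], dim="phony_dim_0")
print(d0.sizes["phony_dim_0"])  # 441 + 761 = 1202

# To write back out you could use d0.to_netcdf("merged_xr.h5"), but note
# the result is a netCDF4/HDF5 file, which may or may not match the
# exact layout the vendor software expects.
```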

Supplemental information

dataset example [d1] , [d2]

I'm not familiar with xarray, so I can't help with your code. However, you don't need xarray to copy HDF5 data; h5py is designed to work nicely with HDF5 data as NumPy arrays, and it is all you need to merge the data.

A note about Xarray: it uses different nomenclature than HDF5 and h5py. Xarray refers to the files as 'datasets' and calls the HDF5 datasets 'data variables'. HDF5/h5py nomenclature is more frequently used, so I will use it for the rest of this post.

There are some things to consider when merging datasets across 2 or more HDF5 files:

  1. Consistency of the data schema (which you have checked).
  2. Consistency of attributes. If datasets have different attribute names or values, the merge process gets a lot more complicated! (Yours appear to be consistent.)
  3. It's preferable to create resizable datasets in the merged file. This simplifies the process, as you don't need to know the total size when you initially create the dataset. Better yet, you can add more data later (if/when you have more files).
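If your source datasets were not already resizable, you could still build a merged file that supports this pattern by creating each dataset with an unlimited first axis via maxshape. A small self-contained sketch (the dataset name "raw" and the sizes are made up for illustration):

```python
import h5py
import numpy as np

with h5py.File("resizable_demo.h5", "w") as h5f:
    # maxshape=(None, 200) makes axis 0 unlimited, so the dataset
    # can grow later with .resize().
    dset = h5f.create_dataset("raw", data=np.zeros((441, 200)),
                              maxshape=(None, 200))
    # Append 761 more rows, as the merge loop below does:
    dset.resize(441 + 761, axis=0)
    dset[441:] = np.ones((761, 200))

with h5py.File("resizable_demo.h5", "r") as h5f:
    print(h5f["raw"].shape, h5f["raw"].maxshape)  # (1202, 200) (None, 200)
```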

I looked at your files. You have 8 HDF5 datasets in each file. One nice thing: the datasets are resizable. That simplifies the merge process. Also, although your datasets have a lot of attributes, they appear to be common to both files. That also simplifies the process.

The code below goes through the following steps to merge the data.

  1. Open the new merge file for writing.
  2. Open the first data file (read-only).
  3. Loop thru all datasets:
    a. use the group copy function to copy each dataset (data plus maxshape parameter, and attribute names and values).
  4. Open the second data file (read-only).
  5. Loop thru all datasets and do the following:
    a. get the sizes of the 2 datasets (existing and to be added)
    b. increase the size of the HDF5 dataset with the .resize() method
    c. write the values from the dataset to the end of the existing dataset
  6. At the end, loop thru all 3 files and print shape and maxshape for all datasets (for visual comparison).

Code below:

import h5py

files = [ '211008_778183_m.h5', '211008_778624_m.h5', 'merged_.h5' ]

# Create the merge file:
with h5py.File(files[2], 'w') as h5fw:

    # Open first HDF5 file and copy each dataset.
    # copy() preserves the maxshape parameters and the
    # attribute names and values of the existing datasets.
    with h5py.File(files[0], 'r') as h5fr:
        for ds in h5fr.keys():
            h5fw.copy(h5fr[ds], h5fw, name=ds)

    # Open second HDF5 file and copy data from each dataset.
    # Resizes the existing dataset as needed to hold the new data.
    with h5py.File(files[1], 'r') as h5fr:
        for ds in h5fr.keys():
            ds_a0 = h5fw[ds].shape[0]
            add_a0 = h5fr[ds].shape[0]
            h5fw[ds].resize(ds_a0 + add_a0, axis=0)
            h5fw[ds][ds_a0:] = h5fr[ds][:]

# Print shape and maxshape of every dataset in all 3 files for comparison.
for fname in files:
    print(f'Working on file: {fname}')
    with h5py.File(fname, 'r') as h5f:
        for ds, h5obj in h5f.items():
            print(f'for: {ds}; shape={h5obj.shape}, maxshape={h5obj.maxshape}')
