h5py, enums, and VirtualLayout

I am attempting to concatenate a bunch of h5 data files into a single larger file to feed into my program. I am doing this with h5py.VirtualLayout. However, I am struggling to preserve the dataset types in the merged file, because some datasets in the files are bools. Using h5dump, I can see that the format is:

 DATASET "mask" { DATATYPE H5T_ENUM { H5T_STD_I8LE; "FALSE" 0; "TRUE" 1; }

In the original data generation, they are defined as

mask_set = source_group.create_dataset("mask", data=np.array(mask_list, dtype='bool'))

But after concatenation, they end up as:

 DATASET "mask" { DATATYPE H5T_IEEE_F64LE

or similar, depending on what I pass as dtype to h5py.VirtualLayout. The above is using dtype=None, but I continue to have problems specifying int, np.int, etc.

If I try to pass dtype defined with a line like:

dtype = 'bool'

or

dtype = h5py.enum_dtype({"FALSE":0, "TRUE":1}, basetype=np.int8)

I get an error:

ValueError: Unable to create dataset (no appropriate function for conversion path)

So my question is: how do I preserve the enum type when using VirtualLayout?

You didn't post the code that creates the Virtual Layout, so it's hard to diagnose what you are doing wrong. I suspect it's something in that code segment. You should be able to do this. To demonstrate, I adapted the h5py example vds_simple.py to create a Virtual Layout of boolean values from 4 HDF5 files/datasets of booleans.

The example is self-contained. It creates 4 'source' HDF5 files (#_bool.h5), each with a 1D dataset holding one 10-element row of a (4,10) array of booleans. It then creates a separate file (VDS_bool.h5) with a single 4x10 virtual dataset that exposes the 4 sources as 1 dataset. In addition, the original random boolean array is written to the same file as a normal dataset for comparison. The output shows the result of an equality test on the 2 datasets, followed by the dataset contents.

Code below:

import h5py
import numpy as np

# Array for random sampling
sample_arr = [True, False]
bool_arr = np.random.choice(sample_arr, size=40).reshape(4,10)

# Create source files (0_bool.h5 to 3_bool.h5)
for n in range(4):
    with h5py.File(f"{n}_bool.h5", "w") as f:
        d = f.create_dataset("bdata", data=bool_arr[n])

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 10), dtype='bool')
for n in range(4):
    filename = "{}_bool.h5".format(n)
    vsource = h5py.VirtualSource(filename, "bdata", shape=(10,))
    layout[n] = vsource

# Add virtual dataset to output file
with h5py.File("VDS_bool.h5", "w") as f:
    f.create_virtual_dataset("vdata", layout)
    f.create_dataset("bdata", data=bool_arr)

# read data back
# virtual dataset is transparent for reader!
with h5py.File("VDS_bool.h5", "r") as f:
    print("Virtual vs Normal dataset Equality test:")
    print(np.array_equal(f["vdata"][:],f["bdata"][:]),"\n")
    print("Virtual boolean dataset:")
    print(f["vdata"][:])
    print("Normal boolean dataset:")
    print(f["bdata"][:])
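
If you would rather not hard-code dtype='bool', one option is to read the dtype and shape from the first source file and pass them to the layout, so the virtual dataset always matches its sources. The following is only a sketch under assumptions: the helper name build_bool_vds, the temporary directory, and the fixed random seed are mine, not part of your setup, and it assumes every source file has a dataset named "bdata" with the same shape and dtype.

```python
import os
import tempfile

import h5py
import numpy as np


def build_bool_vds(workdir, n_files=4, n_cols=10):
    """Hypothetical helper: write n_files boolean sources, then one VDS over them."""
    rng = np.random.default_rng(0)
    data = rng.random((n_files, n_cols)) > 0.5  # boolean array, shape (n_files, n_cols)

    # Write one source file per row (absolute paths, so lookup does not depend on cwd)
    paths = []
    for n in range(n_files):
        path = os.path.join(workdir, f"{n}_bool.h5")
        with h5py.File(path, "w") as f:
            f.create_dataset("bdata", data=data[n])
        paths.append(path)

    # Read dtype and shape from the first source instead of hard-coding them
    with h5py.File(paths[0], "r") as f:
        src_dtype = f["bdata"].dtype
        src_shape = f["bdata"].shape

    # One layout row per source file
    layout = h5py.VirtualLayout(shape=(n_files,) + src_shape, dtype=src_dtype)
    for n, path in enumerate(paths):
        layout[n] = h5py.VirtualSource(path, "bdata", shape=src_shape)

    vds_path = os.path.join(workdir, "VDS_bool.h5")
    with h5py.File(vds_path, "w") as f:
        f.create_virtual_dataset("vdata", layout)
    return vds_path, data


with tempfile.TemporaryDirectory() as workdir:
    vds_path, expected = build_bool_vds(workdir)
    with h5py.File(vds_path, "r") as f:
        merged = f["vdata"][:]
        merged_dtype = f["vdata"].dtype

print("dtype of virtual dataset:", merged_dtype)
print("matches source data:", np.array_equal(merged, expected))
```

For your files, the same idea would apply to the "mask" dataset inside its group: open one source, read its dtype, and reuse it for the layout.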

For reference, here are results from h5dump for the datasets. Both virtual and normal dataset datatypes are the same, but slightly different from yours:

DATATYPE  H5T_ENUM {
   H5T_STD_I8LE;
   "FALSE"            0;
   "TRUE"             1;
}

h5dump -H 0_bool.h5

HDF5 "0_bool.h5" {
GROUP "/" {
   DATASET "bdata" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
   }
}
}

h5dump -H VDS_bool.h5

HDF5 "VDS_bool.h5" {
GROUP "/" {
   DATASET "bdata" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SIMPLE { ( 4, 10 ) / ( 4, 10 ) }
   }
   DATASET "vdata" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SIMPLE { ( 4, 10 ) / ( 4, 10 ) }
   }
}
}
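
If h5dump is not handy, the stored type can also be checked from Python through h5py's low-level interface. This is a small sketch, assuming a throwaway file named mask_demo.h5 (my name, not one of your files) written to a temporary directory; it reads back the enum members of a boolean dataset.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "mask_demo.h5")  # hypothetical demo file

    # Write a boolean dataset the same way as in the question
    with h5py.File(path, "w") as f:
        f.create_dataset("mask", data=np.array([True, False, True], dtype="bool"))

    with h5py.File(path, "r") as f:
        dset = f["mask"]
        numpy_dtype = dset.dtype      # dtype as h5py presents it to NumPy
        tid = dset.id.get_type()      # low-level HDF5 datatype object
        is_enum = isinstance(tid, h5py.h5t.TypeEnumID)

        # Collect the enum member names and values stored in the file
        members = {}
        if is_enum:
            for i in range(tid.get_nmembers()):
                name = tid.get_member_name(i)
                if isinstance(name, bytes):
                    name = name.decode()
                members[name] = int(tid.get_member_value(i))

print("NumPy dtype:", numpy_dtype)
print("Stored as HDF5 enum:", is_enum)
print("Enum members:", members)
```

Running the same check on "vdata" in the merged file should tell you whether the virtual dataset kept the enum type or fell back to a float or integer type.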
