I am attempting to concatenate a bunch of h5 data files into a single larger file to feed into my program. I am doing this with h5py.VirtualLayout. However, I am struggling to maintain the format in the merged file, because some datasets in the file are bools. Using h5dump, I can see that the format is:
DATASET "mask" { DATATYPE H5T_ENUM { H5T_STD_I8LE; "FALSE" 0; "TRUE" 1; }
In the original data generation, they are defined as
mask_set = source_group.create_dataset("mask", data=np.array(mask_list, dtype='bool'))
But after concatenation, they end up as:
DATASET "mask" { DATATYPE H5T_IEEE_F64LE
or similar, depending on what I pass as dtype to h5py.VirtualLayout. The above is using dtype=None, but I continue to have problems specifying int, np.int, etc.
If I try to pass dtype defined with a line like:
dtype = 'bool'
or
dtype = h5py.enum_dtype({"FALSE":0, "TRUE":1}, basetype=np.int8)
I get an error:
ValueError: Unable to create dataset (no appropriate function for conversion path)
So my question is: how do I preserve the enum type when using VirtualLayout?
You didn't post the code that creates the virtual layout, so it's hard to diagnose what is going wrong; I suspect the problem is in that code segment. You should be able to do this. To demonstrate, I adapted the h5py example vds_simple.py to create a virtual layout of boolean values from 4 HDF5 files/datasets of booleans.
The example is self-contained. It creates 4 'source' HDF5 files (#_bool.h5, where # is 0 to 3), each with a 1D dataset containing one 10-element row of a (4,10) array of booleans. It then creates a separate file (VDS_bool.h5) with a single 4x10 virtual dataset that exposes the 4 sources as 1 dataset. In addition, the original random boolean array is added as a normal dataset for comparison. The output shows the result of an equality test on the 2 datasets, plus the dataset contents.
Code below:
import h5py
import numpy as np

# Array for random sampling
sample_arr = [True, False]
bool_arr = np.random.choice(sample_arr, size=40).reshape(4, 10)

# Create source files (0_bool.h5 to 3_bool.h5)
for n in range(4):
    with h5py.File(f"{n}_bool.h5", "w") as f:
        d = f.create_dataset("bdata", data=bool_arr[n])

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 10), dtype='bool')
for n in range(4):
    filename = "{}_bool.h5".format(n)
    vsource = h5py.VirtualSource(filename, "bdata", shape=(10,))
    layout[n] = vsource

# Add virtual dataset to output file
with h5py.File("VDS_bool.h5", "w") as f:
    f.create_virtual_dataset("vdata", layout)
    f.create_dataset("bdata", data=bool_arr)

# Read data back
# The virtual dataset is transparent for the reader!
with h5py.File("VDS_bool.h5", "r") as f:
    print("Virtual vs Normal dataset Equality test:")
    print(np.array_equal(f["vdata"][:], f["bdata"][:]), "\n")
    print("Virtual boolean dataset:")
    print(f["vdata"][:])
    print("Normal boolean dataset:")
    print(f["bdata"][:])
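If the dtype of the source datasets is not known in advance, one option is to read it from the first source file and pass it on to the layout, rather than typing it by hand. The following is only a minimal sketch under stated assumptions: the file names (part_0.h5 and so on), the temporary directory and the dataset name "mask" are placeholders chosen for illustration, and every source is assumed to have the same shape and dtype.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical source files in a scratch directory (names are placeholders)
workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, f"part_{n}.h5") for n in range(3)]

# Write small source files, each holding a boolean "mask" dataset
for n, path in enumerate(paths):
    with h5py.File(path, "w") as f:
        f.create_dataset("mask", data=(np.arange(5) % (n + 2) == 0))

# Read shape and dtype from the first source instead of guessing them
with h5py.File(paths[0], "r") as f:
    src_shape = f["mask"].shape
    src_dtype = f["mask"].dtype

# One row of the layout per source file
layout = h5py.VirtualLayout(shape=(len(paths),) + src_shape, dtype=src_dtype)
for n, path in enumerate(paths):
    layout[n] = h5py.VirtualSource(path, "mask", shape=src_shape)

merged_path = os.path.join(workdir, "merged.h5")
with h5py.File(merged_path, "w") as f:
    f.create_virtual_dataset("mask", layout)

# Read the merged dataset back and keep its dtype for inspection
with h5py.File(merged_path, "r") as f:
    merged = f["mask"][:]
    merged_dtype = f["mask"].dtype

print(merged_dtype, merged.shape)
```

Because the dtype is copied from a source dataset, the layout and the sources should agree by construction, which avoids a mismatch between a hand-typed dtype and what is actually stored.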
For reference, here are results from h5dump for the datasets. Both virtual and normal dataset datatypes are the same, but slightly different from yours:
DATATYPE H5T_ENUM {
H5T_STD_I8LE;
h5dump -H 0_bool.h5
HDF5 "0_bool.h5" {
GROUP "/" {
   DATASET "bdata" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SIMPLE { ( 10 ) / ( 10 ) }
   }
}
}
h5dump -H VDS_bool.h5
HDF5 "VDS_bool.h5" {
GROUP "/" {
   DATASET "bdata" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SIMPLE { ( 4, 10 ) / ( 4, 10 ) }
   }
   DATASET "vdata" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I8LE;
         "FALSE"            0;
         "TRUE"             1;
      }
      DATASPACE  SIMPLE { ( 4, 10 ) / ( 4, 10 ) }
   }
}
}
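The same check can be done from Python without h5dump. The sketch below is an illustration only: it builds its own tiny source and virtual files in a temporary directory (the names src.h5 and vds.h5 are placeholders), then inspects the virtual dataset through h5py. It assumes an h5py version that provides Dataset.is_virtual, and it uses the low-level type API to confirm that the stored HDF5 type class is an enum.

```python
import os
import tempfile

import h5py
import numpy as np

# Placeholder file names in a scratch directory
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "src.h5")
vds_path = os.path.join(workdir, "vds.h5")

# One boolean source dataset
with h5py.File(src_path, "w") as f:
    f.create_dataset("bdata", data=np.array([True, False, True]))

# Virtual layout with a single row mapped to that source
layout = h5py.VirtualLayout(shape=(1, 3), dtype="bool")
layout[0] = h5py.VirtualSource(src_path, "bdata", shape=(3,))

with h5py.File(vds_path, "w") as f:
    f.create_virtual_dataset("vdata", layout)

with h5py.File(vds_path, "r") as f:
    dset = f["vdata"]
    is_virtual = dset.is_virtual                     # True for a virtual dataset
    numpy_dtype = dset.dtype                         # dtype as seen through NumPy
    h5_type_class = dset.id.get_type().get_class()   # HDF5 type class on disk
    values = dset[:]

print(is_virtual, numpy_dtype, h5_type_class == h5py.h5t.ENUM)
```

h5py presents the boolean enum as a NumPy bool dtype, so comparing the type class against h5py.h5t.ENUM is the closer equivalent of reading the DATATYPE line in the h5dump output.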