Resizing and storing a dataset in .h5 format using h5py in Python

I am trying to resize a dataset and store new values using the h5py package in Python. My dataset grows at every time step, and I would like to append to the .h5 file using the resize function. However, my approach produces errors. The variable dset is a list of datasets.

import os
import h5py
import numpy as np

path = './out.h5'
os.remove(path)

def create_h5py(path):
    with h5py.File(path, "a") as hf:
        grp = hf.create_group('left')
        dset = []
        dset.append(grp.create_dataset('voltage', (10**4,3), maxshape=(None,3), dtype='f', chunks=(10**4,3)))
        dset.append(grp.create_dataset('current', (10**4,3), maxshape=(None,3), dtype='f', chunks=(10**4,3)))
        return dset

if __name__ == '__main__':
    dset = create_h5py(path)
    for i in range(3):

        if i == 0:
            dset[0][:] = np.random.random(dset[0].shape) 
            dset[1][:] = np.random.random(dset[1].shape)
        else:
            dset[0].resize(dset[0].shape[0]+10**4, axis=0)
            dset[0][-10**4:] = np.random.random((10**4,3))
            dset[1].resize(dset[1].shape[0]+10**4, axis=0)
            dset[1][-10**4:] = np.random.random((10**4,3))

EDIT

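Thanks to tel, I was able to solve this. In create_h5py, replace the context-manager line with h5py.File(path, "a") as hf: by the plain assignment hf = h5py.File(path, "a"), so that the file is still open when the datasets are used.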

The problem

Not sure about the rest of your code, but you can't use the context manager pattern (i.e. with h5py.File(foo) as bar:) within a function that returns a dataset. As you point out in the comment under your question, the HDF5 file will already have been closed by the time you try to access the dataset. Dataset objects in h5py are like live views into the file, so they require that the file remain open for as long as you use them. That is why you're getting errors.
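To make the failure mode concrete, here is a minimal sketch. The function name make_dataset_closed, the temporary file, and the tiny (4, 3) shape are assumptions chosen for illustration. It relies on the fact that an h5py object evaluates as False once the file behind it has been closed.

```python
import os
import tempfile

import h5py


def make_dataset_closed(path):
    # Same pattern as in the question: the `with` block closes the file
    # as soon as the function returns, so the caller gets a dead handle.
    with h5py.File(path, "w") as hf:
        return hf.create_dataset("voltage", (4, 3), maxshape=(None, 3), dtype="f")


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.h5")

    stale = make_dataset_closed(demo_path)
    stale_is_usable = bool(stale)  # the file is closed, so the handle is invalid

    # Reopen the file and keep it open while working with the dataset.
    with h5py.File(demo_path, "a") as hf:
        live = hf["voltage"]
        live_is_usable = bool(live)
        live.resize(live.shape[0] + 4, axis=0)
        grown_shape = live.shape

print(stale_is_usable, live_is_usable, grown_shape)
```

The dataset itself is stored in the file just fine. Only the Python handle returned from the function stops being usable once the file is closed.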

A solution

It's a good idea to always interact with files within a managed context (i.e. within a with clause). If your code throws an error, the context manager will (almost always) ensure that the file is closed. This helps avoid data loss resulting from a crash.

In your case, you can have your cake (encapsulate your dataset creation routines in a separate function) and eat it too (interact with the HDF5 file within a managed context) by writing your own context manager to look after the file for you.

It's actually pretty simple to code. Any Python object that implements the __enter__ and __exit__ methods is a valid context manager. Here's a complete working version:

import os
import h5py
import numpy as np

path = './out.h5'
try:
    os.remove(path)
except OSError: 
    pass

class H5PYManager:
    def __init__(self, path, method='a'):
        self.hf = h5py.File(path, method)

    def __enter__(self):
        # when you call `with H5PYManager(foo) as bar`, the return of this method will be assigned to `bar`
        return self.create_datasets()

    def __exit__(self, exc_type, exc_value, traceback):
        # this method gets called when you exit the `with` clause, including when an error is raised
        # (the first argument is named `exc_type` to avoid shadowing the builtin `type`)
        self.hf.close()

    def create_datasets(self):
        grp = self.hf.create_group('left')
        return [grp.create_dataset('voltage', (10**4,3), maxshape=(None,3), dtype='f', chunks=(10**4,3)),
                grp.create_dataset('current', (10**4,3), maxshape=(None,3), dtype='f', chunks=(10**4,3))]

if __name__ == '__main__':
    with H5PYManager(path) as dset:
        for i in range(3):
            if i == 0:
                dset[0][:] = np.random.random(dset[0].shape) 
                dset[1][:] = np.random.random(dset[1].shape)
            else:
                dset[0].resize(dset[0].shape[0]+10**4, axis=0)
                dset[0][-10**4:] = np.random.random((10**4,3))
                dset[1].resize(dset[1].shape[0]+10**4, axis=0)
                dset[1][-10**4:] = np.random.random((10**4,3))
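If the __enter__ / __exit__ protocol is new to you, the sketch below shows it in isolation, without h5py. ManagedResource is a hypothetical stand-in for an open file and is not part of any library. The point is only that __exit__ runs even when the body of the with clause raises.

```python
class ManagedResource:
    """Hypothetical stand-in for an open file (not an h5py class)."""

    def __init__(self, name):
        self.name = name
        self.is_open = False

    def __enter__(self):
        # Runs on entering the `with` clause; the return value is bound by `as`.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on leaving the `with` clause, whether or not an error was raised.
        self.is_open = False
        return False  # returning False lets any exception propagate


resource = ManagedResource("demo")
try:
    with resource as res:
        opened_inside = res.is_open
        raise RuntimeError("simulated crash")
except RuntimeError:
    pass

closed_after_error = not resource.is_open
print(opened_inside, closed_after_error)
```

H5PYManager above follows the same pattern, with the cleanup step being self.hf.close().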

@tel provided an elegant solution to the problem. I outlined a simpler approach in my comments below his answer. It is simpler for a beginner to code (and understand). Basically, there are a few minor changes to @Maxtron's original code. The modifications are:

  • move with h5py.File(path, "a") as hf: into the __main__ routine
  • pass hf as an argument to create_h5py(hf)
  • I also added a test before os.remove() to avoid errors if the h5 file doesn't exist

My suggested modifications below:

import os
import h5py
import numpy as np

path = './out.h5'
# test existence of H5 file before deleting
if os.path.isfile(path):
    os.remove(path)

def create_h5py(hf):
    grp = hf.create_group('left')
    dset = []
    dset.append(grp.create_dataset('voltage', (10**4,3), maxshape=(None,3), dtype='f', chunks=(10**4,3)))
    dset.append(grp.create_dataset('current', (10**4,3), maxshape=(None,3), dtype='f', chunks=(10**4,3)))
    return dset

if __name__ == '__main__':

    with h5py.File(path, "a") as hf:
        dset = create_h5py(hf)
        for i in range(3):

            if i == 0:
                dset[0][:] = np.random.random(dset[0].shape) 
                dset[1][:] = np.random.random(dset[1].shape)
            else:
                dset[0].resize(dset[0].shape[0]+10**4, axis=0)
                dset[0][-10**4:] = np.random.random((10**4,3))
                dset[1].resize(dset[1].shape[0]+10**4, axis=0)
                dset[1][-10**4:] = np.random.random((10**4,3))
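The resize-then-write pair is repeated for every dataset in the loop above, so it may be worth factoring out. Below is one possible sketch. The helper name append_rows is my own, and the in-memory file (core driver with backing_store=False) and the smaller block size of 1000 rows are assumptions made only to keep the demo self-contained.

```python
import h5py
import numpy as np


def append_rows(dset, rows):
    """Grow `dset` along axis 0 and write `rows` into the newly added space."""
    rows = np.asarray(rows)
    old_len = dset.shape[0]
    dset.resize(old_len + rows.shape[0], axis=0)
    dset[old_len:] = rows
    return dset.shape[0]


# In-memory file: the core driver with backing_store=False writes nothing to disk.
with h5py.File("append_demo.h5", "w", driver="core", backing_store=False) as hf:
    voltage = hf.create_dataset(
        "voltage", (1000, 3), maxshape=(None, 3), dtype="f", chunks=(1000, 3)
    )
    voltage[:] = np.zeros((1000, 3), dtype="f")  # first block fills the initial shape

    for step in (1, 2):
        # Each later block is appended; the value `step` marks which block it is.
        append_rows(voltage, np.full((1000, 3), step, dtype="f"))

    final_shape = voltage.shape
    first_block_value = float(voltage[0, 0])
    last_block_value = float(voltage[-1, 0])

print(final_shape, first_block_value, last_block_value)
```

As in the code above, this only works while the file is open and the dataset was created with maxshape=(None, 3), since otherwise resize() is not allowed along axis 0.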
