
Am I able to convert a directory path into something that can be fed into a Python HDF5 data table?

I was wondering how to convert a string or path into something that can be stored in an HDF5 table. For example, I am returning a numpy image array, a label, and the path to the image from a PyTorch dataloader, where the path to the image looks like this:

('mults/train/0/5678.ndpi/40x/40x-236247-16634-80384-8704.png',)

I basically want to feed it into an HDF5 table like this:

hdf5_file = h5py.File(path, mode='w')
hdf5_file.create_dataset(str(phase) + '_img_paths', (len(dataloaders_dict[phase]),))

I'm not really sure whether what I want to do is feasible, or whether it even makes sense to store such data in a table.

I have tried:

hdf5_file.create_dataset(str(phase) + '_img_paths', (len(dataloaders_dict[phase]),),dtype="S10")

But get this error:

 hdf5_file[str(phase) + '_img_paths'][i] = str(paths40x)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/anaconda3/lib/python3.6/site-packages/h5py/_hl/dataset.py", line 708, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U64')

You have a couple of choices when it comes to saving string data:

  1. You can create a standard dataset in h5py or PyTables and define it with an arbitrarily large string size. This is the simplest method, but it runs the risk that your arbitrarily large string isn't large enough. :)
  2. Alternatively, you can create a variable-length dataset. PyTables calls this dataset type a VLArray, and the atom it uses is Class VLStringAtom(). h5py uses a standard dataset, but the dtype references special_dtype(vlen=str) (note: if you are using h5py 2.10 you can use string_dtype() instead).
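For context, the TypeError in the question arises because a fixed-size "S" dataset stores bytes, and writing a Python str (which numpy sees as a '<U...' unicode dtype) has no conversion path; encoding to bytes first fixes it. Here is a minimal sketch of option 1, using an assumed maximum path length of 100 bytes and a hypothetical filename paths_demo.h5:

```python
import h5py

# hypothetical example paths, in place of the dataloader output
paths = ['mults/train/0/a.png', 'mults/train/1/b.png']

with h5py.File('paths_demo.h5', 'w') as h5f:
    # 'S100' = fixed-size byte strings; pick a size >= your longest path
    ds = h5f.create_dataset('train_img_paths', (len(paths),), dtype='S100')
    for i, p in enumerate(paths):
        # encode str -> bytes; writing a str here raises the
        # "No conversion path for dtype('<U...')" TypeError
        ds[i] = p.encode('utf-8')

with h5py.File('paths_demo.h5', 'r') as h5f:
    # entries come back as bytes (null padding stripped); decode to str
    restored = [s.decode('utf-8') for s in h5f['train_img_paths'][:]]
    print(restored)
```

If any path exceeds the fixed size it is silently truncated, which is why the variable-length approach below is usually the safer choice.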

I created an example that shows how to do this for both PyTables and h5py. It is built around the procedures referenced in your comments. I did not copy all of the code -- just what was necessary to retrieve the file names and shuffle them. Also, the Kaggle dataset I found has a different directory structure, so I modified the cat_dog_train_path variable to match.

from random import shuffle
import glob
shuffle_data = True  # shuffle the addresses before saving
cat_dog_train_path = './PetImages/*/*.jpg'

# read addresses and labels from the 'train' folder
addrs = glob.glob(cat_dog_train_path, recursive=True)
print (len(addrs))
labels = [0 if 'cat' in addr else 1 for addr in addrs]  # 0 = Cat, 1 = Dog

# to shuffle data
if shuffle_data:
    c = list(zip(addrs, labels))
    shuffle(c)
    addrs, labels = zip(*c)

# Divide the data into 10% train only, no validation or test
train_addrs = addrs[0:int(0.1*len(addrs))]
train_labels = labels[0:int(0.1*len(labels))]

print ('Check glob list data:')
print (train_addrs[0])
print (train_addrs[-1])

import tables as tb

# Create an HDF5 file with PyTables and create a VLArray
# filename to save the hdf5 file
hdf5_path = 'PetImages_data_1.h5'
with tb.open_file(hdf5_path, mode='w') as h5f:
    train_files_ds = h5f.create_vlarray('/', 'train_files',
                                        atom=tb.VLStringAtom() )
    # loop over train addresses
    for i in range(len(train_addrs)):
        # print progress every 500 images
        if i % 500 == 0 and i > 1:
            print ('Train data: {}/{}'.format(i, len(train_addrs)) )
        addr = train_addrs[i]
        train_files_ds.append(addr.encode('utf-8'))

with tb.open_file(hdf5_path, mode='r') as h5f:
    train_files_ds = h5f.root.train_files
    print ('Check PyTables data:')
    print (train_files_ds[0].decode('utf-8'))
    print (train_files_ds[-1].decode('utf-8'))

import h5py

# Create an HDF5 file with h5py and create a variable-length string dataset
# filename to save the hdf5 file
hdf5_path = 'PetImages_data_2.h5'
with h5py.File(hdf5_path, mode='w') as h5f:
    dt = h5py.special_dtype(vlen=str)  # can use string_dtype() with h5py 2.10
    train_files_ds = h5f.create_dataset('/train_files', (len(train_addrs),),
                                        dtype=dt )

    # loop over train addresses
    for i in range(len(train_addrs)):
        # print progress every 500 images
        if i % 500 == 0 and i > 1:
            print ('Train data: {}/{}'.format(i, len(train_addrs)) )
        addr = train_addrs[i]
        train_files_ds[i] = addr

with h5py.File(hdf5_path, mode='r') as h5f:
    train_files_ds = h5f['train_files']
    print ('Check h5py data:')
    print (train_files_ds[0])
    print (train_files_ds[-1])
