Suppress the NumPy array creation protocol for NumPy arrays of objects

I am trying to build a library which reads complex HDF5 data files in Python.

I am running into a problem where an HDF5 Dataset somehow implements the default array protocol (sometimes), so that when a numpy array is created from it, the result is cast to that dataset's array type.

In [8]: ds
Out[8]: <HDF5 dataset "two_by_zero_empty_matrix": shape (2,), type "<u8">

In [9]: ds.value
Out[9]: array([2, 0], dtype=uint64)

This Dataset object implements the numpy array protocol, and when the dataset consists of numbers, it supplies a default array type.

In [10]: np.array(ds)
Out[10]: array([2, 0], dtype=uint64)

However, if the dataset doesn't consist of numbers but of some other objects, then, as you would expect, it just produces a numpy array of type np.object:

In [43]: ds2
Out[43]: <HDF5 dataset "somecells": shape (2, 3), type "|O8">

In [44]: np.array(ds2)
Out[44]: 
array([[<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>],
       [<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>]], dtype=object)

This behavior might seem convenient, but in my case it's actually inconvenient since it interferes with my recursive traversal of the data file. Working around it turns out to be difficult, since there are a lot of different possible data types which have to be special-cased a little differently depending on whether they are children of objects or arrays of numbers.

My question is this: is there a way to suppress the default array creation protocol, such that I could create an object array out of dataset objects that want to cast to their natural duck types?

That is, I want something like np.array(ds, dtype=object), which would produce an array like [<Dataset object of type int>, dtype=object] and not [3 4 5, dtype=int].

But np.array(ds, dtype=np.object) throws IOError: Can't read data (No appropriate function for conversion path)

I tried in earnest to google some documentation about how the numpy array protocol works, and found a lot, but it doesn't appear that anyone considered the possibility that someone might want this behavior.

I can understand where the Out[44] is coming from. It's an array containing pointers to objects, in this case h5py references to objects on the file (I think).

With np.array(ds, dtype=object), are you trying to create something more like that, rather than the 'normal' array that you get with np.array(ds), i.e. array([2, 0], dtype=uint64)?

But what is the parallel array? A single element array with a pointer to ds ? Or a 2 element array with pointers to 2 and 0 somewhere on the file? What if they aren't <HDF5 object reference> ?

In numpy, without any h5py stuff, I can create an object array from a list of values:

In [104]: np.array([2,0], dtype=object)
Out[104]: array([2, 0], dtype=object)

Or I can start with an empty array (filled with None) and assign values:

In [105]: x=np.empty((2,), dtype=object)
In [106]: x[0]=2
In [107]: x[1]=0
In [108]: x
Out[108]: array([2, 0], dtype=object)

I guess you could try:

x[0] = ds[0]
or
x[:] = ds[:]

Or make a single element object array

x = np.empty((), dtype=object)
x[()] = ds
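The single-element idea can be tried without an HDF5 file. What follows is a minimal sketch, assuming a hypothetical stand-in class (FakeDataset, not part of h5py) that implements __array__ the way a dataset might; box is likewise a name made up for illustration. It only shows that assignment through the () index stores the object itself rather than its converted values.

```python
import numpy as np


class FakeDataset:
    """Hypothetical stand-in for an array-like such as an h5py Dataset."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # np.array(obj) calls this hook; this is what coerces the object.
        return np.array(self._values, dtype=dtype)


def box(obj):
    """Wrap obj in a 0-d object array without converting it."""
    holder = np.empty((), dtype=object)
    holder[()] = obj  # stores a reference to obj itself
    return holder


ds = FakeDataset([2, 0])
coerced = np.array(ds)  # goes through __array__ and yields the numbers
boxed = box(ds)         # 0-d object array that holds ds itself
print(coerced.dtype.kind, boxed.dtype, boxed[()] is ds)
```

Whether a real Dataset behaves the same way still has to be checked against an open file.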

I don't have an h5py test file open in my IPython session to test this. But I can do something weird like make an object array that contains itself. I can work with it, but I can't display it without getting a recursion error.

In [118]: x=np.empty((),dtype=object)
In [119]: x[()]=x
In [120]: x1=x[()]
In [121]: x1==x
Out[121]: True

I have a small h5py file open on another terminal:

In [315]: list(f.keys())
Out[315]: ['d', 'x', 'y']
In [317]: f['d']    # the group
Out[317]: <HDF5 group "/d" (2 members)>

x is a string:

In [318]: f['x']    # a single element (a string)
Out[318]: <HDF5 dataset "x": shape (), type "|O4">
In [330]: f['x'].value
Out[330]: 'astring'
In [331]: np.array(f['x'])
Out[331]: array('astring', dtype=object)

y is an array:

In [320]: f['y'][:]
Out[320]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [321]: f['y'].value
Out[321]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [322]: np.array(f['y'])
Out[322]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [323]: timeit np.array(f['y'])
1000 loops, best of 3: 364 µs per loop
In [324]: timeit f['y'].value
1000 loops, best of 3: 380 µs per loop

So access via .value and via np.array is equivalent.

Access as object array gives the same sort of error as you got.

In [325]: np.array(f['y'],dtype=object)
...
OSError: can't read data (Dataset: Read failed)

Conversion to float works fine:

In [326]: np.array(f['y'],dtype=float)
Out[326]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

And the assignment to a predefined object array works:

In [327]: x=np.empty((),dtype=object)
In [328]: x[()]=f['y']
In [329]: x
Out[329]: array(<HDF5 dataset "y": shape (10,), type "<i4">, dtype=object)

Trying to create a 10-element array to take y:

In [332]: y1=np.empty((10,),dtype=object)
In [333]: y1[:]=f['y']
...
OSError: can't read data (Dataset: Read failed)
In [334]: y1[:]=f['y'].value
In [335]: y1
Out[335]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)

y1[:]=f['y'][:] also works.

I can't assign the dataset to y1 (same error as when I tried np.array(f['y'],dtype=object)). But I can assign its values. I can even assign the dataset to one element of y1:

In [338]: y1[-1]=f['y']
In [339]: y1
Out[339]: 
array([0, 1, 2, 3, 4, 5, 6, 7, 8,
       <HDF5 dataset "y": shape (10,), type "<i4">], dtype=object)

I keep coming back to the basic idea that an object array is just a collection of pointers, essentially a list in an array wrapper.
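One way to act on that idea, sketched under assumptions: allocate the object array first and fill it one element at a time, so numpy never gets a chance to convert the items. FakeDataset and object_array are hypothetical names used only for illustration, with FakeDataset standing in for any object that implements __array__; the sketch has not been run against a real HDF5 file.

```python
import numpy as np


class FakeDataset:
    """Hypothetical stand-in for an object that implements __array__."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        return np.array(self._values, dtype=dtype)


def object_array(items):
    """Return a 1-D object array whose elements are the items themselves."""
    items = list(items)
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item  # integer index: a reference is stored, no conversion
    return out


a = FakeDataset([2, 0])
b = FakeDataset([3, 4, 5])
arr = object_array([a, b])
print(arr.shape, arr.dtype, arr[1] is b)
```

Each element can still be converted on demand with np.array(arr[i]), which keeps the decision about when to read values in the caller's hands during a recursive traversal.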
