
Jagged Numpy Arrays in Boost Numpy

I need to efficiently share data between Python and C++ and use Boost Python and Boost Numpy for this. It works very well for "cartesian" (rectangular) arrays. For jagged arrays I am not sure whether direct indexing is possible. Here is an example that shows how I extract the sub-arrays from a jagged numpy array:

#include <boost/python.hpp>
#include <boost/python/numpy.hpp>

namespace p = boost::python;
namespace np = boost::python::numpy;

// Returns an int array holding the length of each sub-array.
np::ndarray jagged_array_length(np::ndarray& x) {
    int xn = x.shape(0);
    np::dtype dt = np::dtype::get_builtin<int>();
    p::tuple shape = p::make_tuple(xn);
    np::ndarray a = np::zeros(shape, dt);

    for (int i = 0; i < xn; i++) {
        np::ndarray row = p::extract<np::ndarray>(x[i]);
        a[i] = row.shape(0);
        // use row to access the elements ....
    }

    return a;
}

As expected:

>>> a
array([array([1, 2, 3]), array([10, 20])], dtype=object)
>>> jagged_array_length(a)
array([3, 2], dtype=int32)

Question 1: Is this the proper way of handling jagged arrays?

Question 2: I create an object np::ndarray row to index into the second dimension. This seems like a potential performance bottleneck. Can that somehow be avoided? Can I somehow use the lengths of the sub-arrays and index directly into the data buffer? I am not sure whether there is padding or a contiguous block of memory behind a jagged array; I assume there is not.

Alternatively, it would be possible to pad the jagged array beforehand:

import numpy as np

def fill_array(item, max_len, fill=np.nan):
    item_len = len(item)
    to_fill = [fill] * (max_len - item_len)
    as_list = list(item) + to_fill
    return as_list

def block_array(jagged_array, max_len, fill=np.nan) -> np.ndarray:
    return np.asarray([fill_array(item, max_len, fill) for item in jagged_array])

which pads each row:

>>> a
array([array([1, 2, 3]), array([10, 20])], dtype=object)
>>> a.shape
(2,)
>>> block_array(a,3)
array([[ 1.,  2.,  3.],
       [10., 20., nan]])
>>> b=block_array(a,3)
>>> b.shape
(2, 3)
>>> b
array([[ 1.,  2.,  3.],
       [10., 20., nan]])

But this is quite slow, and I did not find a very fast way to handle it, even with Numba.

Any suggestions to improve the performance?

Arrays of arrays are generally inefficient because the sub-arrays are Python objects, causing NumPy either to fall back on a slow path that interacts with the interpreter (due to reference counting, checks, indirections, etc.) or to implicitly convert them into one big array internally (which is generally not possible with jagged arrays). AFAIK, your Boost code should also interact with the interpreter internally in this case. In fact, an array of arrays is generally slower than a list of arrays, because NumPy arrays are not a built-in type, so CPython needs to call NumPy functions that perform many checks over and over. Additionally, NumPy does not support jagged arrays natively (see this related post).

Question 1: Is this the proper way of handling jagged arrays?

This is the usual way, but clearly not a fast one. An efficient way to encode a jagged array is to concatenate the sub-arrays into one big 1D array and use an additional array to store the start/stop indices (or offset/size information). That representation still lets you apply basic NumPy vectorized methods to the big array or to sub-slices of it. Numba can be used to speed up the iteration over sub-arrays. The same applies to C++ with Boost.
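A minimal sketch of this encoding in NumPy (the helper name `to_flat` is mine, not a library function): the jagged array is flattened into one contiguous buffer plus an offsets array, after which per-row lengths fall out of a single vectorized call.

```python
import numpy as np

# Hypothetical helper: flatten a jagged array-of-arrays into one
# contiguous data buffer plus an offsets array, where row i lives
# in data[offsets[i]:offsets[i+1]].
def to_flat(jagged):
    lengths = np.fromiter((len(row) for row in jagged),
                          dtype=np.int64, count=len(jagged))
    offsets = np.zeros(len(jagged) + 1, dtype=np.int64)
    np.cumsum(lengths, out=offsets[1:])
    data = np.concatenate([np.asarray(row) for row in jagged])
    return data, offsets

a = np.array([np.array([1, 2, 3]), np.array([10, 20])], dtype=object)
data, offsets = to_flat(a)
# data    -> array([ 1,  2,  3, 10, 20])
# offsets -> array([0, 3, 5])

# The equivalent of jagged_array_length, fully vectorized:
lengths = np.diff(offsets)
# lengths -> array([3, 2])
```

Building this representation pays off once, after which no per-row Python object is touched again.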

Question 2: I create an object np::ndarray row to index into the second dimension. This seems like a potential performance bottleneck. Can that somehow be avoided? Can I somehow use the lengths of the sub-arrays and index directly into the data buffer? I am not sure whether there is padding or a contiguous block of memory behind a jagged array; I assume there is not.

AFAIK, the performance bottleneck is the interaction with the CPython interpreter (via the CPython API) needed to deal with CPython objects. The above solution solves this problem since you only need to read an integer from the index array. The jagged array can be seen as a custom type, or as a simple tuple of two arrays (possibly three if you want to split the start/stop indices or the offsets/sizes). This representation is similar to the one used for sparse matrices.
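To illustrate how the flat representation avoids per-row objects (assuming `data`/`offsets` built as described above): accessing a row is just an integer read plus a cheap view, and many per-row reductions can be done without any loop at all.

```python
import numpy as np

# Assumed flat representation: one contiguous buffer `data` and an
# `offsets` array with offsets[i]:offsets[i+1] delimiting row i.
data = np.array([1, 2, 3, 10, 20], dtype=np.float64)
offsets = np.array([0, 3, 5], dtype=np.int64)

def row(i):
    # No Python object per row: two integer reads and a slice view
    # into the contiguous buffer.
    return data[offsets[i]:offsets[i + 1]]

# Example of a vectorized per-row reduction over the flat buffer:
sums = np.add.reduceat(data, offsets[:-1])
# row(1) -> array([10., 20.])
# sums   -> array([ 6., 30.])
```

The same layout works on the C++ side: pass `data` and `offsets` as two ordinary contiguous ndarrays, and index the raw buffer directly without extracting per-row objects.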

Note that arrays of array objects are indeed not stored contiguously in memory: each array object is stored at a different location, independent of the other arrays and determined by the underlying CPython allocator. An array of array objects is typically stored as a pointer to an object structure containing a pointer to a memory buffer that holds pointers to objects, each of which in turn points to its own memory buffer. This causes a lot of pointer indirections and thus poor performance.
