Most efficient way to operate on an n-dim array based on a reference n-dim array

I have two numpy arrays of the same shape: dat_ara and ref_ara.

I would like to perform an operation op_func along axis=-1 of dat_ara. However, I only want to operate on a selected slice of values in each 1-dim sub-array: the slice runs from the start up to (and including) the first position at which a threshold value thres is crossed in the reference array ref_ara.

For illustration, in the simple case where the arrays are just 2-dim, I have:

thres = 4

op_func = np.average

ref_ara = array([[1, 2, 1, 4, 3, 5, 1, 5, 2, 5],
                 [1, 2, 2, 1, 1, 1, 2, 7, 5, 8],
                 [2, 3, 2, 5, 1, 6, 5, 2, 7, 3]]) 

dat_ara = array([[1, 0, 0, 1, 1, 1, 1, 0, 1, 1],
                 [1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
                 [1, 0, 1, 1, 1, 1, 0, 1, 1, 1]]) 

We see that thres is first breached at indices 5, 7 and 3 of the 1st, 2nd and 3rd rows along axis=0 of ref_ara. Therefore the outcome I desire would be

out_ara = array([op_func(array([1, 0, 0, 1, 1, 1])),
                 op_func(array([1, 1, 1, 1, 1, 1, 1, 0])),
                 op_func(array([1, 0, 1, 1]))])
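
For concreteness, here is a plain-loop version that produces the output I want (slow, but it defines the expected result; the helper name loop_version is just for illustration):

import numpy as np

def loop_version(dat_ara, ref_ara, thres, op_func):
    # Flatten all leading dimensions, then slice each row up to and
    # including the first index where ref_ara crosses thres.
    # NB: assumes each row crosses thres at least once.
    out = []
    for dat_row, ref_row in zip(dat_ara.reshape(-1, dat_ara.shape[-1]),
                                ref_ara.reshape(-1, ref_ara.shape[-1])):
        stop = np.argmax(ref_row > thres)       # first index above thres
        out.append(op_func(dat_row[:stop + 1]))
    return np.array(out).reshape(dat_ara.shape[:-1])

# loop_version(dat_ara, ref_ara, 4, np.average)
# -> array([0.83333333, 0.875, 0.75])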

This problem is difficult because it requires referencing ref_ara. If that were not the case, I could simply use numpy.apply_along_axis.

I have tried expanding the dimensions of the two arrays so they can be stacked together for the computation, i.e.:

# stack dat_ara and ref_ara along a new last axis
assos_ara = np.append(np.expand_dims(dat_ara, axis=-1), np.expand_dims(ref_ara, axis=-1), axis=-1)

But again, numpy.apply_along_axis requires the input function to operate on 1-dim arrays only, so I still cannot use it here.

The only other way I know is to iterate through the arrays index by index. However, with the dimensions of the two arrays constantly changing, that is tricky to write, and it is not computationally efficient either.

I would like to use vectorised functions as much as possible. What is the most efficient way to go about this?

This is a good use-case for masked arrays, since they allow you to perform normal numpy operations on portions of your data.

Let's assume that every row contains at least one value that is greater than the threshold. You can compute the indices of the break points as

breaks = np.argmax(ref_ara > thres, axis=-1)   # 5, 7, 3
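
np.argmax returns 0 for a row that never crosses thres, which would silently keep only the first element. If you cannot guarantee the assumption above, one possible guard (just a sketch, choosing to treat such rows as fully unmasked):

has_break = (ref_ara > thres).any(axis=-1)                    # rows that actually breach thres
breaks = np.where(has_break, breaks, ref_ara.shape[-1] - 1)   # keep the whole row otherwise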

You can then create a mask using the answer to the question I had linked earlier. Masks are generally the best way to deal with irregularly shaped data in numpy.

mask = np.arange(ref_ara.shape[-1]) <= breaks.reshape(*breaks.shape, 1)

Here, we don't need to do anything fancy with the arange, because it runs along the last dimension. If that were not the case, you would want to insert a 1 into the shape of breaks where the range goes, and pad the tail of the range's shape with ones as well, as in the sketch below.
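
To illustrate, a sketch for a hypothetical 3-D array, operating along axis=1 instead (the names ref_g and mask_g are mine):

import numpy as np

thres = 4
ref_g = np.random.randint(0, 10, size=(2, 5, 3))   # hypothetical (a, b, c) array
axis = 1

breaks_g = np.argmax(ref_g > thres, axis=axis)     # shape (2, 3)
breaks_g = np.expand_dims(breaks_g, axis=axis)     # insert a 1 -> (2, 1, 3)

# Range along the reduced axis, padded with trailing ones so that
# broadcasting lines it up: shape (5, 1) aligns against (2, 1, 3).
rng = np.arange(ref_g.shape[axis]).reshape(-1, *[1] * (ref_g.ndim - axis - 1))

mask_g = rng <= breaks_g                           # shape (2, 5, 3)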

Now the masked array and ufunc solutions diverge slightly. The masked array version is more general, so it comes first:

data = np.ma.array(dat_ara, mask=~mask)

Masked arrays interpret the mask backwards from how normal boolean indexing does: True marks an element as invalid, so we invert the mask. Alternatively, you could compute the mask with > instead of <=. The computation is now trivial:

out_ara = np.ma.average(data, axis=-1).data
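
As a quick check on the example data, the result matches the averages of the three slices from the question:

expected = np.array([5 / 6, 7 / 8, 3 / 4])   # averages of the three slices above
np.testing.assert_allclose(out_ara, expected)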

A much less general alternative is to break your operation down into ufuncs, and use the masking they provide as well. This is easy for np.average , which is just np.sum and np.divide , but may be harder for more complicated operations.

As of numpy 1.17.0, np.sum has a where keyword:

out_ara = np.sum(dat_ara, where=mask, axis=-1) / (breaks + 1)   # breaks + 1 elements survive the mask in each row
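
Side note: on numpy 1.20.0 and later, np.mean itself accepts a where keyword, which handles the element count for you:

out_ara = np.mean(dat_ara, where=mask, axis=-1)   # numpy >= 1.20.0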
