Apply function along ranges in numpy array

Question

Say I have the following numpy array:

a = np.arange(20)

And also an array containing indices as follows:

ix = np.array([4,10,15])

I've been trying to come up with a vectorized solution to the following problem: how can I apply a function to the chunks of a obtained by splitting it at the indices in ix?

So say I were to split a with np.split (I'm only using np.split here to illustrate the groups to which I would like to apply a function):

np.split(a,ix)

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

And say, for instance, I'd like to take the sum of each chunk, giving:

[6, 39, 60, 85]

How could I vectorize this using numpy?

Answer 1

I do not know if this is the best solution, but you could convert the list of arrays of different sizes into a single array of fixed row length by padding the shorter arrays with zeros. Then apply a function, like sum, that is not affected by the extra zeros.

See an example below.

a = np.arange(20)
ix = np.array([4,10,15])
b = np.split(a,ix)
print(b)

results in

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

Then use itertools.zip_longest to pad the list and convert it to a 2D array:

import itertools
c = np.array(list(itertools.zip_longest(*b, fillvalue=0))).T
print(c)

which results in

[[ 0  1  2  3  0  0]
 [ 4  5  6  7  8  9]
 [10 11 12 13 14  0]
 [15 16 17 18 19  0]]

then sum it using

np.sum(c, axis = 1)

results in

array([ 6, 39, 60, 85])
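Zero padding only works when 0 does not change the result, as with sum. For a reduction such as the mean, one possible variation (a sketch, not part of the original answer) is to pad with NaN and use a NaN-aware function. The names chunks, padded and means are only illustrative:

```python
import itertools
import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])
chunks = np.split(a, ix)

# Pad with NaN instead of 0; mixing ints with NaN yields a float64 array.
padded = np.array(list(itertools.zip_longest(*chunks, fillvalue=np.nan))).T

# nanmean skips the padding, so each row averages only its real values.
means = np.nanmean(padded, axis=1)
print(means)
```

With zero padding the first row would average to 1.0 (6 / 6) instead of 1.5, which is why the fill value has to match the function being applied.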

Answer 2

np.split produces a list of arrays, which may differ in length. It builds that list iteratively, roughly like this:

In [12]: alist = []
In [13]: alist.append(a[0:ix[0]])
In [14]: alist.append(a[ix[0]:ix[1]])
In [15]: alist.append(a[ix[1]:ix[2]])
....

Applying sum to each element of the list individually makes sense:

In [11]: [np.sum(row) for row in alist]
Out[11]: [6, 39, 60, 85]

When you have a list of arrays that differ in shape, it's a good bet that you'll have to do a Python level iteration on it.
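If the function is arbitrary, the Python-level loop can at least be wrapped in a small helper. This is only a sketch; apply_along_splits is a hypothetical name, not a NumPy function:

```python
import numpy as np

def apply_along_splits(func, arr, split_indices):
    """Apply func to each chunk produced by np.split(arr, split_indices)."""
    # Plain Python iteration: one call to func per chunk.
    return [func(chunk) for chunk in np.split(arr, split_indices)]

a = np.arange(20)
ix = np.array([4, 10, 15])

sums = apply_along_splits(np.sum, a, ix)
# Any callable works, for example the value range of each chunk.
spans = apply_along_splits(lambda c: c.max() - c.min(), a, ix)
print(sums, spans)
```

This does not make anything faster; it just keeps the iteration in one place so the function can be swapped freely.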

Fast 'vectorized' code means performing the calculations in compiled code. Most of that is built around multidimensional arrays, e.g. 2d ones. If your split had produced equal-size arrays, you could use np.sum with the appropriate axis parameter.

In [23]: a1 = a.reshape(4,5)
In [24]: a1
Out[24]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
In [25]: np.sum(a1, axis=1)
Out[25]: array([10, 35, 60, 85])

Sometimes we can play tricks to cast the problem into an n-d one, for example if the shorter arrays of the split were padded with 0. But that casting itself might require iteration.
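As a sketch of such a trick, the padded 2d array can be built from the split indices with a boolean mask, without a Python loop over the chunks. The variable names are illustrative, and the masked-array step is an assumption about how one might handle reductions where the 0 padding would matter:

```python
import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])

# Chunk lengths from the split boundaries: [4, 6, 5, 5].
bounds = np.r_[0, ix, a.size]
lengths = np.diff(bounds)

# mask[i, j] is True where row i holds a real value.
mask = np.arange(lengths.max()) < lengths[:, None]

# Boolean assignment fills the True slots in row-major order.
padded = np.zeros(mask.shape, dtype=a.dtype)
padded[mask] = a

sums = padded.sum(axis=1)

# For reductions where the padding would interfere, hide it with a masked array.
maxes = np.ma.masked_array(padded, mask=~mask).max(axis=1)
print(sums, maxes)
```

The cost is the memory for the padded array, which grows with the longest chunk rather than with the total number of elements.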

As discussed in "Origin of AttributeError: object has no attribute 'cos'" (and the questions linked from it), math (ufunc) functions applied to object dtype arrays end up delegating the action to the corresponding methods of the objects. But that still involves a (near) Python-level iteration over the objects.
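A minimal sketch of the object dtype route follows. It builds the object array explicitly, on the assumption that recent NumPy releases no longer create one implicitly from ragged input; obj and py_sum are illustrative names:

```python
import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])
chunks = np.split(a, ix)

# Build a 1d object array explicitly, one slot per chunk.
obj = np.empty(len(chunks), dtype=object)
for i, chunk in enumerate(chunks):
    obj[i] = chunk

# frompyfunc wraps Python's sum as a ufunc that is called once per element.
py_sum = np.frompyfunc(sum, 1, 1)
result = py_sum(obj)
print(result)
```

The result is itself an object dtype array, and the work is still one Python call per chunk.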

Some timings:

In [57]: timeit [np.sum(row) for row in alist]
31.7 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [58]: timeit np.sum(list(itertools.zip_longest(*alist, fillvalue=0)),axis=0)
25.2 µs ± 82 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [59]: timeit np.nansum(pd.DataFrame(alist), axis=1)
908 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [61]: timeit np.frompyfunc(sum,1,1)(alist)
12.9 µs ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this last case the Python sum is faster than np.sum. But that's true with the list comprehension as well:

In [63]: timeit [sum(row) for row in alist]
6.86 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And with Divakar's whiz-bang numpy_fillna, from "Numpy: Fix array with rows of different lengths by filling the empty elements with zeros":

In [70]: timeit numpy_fillna(np.array(alist)).sum(axis=1)
44.2 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Once you have a multidimensional array, the numpy code is fast. But if you start with a list, even a list of arrays, Python list methods are often faster. The time it takes to construct an array (or DataFrame) is never trivial.
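For reductions that are themselves ufuncs (sum, max, min and similar), another option that skips the list entirely is the reduceat method of the ufunc, applied to the original array. This is a sketch of an approach not timed above; starts and the other names are illustrative:

```python
import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])

# reduceat needs the start index of every chunk, including the first.
starts = np.r_[0, ix]

sums = np.add.reduceat(a, starts)
maxes = np.maximum.reduceat(a, starts)

# For sums only, the same result follows from differences of a cumulative sum.
csum = np.r_[0, np.cumsum(a)]
bounds = np.r_[0, ix, a.size]
sums_from_cumsum = csum[bounds[1:]] - csum[bounds[:-1]]
print(sums, maxes, sums_from_cumsum)
```

reduceat assumes the start indices are strictly increasing. For an empty chunk (two equal indices) it returns the element at that index rather than 0, so such cases need separate handling.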

Answer 3

A pandas solution would be:

import numpy as np
import pandas as pd

a = np.arange(20)

ix = np.array([4, 10, 15])

data = pd.DataFrame(np.split(a, ix))

print(np.nansum(data, axis=1))

Output

[ 6. 39. 60. 85.]
