
Apply function along ranges in numpy array

Say I have the following numpy array:

a = np.arange(20)

And also an array containing indices as follows:

ix = np.array([4,10,15])

I've been trying to come up with a vectorized solution to the following question: how can I apply a function along a, with a being split at the indices in ix?

So say I were to split a with np.split (I'm only using np.split here to illustrate the groups to which I would like to apply a function):

np.split(a,ix)

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

And say, for instance, I'd like to take the sum of each chunk, giving:

[6, 39, 60, 85]

How could I vectorize this using numpy?

I do not know if this is the best solution, but you could convert the list of arrays of different sizes into a list of fixed-size arrays by padding with zeros, and then apply a function such as sum that is not affected by the zeros.

See the example below.

a = np.arange(20)
ix = np.array([4,10,15])
b = np.split(a,ix)
print(b)

results in

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7, 8, 9]),
 array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19])]

Then use itertools to convert the list to an array (from here):

import itertools
c = np.array(list(itertools.zip_longest(*b, fillvalue=0))).T
print(c)

which results in

[[ 0  1  2  3  0  0]
 [ 4  5  6  7  8  9]
 [10 11 12 13 14  0]
 [15 16 17 18 19  0]]

then sum it using

np.sum(c, axis = 1)

results in

array([ 6, 39, 60, 85])

split produces a list of arrays, which may differ in length. It actually does so iteratively:

In [12]: alist = []
In [13]: alist.append(a[0:ix[0]])
In [14]: alist.append(a[ix[0]:ix[1]])
In [15]: alist.append(a[ix[1]:ix[2]])
....

Applying sum to each element of the list individually makes sense:

In [11]: [np.sum(row) for row in alist]
Out[11]: [6, 39, 60, 85]
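The iterative slicing above can also be written as an explicit loop over chunk boundaries. This is only a minimal sketch, assuming the a and ix from the question; bounds and chunk_sums are names introduced here for illustration:

```python
import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])

# Chunk boundaries: start at 0, break at every index in ix, end at len(a).
bounds = np.concatenate(([0], ix, [len(a)]))

# Python-level loop over (start, stop) pairs; each basic slice is a view, not a copy.
chunk_sums = [int(a[start:stop].sum())
              for start, stop in zip(bounds[:-1], bounds[1:])]
print(chunk_sums)  # [6, 39, 60, 85]
```

The loop still runs in Python, so it illustrates the grouping rather than removing the iteration.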

When you have a list of arrays that differ in shape, it's a good bet that you'll have to do a Python-level iteration on it.

Fast 'vectorize' means performing the calculations in compiled code. Most of that is built around multidimensional arrays, e.g. 2-D ones. If your split had produced equal-size arrays, you could use np.sum with the appropriate axis parameter.

In [23]: a1 = a.reshape(4,5)
In [24]: a1
Out[24]: 
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
In [25]: np.sum(a1, axis=1)
Out[25]: array([10, 35, 60, 85])
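For chunks of unequal length, one way to keep the reduction in compiled code is the reduceat method of NumPy ufuncs, which reduces over the slices between consecutive start indices. It is not used in the answers here, so treat the following as a sketch, assuming the chunks are contiguous and non-empty and that ix is sorted; starts and sums are names introduced for illustration:

```python
import numpy as np

a = np.arange(20)
ix = np.array([4, 10, 15])

# reduceat needs the start index of every chunk, so prepend 0 to ix.
starts = np.concatenate(([0], ix))

# Sums a[0:4], a[4:10], a[10:15] and a[15:] without a Python-level loop.
sums = np.add.reduceat(a, starts)
print(sums)  # [ 6 39 60 85]
```

The same pattern works with other binary ufuncs (for example np.maximum.reduceat), but not with arbitrary Python functions.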

Sometimes we can play tricks to cast the problem into an n-d one, for example if the first array of your split were padded with a 0. But that casting itself might require iteration.

As raised here (and in its links), Origin of AttributeError: object has no attribute 'cos', math (ufunc) functions applied to object dtype arrays end up delegating the action to the corresponding methods of the objects. But that still involves a (near) Python-level iteration over the objects.
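The delegation can be seen with a small class that defines its own cos method. This is only an illustrative sketch; the Angle class is hypothetical and not part of the thread:

```python
import math
import numpy as np

class Angle:
    """Hypothetical wrapper class, used only to show the delegation."""

    def __init__(self, radians):
        self.radians = radians

    def cos(self):
        # np.cos looks up and calls this method for each element
        # of an object dtype array.
        return math.cos(self.radians)

objs = np.array([Angle(0.0), Angle(math.pi)], dtype=object)

# One method call per element; the result is again an object array.
result = np.cos(objs)
print(result.dtype)  # object
```

Without a cos method on the elements, the same call fails, which is the error the linked question is about.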


Some timings:

In [57]: timeit [np.sum(row) for row in alist]
31.7 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [58]: timeit np.sum(list(itertools.zip_longest(*alist, fillvalue=0)),axis=0)
25.2 µs ± 82 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [59]: timeit np.nansum(pd.DataFrame(alist), axis=1)
908 µs ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [61]: timeit np.frompyfunc(sum,1,1)(alist)
12.9 µs ± 21.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this last case the Python sum is faster than np.sum. But that's true with the list comprehension as well:

In [63]: timeit [sum(row) for row in alist]
6.86 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And with Divakar's whiz-bang fillna, from "Numpy: Fix array with rows of different lengths by filling the empty elements with zeros":

In [70]: timeit numpy_fillna(np.array(alist)).sum(axis=1)
44.2 µs ± 208 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
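numpy_fillna is not defined in this thread; it comes from the linked answer. The following is a minimal sketch of a mask-based zero-padding helper in the same spirit, not necessarily identical to that code; pad_with_zeros is a hypothetical name:

```python
import numpy as np

def pad_with_zeros(chunks):
    """Pad a list of 1-D arrays with zeros into one 2-D array (hypothetical helper)."""
    lengths = np.array([len(c) for c in chunks])
    # mask[i, j] is True where row i has a real value in column j.
    mask = np.arange(lengths.max()) < lengths[:, None]
    out = np.zeros(mask.shape, dtype=chunks[0].dtype)
    # Boolean assignment fills the True slots in row-major order,
    # which matches the order of the concatenated chunks.
    out[mask] = np.concatenate(chunks)
    return out

a = np.arange(20)
ix = np.array([4, 10, 15])
padded = pad_with_zeros(np.split(a, ix))
print(padded.sum(axis=1))  # [ 6 39 60 85]
```

Zero padding is only safe for reductions that ignore zeros, such as sum; a mean or a minimum would need the mask as well.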

Once you have a multidimensional array, the numpy code is fast. But if you start with a list, even a list of arrays, Python list methods are often faster. The time it takes to construct an array (or DataFrame) is never trivial.

A pandas solution would be:

import numpy as np
import pandas as pd

a = np.arange(20)

ix = np.array([4, 10, 15])

data = pd.DataFrame(np.split(a, ix))

print(np.nansum(data, axis=1))

Output:

[ 6. 39. 60. 85.]
