Numpy how to exclude array in vectorization?

I have a function:

import numpy as np

t = np.array([['t1', 0],['t2',0],['t3',1],['t4',1]])
i = np.array(['t3', 't4'])

def myfunc(d, x): 
    return d[:,1][np.where(d[:,0] == x)]

vfunc = np.vectorize(myfunc, excluded=['d'])
vfunc(d=t,x=i)

Expected output is: array(['1', '1'], dtype='<U2')

It gives the error: ValueError: setting an array element with a sequence

I don't see why this doesn't work, given that I followed the excluded argument in the documentation: https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html

As hpaulj said: "Know, test, don't guess."

It was a bit difficult to pinpoint, but here it is.

The docs say: By default, pyfunc is assumed to take scalars as input and output.

To remove this error, add the otypes argument. You will not get your exact output, but numpy can at least work out correctly what it needs to return:

vfunc = np.vectorize(myfunc, excluded=['d'], otypes='O')
signature : string, optional

    Generalized universal function signature, e.g., (m,n),(n)->(m) for
 vectorized matrix-vector multiplication. If provided, pyfunc will be called
 with (and expected to return) arrays with shapes given by the size of
 corresponding core dimensions. By default, pyfunc is assumed to take scalars
 as input and output.
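
As a minimal sketch of what the quoted signature parameter allows, assuming every key matches the same number of rows (as in the question's data): the name vfunc_sig is only illustrative, and the signature '()->(n)' is my reading of "scalar in, 1-D array out". If the per-key results had different lengths, this approach would not apply.

```python
import numpy as np

t = np.array([['t1', 0], ['t2', 0], ['t3', 1], ['t4', 1]])
i = np.array(['t3', 't4'])

def myfunc(d, x):
    # Returns a 1-D array holding every value whose key equals x.
    return d[:, 1][np.where(d[:, 0] == x)]

# '()->(n)' declares: scalar x in, 1-D array of length n out.
# Assumption: n is the same for every x (here each key matches one row).
vfunc_sig = np.vectorize(myfunc, excluded=['d'], signature='()->(n)')

result = vfunc_sig(d=t, x=i)
print(result.shape)    # (2, 1): one length-1 row per key
print(result.ravel())  # flattened to one value per key
```

Flattening with ravel() recovers the one-value-per-key layout the question asked for.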

Below is the detailed reasoning behind the proposed solution.

Your function does not return a scalar; it returns a vector for each input scalar. While each input is a scalar ('t3' and 't4'), the actual outputs are the vectors array(['1'], dtype='<U2') and array(['1'], dtype='<U2') respectively.

Since you don't specify a signature, numpy assumes the outputs are scalars. It infers the output dtype from a trial call on the first element (a string dtype here, whereas holding array results would need dtype object) and then tries to pack all the results into one array of that dtype.

That is the source of ValueError: setting an array element with a sequence: your function returns a vector for each input scalar rather than a scalar, and no signature is defined (which is why numpy assumes the output is scalar).

Also look at the example similar to yours in the numpy vectorize docs; it returns a scalar, not a vector:

def mypolyval(p, x):
    _p = list(p)
    res = _p.pop(0)
    while _p:
        res = res*x + _p.pop(0)
    return res

vpolyval = np.vectorize(mypolyval, excluded=['p'])
vpolyval(p=[1, 2, 3], x=[0, 1])
array([3, 6])
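
Following the same pattern, here is a sketch of a scalar-returning variant of the question's function. The name first_match is hypothetical, and taking only the first match is my assumption about what is wanted; it fits the question's data, where each key matches exactly one row.

```python
import numpy as np

t = np.array([['t1', 0], ['t2', 0], ['t3', 1], ['t4', 1]])
i = np.array(['t3', 't4'])

def first_match(d, x):
    # Hypothetical variant of myfunc: the trailing [0] turns the
    # length-1 array into a scalar, which is what np.vectorize expects.
    # Raises IndexError if x matches no row.
    return d[d[:, 0] == x, 1][0]

vfirst = np.vectorize(first_match, excluded=['d'])
result = vfirst(d=t, x=i)
print(result)  # one string per key
```

Because each call now returns a scalar, no otypes or signature argument is needed.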

I came to this conclusion through debugging, since I don't have much experience with vectorization myself.

While debugging in function_base.py, at line L2257 the outputs are accumulated and have the value array([array(['1'], dtype='<U2'), array(['1'], dtype='<U2')], dtype=object). Then at L2260 numpy tries to convert them to the required dtype, but fails because it assumed a sequence of scalars and got a sequence of sequences.

Just put a breakpoint in vscode and inspect the variable outputs to see this for yourself.

np.vectorize does "vectorize" in the sense that it allows you to pass array(s) to a function that otherwise only works with scalar values. But it can be tricky to use correctly, and it does not improve performance. It does not compile your function, so low-level concepts of 'vectorization' do not apply.


In [1]: t = np.array([['t1', 0],['t2',0],['t3',1],['t4',1]])
   ...: i = np.array(['t3', 't4'])
   ...: 
In [2]: def myfunc(d, x):
   ...:     return d[:,1][np.where(d[:,0] == x)]
   ...: 
   ...: vfunc = np.vectorize(myfunc, excluded=['d'])

Your problem, with the full traceback:

In [3]: vfunc(d=t,x=i)
Traceback (most recent call last):
  File "<ipython-input-3-ea7904300378>", line 1, in <module>
    vfunc(d=t,x=i)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2163, in __call__
    return self._vectorize_call(func=func, args=vargs)
  File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 2249, in _vectorize_call
    res = asanyarray(outputs, dtype=otypes[0])
ValueError: setting an array element with a sequence

What does your function produce? Arrays!

In [4]: myfunc(t,i[0])
Out[4]: array(['1'], dtype='<U21')
In [5]: myfunc(t,i[1])
Out[5]: array(['1'], dtype='<U21')

Let's try again, this time displaying the values passed to the function, and setting otypes. Since you read enough to use excluded, you must have come across the otypes parameter as well. This issue often causes problems for SO questioners. (Based on the trial calculation, your vectorize had set otypes to str, resulting in the ValueError.)

In [6]: def myfunc(d, x):
   ...:     print(d,x)
   ...:     return d[:,1][np.where(d[:,0] == x)]
   ...: 
   ...: vfunc = np.vectorize(myfunc, excluded=['d'], otypes=['O'])
In [7]: vfunc(d=t,x=i)
[['t1' '0']
 ['t2' '0']
 ['t3' '1']
 ['t4' '1']] t3
[['t1' '0']
 ['t2' '0']
 ['t3' '1']
 ['t4' '1']] t4
Out[7]: 
array([array(['1'], dtype='<U21'), array(['1'], dtype='<U21')],
      dtype=object)

In [8]: np.hstack(_)
Out[8]: array(['1', '1'], dtype='<U21')

excluded did work: the whole d was passed each time. An alternative would be to define the function to use a global array, t, rather than expect it as an argument:

def myfunc(x):
   print(x)
   return t[:,1][np.where(t[:,0] == x)]
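
If relying on a global feels fragile, a closure gives the same effect. This is only a sketch; make_lookup and lookup are hypothetical names, not part of numpy or of the original answer.

```python
import numpy as np

def myfunc(d, x):
    return d[:, 1][np.where(d[:, 0] == x)]

def make_lookup(d):
    # Hypothetical helper: binds d in a closure so the vectorized
    # function only ever receives x.
    def lookup(x):
        return myfunc(d, x)
    # otypes=[object] because each call returns an array, not a scalar.
    return np.vectorize(lookup, otypes=[object])

t = np.array([['t1', 0], ['t2', 0], ['t3', 1], ['t4', 1]])
i = np.array(['t3', 't4'])

vlookup = make_lookup(t)
result = vlookup(i)                   # object array of small arrays
flat = np.concatenate(list(result))   # join them into one 1-D array
print(flat)
```

This keeps the lookup table explicit at the call site while still needing no excluded argument.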

Let's generalize the arguments a bit:

In [12]: t = np.array([['t1', 0],['t2',0],['t3',1],['t4',1],['t4',10]])
    ...: i = np.array(['t3', 't4','t5'])
    ...: 
    ...: 
In [13]: vfunc(d=t,x=i)
[['t1' '0']
 ['t2' '0']
 ['t3' '1']
...
Out[13]: 
array([array(['1'], dtype='<U21'), array(['1', '10'], dtype='<U21'),
       array([], dtype='<U21')], dtype=object)

The result is 3 arrays with different sizes. hstack would still work.

But with just one 1d argument, i, it would be just as easy, and faster, to use:

In [14]: np.array([myfunc(t,j) for j in i])
[['t1' '0']
 ['t2' '0']
 ['t3' '1']
 ....
<ipython-input-14-207a42bdb1fb>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  np.array([myfunc(t,j) for j in i])
Out[14]: 
array([array(['1'], dtype='<U21'), array(['1', '10'], dtype='<U21'),
       array([], dtype='<U21')], dtype=object)

vectorization

This can have many meanings. simd vectorization is a new one for me, but then I haven't done low-level ('C') programming in years. np.vectorize does not compile or alter your function in any way. It has a clear performance disclaimer (I suppose it could be in bold and at the start of the docs). It's intended primarily for functions that take scalar values and return similarly simple results. It is most useful when the function takes several arguments and you want to take advantage of numpy broadcasting.

The vectorized function evaluates `pyfunc` over successive tuples
of the input arrays like the python map function, except it uses the
broadcasting rules of numpy.
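
To make the quoted sentence concrete, here is a small sketch comparing np.vectorize with an explicit map over broadcast inputs. The function clipped_diff and the sample arrays are made up for illustration.

```python
import numpy as np

def clipped_diff(a, b):
    # Scalar-only logic: the `if` would fail on whole arrays.
    return a - b if a > b else 0

a = np.array([1, 5, 9])[:, None]   # shape (3, 1)
b = np.array([2, 4])               # shape (2,)

# np.vectorize broadcasts a against b and calls clipped_diff per pair.
vec = np.vectorize(clipped_diff)(a, b)   # shape (3, 2)

# The same result from explicit Python-level iteration with map.
a_b, b_b = np.broadcast_arrays(a, b)
looped = np.array(
    list(map(clipped_diff, a_b.ravel(), b_b.ravel()))
).reshape(a_b.shape)

print(np.array_equal(vec, looped))
```

Both versions call the Python function once per element pair, which is why np.vectorize adds convenience but not speed.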

Informally (and without documented justification) we talk about numpy 'vectorization' as a way of making big performance gains. What that really means is moving iterations from the Python level (for loops, list comprehensions) to compiled numpy methods. It's the interpreter vs compiler difference. np.vectorize, despite the name, does not do this.

One way of using numpy methods is:

In [17]: t[:,0]==i[:,None]
Out[17]: 
array([[False, False,  True, False, False],
       [False, False, False,  True,  True],
       [False, False, False, False, False]])

By adding a dimension to i, we can test t[:,0] against all i values at once. Applying where to that gives the indices:

In [19]: np.where(t[:,0]==i[:,None])
Out[19]: (array([0, 1, 1]), array([2, 3, 4]))

and using that to index t:

In [20]: t[_[1],1]
Out[20]: array(['1', '1', '10'], dtype='<U21')

Or we could use the boolean mask in Out[17] row by row:

In [21]: [t[j,1] for j in _17]
Out[21]: 
[array(['1'], dtype='<U21'),
 array(['1', '10'], dtype='<U21'),
 array([], dtype='<U21')]

With the possible mix of array sizes, it is hard to do this without some sort of Python-level iteration. The fast numpy operations work on multidimensional arrays, not on "ragged" ones.

I could be missing something, but the other answers seem to be really overcomplicating this. You just need 'd' to be a keyword argument, as the vectorize method seems to be capable of interfering only with the keywords dictionary.

I was dealing with this problem myself, and switching the argument to a keyword fixed it instantly.
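
One way to keep the comparison itself in compiled code and leave only the final grouping to a helper is sketched below. The variable names are mine, and np.split is a substitute for the list comprehension shown earlier rather than something the original answer used.

```python
import numpy as np

t = np.array([['t1', 0], ['t2', 0], ['t3', 1], ['t4', 1], ['t4', 10]])
i = np.array(['t3', 't4', 't5'])

mask = t[:, 0] == i[:, None]          # shape (len(i), len(t))
key_idx, row_idx = np.nonzero(mask)   # row-major order, so key_idx is sorted
values = t[row_idx, 1]                # all matching values, flat

counts = mask.sum(axis=1)             # number of matches per key
# Cut the flat values into one group per key; keys without a match
# end up with an empty group.
groups = np.split(values, np.cumsum(counts)[:-1])

for key, group in zip(i, groups):
    print(key, group)
```

The flat values plus counts (or key_idx) carry the same information as the ragged list, and they stay regular arrays that further numpy operations can work on.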
