Is there a vectorized way to check whether the ith element of a 1D array is present in the ith element of a 3D array?

I have a 1D array of length k with some arbitrary values and a 3D array of dimensions k * i * j with some data.

import numpy as np

# create 1D and 3D array
values = np.array([2, 5, 1], dtype=int)
arr = np.zeros((3, 4, 4), dtype=int)

# insert some random numbers in the 3D array
arr[0, 3, 2] = 5
arr[1, 1, 1] = 2
arr[2, 2, 3] = 1
>>> print(values)
[2 5 1]

>>> print(arr)
[[[0 0 0 0]
  [0 0 0 0]
  [0 0 0 0]
  [0 0 5 0]]

 [[0 0 0 0]
  [0 2 0 0]
  [0 0 0 0]
  [0 0 0 0]]

 [[0 0 0 0]
  [0 0 0 0]
  [0 0 0 1]
  [0 0 0 0]]]

My goal is to determine whether the i-th element of values (i.e. a scalar) is present in the i-th element of arr (i.e. a 2D array), and to get a boolean array of length k.

In my example, I would expect to get [False, False, True], because 1 is the only value that is present in its corresponding 2D array (arr[2]).

Since the np.isin function is not an option here, I have come up with two possible solutions so far.

1) Create a 3D array by repeating the numbers in values and then do an element-wise comparison:

rep = np.ones(arr.shape) * values.reshape(-1, 1, 1)
>>> print(rep)
[[[2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]
  [2. 2. 2. 2.]]

 [[5. 5. 5. 5.]
  [5. 5. 5. 5.]
  [5. 5. 5. 5.]
  [5. 5. 5. 5.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]

>>> np.any((arr == rep), axis=(1, 2))
array([False, False,  True])

However, this approach seems like a bad idea from a memory perspective if both values and arr are much larger.

2) Iterate over each value in values and check whether it is present in its corresponding 2D array of arr.

result = []
for i, value in enumerate(values):
    result.append(value in arr[i])
>>> print(result)
[False, False, True]

This approach is of course better from a memory perspective but, again, when implemented with bigger arrays it can become time-consuming (think of k being 1000000 instead of 3).

Is there another numpy function I am missing, or perhaps a better approach to accomplish my goal here?

I already took a look at the answers to a similar question but they do not fit my use case.

Using broadcasting might help:

np.any(values[:,None,None] == arr, axis=(1,2))

is a one-liner that gives [False, False, True]. Note that if you're already storing arr, then storing a similar bool array shouldn't be too bad.

Note that it's the values[:,None,None] == arr that's doing the broadcasting; the strange-looking indexing with None is equivalent to your reshape (but feels more idiomatic to me).
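To make the memory point concrete, here is a small sketch on the question's example arrays. The names v3, mask and result are mine, and the byte counts assume the dtypes chosen below; treat it as an illustration rather than a benchmark.

```python
import numpy as np

# Example data from the question (int64 chosen explicitly for this sketch)
values = np.array([2, 5, 1], dtype=np.int64)
arr = np.zeros((3, 4, 4), dtype=np.int64)
arr[0, 3, 2] = 5
arr[1, 1, 1] = 2
arr[2, 2, 3] = 1

# Indexing with None adds length-1 axes: shape (3,) becomes (3, 1, 1).
# This is a view on values, so no data is copied.
v3 = values[:, None, None]

# Broadcasting (3, 1, 1) against (3, 4, 4) allocates only the boolean
# result, one byte per element of arr.
mask = v3 == arr
result = mask.any(axis=(1, 2))

# The question's explicit float array, for comparison: 8 bytes per element.
rep = np.ones(arr.shape) * values.reshape(-1, 1, 1)

print(result)       # [False False  True]
print(mask.nbytes)  # 48
print(rep.nbytes)   # 384
```

So the broadcast comparison still needs a temporary with as many elements as arr, but it is a bool array rather than a float64 copy.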

I have found that your problem is equivalent to

[np.any(arr[i]==values[i]) for i in range(len(values))]

I agree this is time-consuming. Element-wise comparison can't be avoided here, so either np.any(arr[i]==values[i]) or values[i] in arr[i] is a must. As for vectorization, I found it quite difficult to replace the list comprehension used here. This is my approach using np.vectorize:

def myfunc(i):
    return np.any(arr[i] == values[i])
vfunc = np.vectorize(myfunc)
vfunc(np.arange(len(values)))
# output: array([False, False,  True])

You've basically identified the two options:

In [35]: timeit [(i==a).any() for i,a in zip(values, arr)]                                     
29 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [36]: timeit (values[:,None,None]==arr).any(axis=(1,2))                                     
11.4 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In this small case the big-array approach is faster. But for a larger case, the iteration might be better. Memory management with the larger arrays may cancel out the time savings. It's often the case that a few iterations on a complex problem are better than the fully 'vectorized' version.

If it's something you do repeatedly, you could take the time to craft a hybrid solution, one that iterates on blocks. But you'd have to judge that yourself.
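One way such a block-wise hybrid might look is sketched below. The function name any_match_blocked and the block_size parameter are hypothetical, and the default block size is an arbitrary guess that would need tuning against your own data.

```python
import numpy as np

def any_match_blocked(values, arr, block_size=10_000):
    """Hypothetical helper: for each i, test whether values[i] occurs in arr[i].

    Rows are processed block_size at a time, so the temporary boolean
    array never holds more than block_size rows.
    """
    k = values.shape[0]
    if arr.shape[0] != k:
        raise ValueError("values and arr must have the same first dimension")
    flat = arr.reshape(k, -1)  # one row per 2D slice
    out = np.empty(k, dtype=bool)
    for start in range(0, k, block_size):
        stop = min(start + block_size, k)
        # Broadcast (block, 1) against (block, i*j) and reduce per row
        out[start:stop] = (flat[start:stop] == values[start:stop, None]).any(axis=1)
    return out

# Check against the fully broadcast version on random data
rng = np.random.default_rng(0)
values = rng.integers(1, 100, size=1_000)
arr = rng.integers(1, 100, size=(1_000, 5, 5))
blocked = any_match_blocked(values, arr, block_size=128)
full = (values[:, None, None] == arr).any(axis=(1, 2))
assert np.array_equal(blocked, full)
```

With block_size equal to 1 this degenerates into the plain loop, and with block_size at least k it is the single broadcast comparison, so the parameter lets you trade peak memory against Python-level loop overhead.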

isin and related code either 'or' together a series of equality tests, or use sorting of some kind to put like values next to each other for easy comparison.

The other approach is to write a fully iterative solution, and let numba compile it for you.

As hpaulj already mentioned, numba can be an option here.

Example

import numpy as np
import numba as nb

#Turn off parallelization for tiny problems
@nb.njit(parallel=True)
def example(values,arr):
    #Make sure that the first dimension is the same
    assert arr.shape[0]==values.shape[0]
    out=np.empty(values.shape[0],dtype=nb.bool_)

    for i in nb.prange(arr.shape[0]):
        out[i]=False
        for j in range(arr.shape[1]):
            if arr[i,j]==values[i]:
                out[i]=True
                break
    return out
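The kernel above indexes arr[i,j] as a scalar, so it expects a 2D array; that is why the timings pass arr.reshape(arr.shape[0],-1). As a rough illustration, here is the same loop without the numba decorator (the name example_py is mine), so it runs without numba installed, though of course slowly.

```python
import numpy as np

def example_py(values, arr2d):
    # Same logic as the compiled kernel, minus the decorator and prange
    assert arr2d.shape[0] == values.shape[0]
    out = np.zeros(values.shape[0], dtype=np.bool_)
    for i in range(arr2d.shape[0]):
        for j in range(arr2d.shape[1]):
            if arr2d[i, j] == values[i]:
                out[i] = True
                break  # early exit: stop scanning this row at the first match
    return out

# Example data from the question
values = np.array([2, 5, 1])
arr = np.zeros((3, 4, 4), dtype=np.int64)
arr[0, 3, 2] = 5
arr[1, 1, 1] = 2
arr[2, 2, 3] = 1

# Flatten each 2D slice into one row; for a contiguous array this is a view
arr2d = arr.reshape(arr.shape[0], -1)
print(arr2d.shape)                # (3, 16)
print(example_py(values, arr2d))  # [False False  True]
```

The early break is what the broadcast version cannot do: it always compares every element, whereas the loop stops scanning a row as soon as a match is found.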

Timings (small arrays)

#your input data
%timeit example(values,arr.reshape(arr.shape[0],-1))# #parallel=True
#10.7 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit example(values,arr.reshape(arr.shape[0],-1))# #parallel=False
#2.15 µs ± 49.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Methods from other answers
%timeit (values[:,None,None]==arr).any(axis=(1,2))
#9.52 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit [(i==a).any() for i,a in zip(values, arr)]
#23.9 µs ± 435 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Timings (larger arrays)

values=np.random.randint(low=1,high=100_000,size=1_000_000)
arr=np.random.randint(low=1,high=10_00,size=1_000_000*100).reshape(1_000_000,10,10)

%timeit example(values,arr.reshape(arr.shape[0],-1)) #parallel=True
#48.2 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit example(values,arr.reshape(arr.shape[0],-1)) #parallel=False
#90.5 ms ± 618 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

#Methods from other answers
%timeit (values[:,None,None]==arr).any(axis=(1,2))
#186 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit [(i==a).any() for i,a in zip(values, arr)]
#6.63 s ± 69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
