
How to check if all elements of a numpy array are in another numpy array

I have two 2D numpy arrays, for example:

A = numpy.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])

and

B = numpy.array([[1, 2], [32, 32]])

I want all rows of A that contain all elements of any single row of B. If a row of B contains the same element twice, a matching row of A must contain it at least twice as well. For my example, the result should be:

A_filtered = [[1, 2, 4, 8], [16, 32, 32, 8]]

I control how the values are represented, so I chose numbers whose binary representation has a single 1 bit (for example 0b00000001, 0b00000010, etc.). This way I can easily check whether every kind of value is present in a row by using the np.logical_or.reduce() function, but I cannot check that a row of A contains each element at least as many times as the row of B does. I was really hoping to avoid a plain for loop and deep copies of the arrays, as performance is very important to me.
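To make the limitation concrete, here is a minimal sketch of the bit-flag check. It uses np.bitwise_or.reduce rather than the np.logical_or.reduce named above, which is an assumption about what was intended; the variable names are illustrative only.

```python
import numpy as np

# Each value has exactly one bit set, as in the example arrays.
A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
hand = np.array([32, 32])  # one row of B

# OR-ing a row together yields a mask of which values occur in it.
row_masks = np.bitwise_or.reduce(A, axis=1)
hand_mask = np.bitwise_or.reduce(hand)

# A row passes if every bit of the hand mask is set in the row mask.
contains = (row_masks & hand_mask) == hand_mask
print(contains)
```

The third row passes even though it holds only one 32, because the mask records presence and not how often a value occurs.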

How can I do that in numpy in an efficient way?


Update:

A solution from here may work, but performance is a big concern for me: A can be really big (>300000 rows) and B can be moderate (>30 rows):

[set(row).issuperset(hand) for row in A.tolist() for hand in B.tolist()]

Update 2:

The set() solution does not work, since set() drops all duplicated values.
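For reference, a slow but duplicate-aware version of the same idea can be sketched with collections.Counter, which behaves like a multiset. This is only a correctness baseline for small inputs, not a fast solution; contains_hand is a hypothetical helper name.

```python
from collections import Counter

import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])


def contains_hand(row, hand):
    # Counter subtraction keeps only positive counts, so an empty result
    # means the row has at least as many of every value as the hand.
    return not (Counter(hand) - Counter(row))


hands = B.tolist()
# Keep a row if it contains at least one complete row of B.
keep = [any(contains_hand(row, hand) for hand in hands) for row in A.tolist()]
A_filtered = A[np.array(keep)]
print(A_filtered.tolist())
```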

I think this should work:

First, encode the data as follows (this assumes a limited number of 'tokens', as your binary scheme also seems to imply):

Make A of shape [n_rows, n_tokens], dtype int8, where each element counts how many times that token occurs in the row. Encode B in the same way, with shape [n_hands, n_tokens].

This allows for a single vectorized expression of your output: matches = (A[None, :, :] >= B[:, None, :]).all(axis=-1). (Exactly how to map this matches array to your desired output format is left as an exercise to the reader, since the question leaves it undefined for multiple matches.)
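A minimal sketch of this encoding and comparison on the arrays from the question follows. The helper count_tokens and the use of np.unique to derive the token set are assumptions made for illustration, as is reducing matches with any() over the hands to pick the rows to keep.

```python
import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])


def count_tokens(rows, tokens):
    # Map each value to its index in the sorted token list, then count
    # how often each token occurs in each row.
    idx = np.searchsorted(tokens, rows)
    counts = np.zeros((rows.shape[0], tokens.size), dtype=np.int8)
    np.add.at(counts, (np.arange(rows.shape[0])[:, None], idx), 1)
    return counts


# Assumption: the token set is every distinct value seen in A or B.
tokens = np.unique(np.concatenate([A.ravel(), B.ravel()]))
A_counts = count_tokens(A, tokens)  # shape [n_rows, n_tokens]
B_counts = count_tokens(B, tokens)  # shape [n_hands, n_tokens]

# matches[h, r] is True when row r holds at least the counts of hand h.
matches = (A_counts[None, :, :] >= B_counts[:, None, :]).all(axis=-1)

# Assumption: a row is kept when any hand matches it.
A_filtered = A[matches.any(axis=0)]
print(A_filtered.tolist())
```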

But we are talking > 10Mbyte of memory per token here. Even with 32 tokens that should not be unthinkable; but in a situation like this it tends to be better not to vectorize the loop over n_tokens or n_hands, or both. For loops are fine for small n, or if there is enough work to be done in the loop body that the looping overhead is insignificant.
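One way to trade the large temporary for an explicit loop over n_hands might look like the sketch below. It assumes the count tables are already built; filter_rows and the tiny example tables are hypothetical.

```python
import numpy as np


def filter_rows(A_counts, B_counts):
    # Accumulate one boolean per row instead of materialising the full
    # [n_hands, n_rows, n_tokens] comparison at once.
    keep = np.zeros(A_counts.shape[0], dtype=bool)
    for hand_counts in B_counts:  # explicit loop over n_hands
        keep |= (A_counts >= hand_counts).all(axis=1)
    return keep


# Tiny made-up count tables: 3 rows, 2 hands, 3 tokens.
A_counts = np.array([[1, 1, 0], [0, 0, 2], [0, 1, 1]], dtype=np.int8)
B_counts = np.array([[1, 1, 0], [0, 0, 2]], dtype=np.int8)

keep = filter_rows(A_counts, B_counts)
print(keep)
```

Each iteration only allocates an [n_rows, n_tokens] boolean temporary, so peak memory no longer grows with the number of hands.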

As long as n_tokens and n_hands remain moderate, I think this will be the fastest solution if you stay within the realm of pure python and numpy.

I hope I got your question right. At least it works for the problem you described in your question. If the order of the output should stay the same as the input, change the in-place sort (for example, sort a copy instead).

The code looks quite ugly, but it should perform well and shouldn't be too hard to understand.

Code

import time
import numba as nb
import numpy as np

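@nb.njit(fastmath=True, parallel=True)
def filter(A, B):
  # The rows of A and B must already be sorted before calling this function.
  iFilter = np.zeros(A.shape[0], dtype=nb.bool_)

  # Rows of A are independent, so the outer loop runs in parallel.
  for i in nb.prange(A.shape[0]):
    break_loop = False

    for j in range(B.shape[0]):
      # Scan the sorted row of A once, advancing through the sorted row of B on each match.
      ind_to_B = 0
      for k in range(A.shape[1]):
        if A[i, k] == B[j, ind_to_B]:
          ind_to_B += 1

        # Every element of B[j] was found, duplicates included.
        if ind_to_B == B.shape[1]:
          iFilter[i] = True
          break_loop = True
          break

      # One matching row of B is enough; skip the remaining rows of B.
      if break_loop:
        break

  return A[iFilter, :]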
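The scan above works because, once both rows are sorted, a row of B is contained in a row of A (duplicates included) exactly when it appears as a subsequence of it. A plain-Python sketch of the same scan, useful for checking results on small inputs, is shown here; contains_sorted is a hypothetical helper, the input arrays are made up, and copies are sorted so that A keeps its original element order.

```python
import numpy as np


def contains_sorted(row, hand):
    # Both inputs must be sorted ascending and hand must be non-empty.
    # Advance through hand each time its next element is found in row.
    j = 0
    for value in row:
        if value == hand[j]:
            j += 1
            if j == len(hand):
                return True
    return False


A = np.array([[8, 4, 2, 1], [32, 8, 32, 16], [64, 32, 16, 8]])
B = np.array([[2, 1], [32, 32]])

# np.sort returns sorted copies, so A itself is left untouched.
A_sorted = np.sort(A, axis=1)
B_sorted = np.sort(B, axis=1)

mask = np.array([any(contains_sorted(r, h) for h in B_sorted) for r in A_sorted])
A_filtered = A[mask]
print(A_filtered.tolist())
```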

Measuring performance

####First call has some compilation overhead####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)

t1=time.time()
# Sort each row in place first; filter() expects sorted rows
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)

####Let's measure the second call too####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)

t1=time.time()
# Sort each row in place first; filter() expects sorted rows
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)

Results

46ms after the first run on a dual-core Notebook (sorting included)
32ms (sorting excluded)
