Remove elements from one array if present in another array, keep duplicates - NumPy / Python

I have two arrays, A (length 3.8 million) and B (length 20k). For a minimal example, let's take this case:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8])
B = np.array([1,2,8])

Now I want the resulting array to be:

C = np.array([3,3,3,4,5,6,7])

That is, if any value in B is found in A, remove it from A; otherwise keep it.

I would like to know if there is a way to do this without a for loop, because the array is long and looping over it takes a long time.

Using searchsorted

With sorted B, we can use searchsorted:

A[B[np.searchsorted(B,A)] !=  A]

From the linked docs, searchsorted(a, v) finds the indices into a sorted array a such that, if the corresponding elements in v were inserted before those indices, the order of a would be preserved. So, say idx = searchsorted(B, A), and we index into B with it: B[idx]. That gives a mapped version of B with one entry for every element in A. Comparing this mapped version against A tells us, for every element in A, whether there is a match in B. Finally, index into A with that mask to select the non-matching elements.
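To make the mapping concrete, here is a small walk-through on the example arrays from the question. The intermediate names idx, mapped and mask are only for illustration, and the sketch assumes B is sorted and that no value in A exceeds the largest value in B.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])  # must already be sorted

# For each element of A, the leftmost position in B where it could be inserted
idx = np.searchsorted(B, A)
# idx    -> [0 0 1 2 2 2 2 2 2 2 2 2]

# The element of B sitting at each of those positions
mapped = B[idx]
# mapped -> [1 1 2 8 8 8 8 8 8 8 8 8]

# True where the element of A is absent from B
mask = mapped != A
C = A[mask]
print(C)  # [3 3 3 4 5 6 7]
```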

Generic case (B is not sorted):

If B is not already sorted, as this method requires, sort it first and then use the proposed method.

Alternatively, we can use the sorter argument of searchsorted:

sidx = B.argsort()
out = A[B[sidx[np.searchsorted(B,A,sorter=sidx)]] != A]

More generic case (A has values higher than the ones in B):

sidx = B.argsort()
idx = np.searchsorted(B,A,sorter=sidx)
idx[idx==len(B)] = 0
out = A[B[sidx[idx]] != A]
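As a sketch, the steps above can be wrapped in a small helper. The function name remove_present and the empty-B guard are my own additions for illustration, not part of the original answer. Resetting the out-of-range indices to 0 is safe because an element larger than everything in B can never equal the smallest value of B.

```python
import numpy as np

def remove_present(A, B):
    """Return the elements of A not found in B, keeping order and duplicates.

    Hypothetical helper name; neither A nor B needs to be sorted.
    """
    A = np.asarray(A)
    B = np.asarray(B)
    if B.size == 0:  # assumption: nothing to remove when B is empty
        return A.copy()
    sidx = B.argsort()
    idx = np.searchsorted(B, A, sorter=sidx)
    # Elements of A above max(B) get index len(B); point them at the minimum of B
    idx[idx == len(B)] = 0
    return A[B[sidx[idx]] != A]

A = np.array([10, 3, 8, 1, 3, 12, 2, 1])
B = np.array([8, 1, 2])
print(remove_present(A, B))  # [10  3  3 12]
```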

Using in1d/isin

We can also use np.in1d, which is pretty straightforward (the linked docs should help clarify): it looks for a match in B for every element in A, and we can then use boolean indexing with an inverted mask to select the non-matching ones:

A[~np.in1d(A,B)]

Same with isin:

A[~np.isin(A,B)]

With the invert flag:

A[np.in1d(A,B,invert=True)]

A[np.isin(A,B,invert=True)]

This handles the generic case where B is not necessarily sorted.
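A quick check of the mask-based approach on unsorted B, with values in A above the maximum of B. This sketch uses np.isin only; to my knowledge np.in1d is deprecated in NumPy 2.0 in favor of np.isin, so that is probably the safer choice in new code.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([8, 1, 2])  # deliberately unsorted

# True where the element of A occurs anywhere in B
mask = np.isin(A, B)

# Invert the mask to keep only the elements that are absent from B
C = A[~mask]
print(C)  # [ 3  3  3  4  5  6  7 10 12 14]

# The invert flag gives the same result without the explicit ~
assert np.array_equal(C, A[np.isin(A, B, invert=True)])
```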

I am not very familiar with numpy, but how about using sets:

C = set(A.flat) - set(B.flat)

EDIT: as pointed out in the comments, sets cannot hold duplicate values, so this drops the duplicates in A.

So another solution would be to use a lambda expression:

C = np.array(list(filter(lambda x: x not in B, A)))
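In the filter above, each `x not in B` test scans the whole array B. A possible variation, sketched here under the assumption that the values are hashable Python scalars, is to turn only B into a set: lookups become constant time on average, and the duplicates in A survive because A itself is never converted to a set. The name b_values is just for illustration. This is still a Python-level loop, so on millions of elements the vectorized NumPy approaches are likely faster.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])

# Only B is deduplicated; A keeps its duplicates and its order
b_values = set(B.tolist())
C = np.array([x for x in A.tolist() if x not in b_values])
print(C)  # [3 3 3 4 5 6 7]
```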

Adding to Divakar's answer above:

If the original array A has a wider range than B, that will give you an 'index out of bounds' error. See:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])

A[B[np.searchsorted(B,A)] !=  A]
>> IndexError: index 3 is out of bounds for axis 0 with size 3

This happens because, in this example, np.searchsorted assigns index 3 (one past the last index in B) as the insertion position for the elements 10, 12 and 14 from A. Thus you get an IndexError in B[np.searchsorted(B,A)].

To circumvent that, a possible approach is:
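The out-of-range index is easy to see in isolation. A minimal sketch, using a few values picked from the example above:

```python
import numpy as np

B = np.array([1, 2, 8])

# 8 is present in B; 10, 12 and 14 are larger than every element of B
idx = np.searchsorted(B, np.array([8, 10, 12, 14]))
print(idx)            # [2 3 3 3]
print(idx == len(B))  # [False  True  True  True]
```

Any position equal to len(B) is not a valid index into B, which is what triggers the IndexError.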

def subset_sorted_array(A,B):
    # Assumes A and B are both sorted in ascending order
    Aa = A[np.where(A <= np.max(B))]
    Bb = (B[np.searchsorted(B,Aa)] !=  Aa)
    Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
    return A[Bb]

Which works as follows:

# Take only the elements in A that fall within B's range (A <= max(B))
Aa = A[np.where(A <= np.max(B))]

# Build the filter for those elements, then pad it with True for the
# remaining (larger) elements of A; split in two operations for easier reading
Bb = (B[np.searchsorted(B,Aa)] !=  Aa)
Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)

# Then you can filter A by Bb
A[Bb]
# For the input arrays above:
>> array([ 3,  3,  3,  4,  5,  6,  7, 10, 12, 14])

Note that this also works between arrays of strings and other types (any type for which the <= comparison operator is defined).
