Remove elements from one array if present in another array, keep duplicates - NumPy / Python

Question

I have two arrays, A (length 3.8 million) and B (length 20k). As a minimal example, take this case:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8])
B = np.array([1,2,8])

Now I want the resulting array to be:

C = np.array([3,3,3,4,5,6,7])

That is, if a value in B is found in A, remove it from A; otherwise keep it.

I would like to know whether there is a way to do this without a for loop, because the array is long and looping over it takes a long time.

Answer 1

Using `searchsorted`

With a sorted B, we can use searchsorted -

A[B[np.searchsorted(B,A)] !=  A]

From the linked docs, searchsorted(a,v) finds the indices into a sorted array a such that, if the corresponding elements in v were inserted before those indices, the order of a would be preserved. So, say idx = searchsorted(B,A), and we index into B with it: B[idx] gives a mapped version of B with one entry for every element in A. Comparing this mapped version against A then tells us, for every element in A, whether there is a match in B. Finally, index into A with the resulting mask to select the non-matching elements.
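The mapping described above can be traced step by step on the question's minimal example. This is only an illustrative sketch; the intermediate names idx, mapped and keep are not part of the original answer, and it relies on B being sorted and on no value in A exceeding the largest value in B.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])  # must already be sorted

# Insertion position in B for every element of A
idx = np.searchsorted(B, A)

# "Mapped" version of B: the B value sitting at each insertion position
mapped = B[idx]

# True where the element of A has no equal partner in B
keep = mapped != A

C = A[keep]
print(C)
```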

Generic case (B is not sorted):

If B is not already sorted, as the method requires, sort it first and then use the proposed method.

Alternatively, we can use the sorter argument of searchsorted -

sidx = B.argsort()
out = A[B[sidx[np.searchsorted(B,A,sorter=sidx)]] != A]

More generic case (A has values higher than the ones in B):

sidx = B.argsort()
idx = np.searchsorted(B,A,sorter=sidx)
idx[idx==len(B)] = 0
out = A[B[sidx[idx]] != A]
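As a sketch, the more generic variant can be wrapped in a small function and tried on an unsorted B together with values of A larger than anything in B. The function name remove_matches is hypothetical, and the sketch assumes B is non-empty.

```python
import numpy as np

def remove_matches(A, B):
    """Return the elements of A that do not occur in B, keeping duplicates.

    Hypothetical helper wrapping the snippet above; assumes B is non-empty.
    """
    sidx = B.argsort()                        # order that would sort B
    idx = np.searchsorted(B, A, sorter=sidx)  # positions in the sorted view of B
    idx[idx == len(B)] = 0                    # values above max(B): point at a valid slot
    return A[B[sidx[idx]] != A]

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([8, 1, 2])  # deliberately unsorted

out = remove_matches(A, B)
print(out)
```

Resetting the out-of-range indices to 0 is safe because those elements of A are larger than every value in B, so the comparison against the smallest value of B still reports a mismatch.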

Using `in1d/isin`

We can also use np.in1d, which is pretty straightforward (the linked docs should help clarify): it looks for a match in B for every element in A, and we can then use boolean indexing with an inverted mask to select the non-matching ones -

A[~np.in1d(A,B)]

Same with isin -

A[~np.isin(A,B)]

With the invert flag -

A[np.in1d(A,B,invert=True)]

A[np.isin(A,B,invert=True)]

This handles the generic case, where B is not necessarily sorted.
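A short, self-contained run of the isin variant, as a sketch on made-up inputs with an unsorted B and values of A outside the range of B. Only np.isin is used here because, as far as I know, recent NumPy releases recommend it over np.in1d for new code.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([8, 1, 2])  # order of B does not matter for isin

# Boolean mask: True where the element of A occurs somewhere in B
in_B = np.isin(A, B)

# Two equivalent ways to keep the non-matching elements
C1 = A[~in_B]
C2 = A[np.isin(A, B, invert=True)]

print(C1)
```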

Answer 2

I am not very familiar with numpy, but how about using sets:

C = set(A.flat) - set(B.flat)

EDIT: from the comments, sets cannot hold duplicate values, so this approach drops the duplicates in A.

So another solution would be to use filter with a lambda expression:

C = np.array(list(filter(lambda x: x not in B, A)))
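A possible refinement of the filter idea, sketched here as an assumption rather than as part of the original answer: testing membership against a Python set built from B avoids scanning the whole of B for every element of A, while still keeping the duplicates in A. The name b_set is illustrative. It remains a Python-level loop, so the vectorized NumPy approaches are likely to be faster on arrays of the size mentioned in the question.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])

# Build the lookup set once; membership tests on a set are hash-based
b_set = set(B.tolist())

# Keep every element of A (duplicates included) that is not in B
C = np.array([x for x in A.tolist() if x not in b_set])
print(C)
```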

Answer 3

Adding to Divakar's answer above -

If the original array A has a wider range than B, that will give you an 'index out of bounds' error. See:

A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])

A[B[np.searchsorted(B,A)] !=  A]
>> IndexError: index 3 is out of bounds for axis 0 with size 3

This happens because, in this example, np.searchsorted assigns index 3 (one past the last index of B) as the insertion position in B for the elements 10, 12 and 14 from A. Thus you get an IndexError in B[np.searchsorted(B,A)].
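To make the out-of-range positions visible, the indices returned by np.searchsorted can be inspected directly before they are used to index B. A small sketch on the arrays above; the name out_of_range is illustrative.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([1, 2, 8])

idx = np.searchsorted(B, A)
print(idx)

# Positions equal to len(B) are not valid indices into B
out_of_range = idx == len(B)
print(A[out_of_range])  # the elements of A that trigger the IndexError
```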

To circumvent that, a possible approach is:

def subset_sorted_array(A,B):
    Aa = A[np.where(A <= np.max(B))]
    Bb = (B[np.searchsorted(B,Aa)] !=  Aa)
    Bb = np.pad(Bb, (0, A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
    return A[Bb]

Which works as follows:

# Take only the elements in A that would be inserted in B
Aa = A[np.where(A <= np.max(B))]

# Pad the resulting filter with 'Trues' - I split this in two operations for
# easier reading
Bb = (B[np.searchsorted(B,Aa)] !=  Aa)
Bb = np.pad(Bb, (0, A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)

# Then you can filter A by Bb
A[Bb]
# For the input arrays above:
>> array([ 3,  3,  3,  4,  5,  6,  7, 10, 12, 14])

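Notice this will also work between arrays of strings and other types (for all types for which the comparison operator <= is defined). Because the padding is appended at the end of the mask, the approach relies on A being sorted in ascending order, as it is in the example.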
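If A is not sorted, one possible variant writes the comparison result into a full-length mask instead of padding at the end. This is a sketch under stated assumptions, not the answer's own method: the function name remove_matches_unsorted is hypothetical, B is assumed to be non-empty, and B is sorted inside the function.

```python
import numpy as np

def remove_matches_unsorted(A, B):
    """Hypothetical variant: A may be in any order; B must be non-empty."""
    B = np.sort(B)
    keep = np.ones(A.shape[0], dtype=bool)  # default: keep every element
    in_range = A <= B[-1]                   # only these can possibly match B
    Aa = A[in_range]
    # For in-range elements, keep them only if they differ from their B slot
    keep[in_range] = B[np.searchsorted(B, Aa)] != Aa
    return A[keep]

A = np.array([14, 1, 3, 8, 10, 2, 3])
B = np.array([8, 1, 2])

out = remove_matches_unsorted(A, B)
print(out)
```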

3 answers

Answer 1: 14, accepted, 2018-09-20 05:09:31
Answer 2: 3, 2018-09-20 05:11:40
Answer 3: 1, 2020-08-07 17:28:28