Remove elements from one array if present in another array, keep duplicates - NumPy / Python
I have two arrays, A (length 3.8 million) and B (length 20k). For a minimal example, let's take this case:
A = np.array([1,1,2,3,3,3,4,5,6,7,8,8])
B = np.array([1,2,8])
Now I want the resulting array to be:
C = np.array([3,3,3,4,5,6,7])
i.e. if any value in B is found in A, remove it from A; if not, keep it.
I would like to know if there is a way to do this without a for loop, because the array is long and looping over it takes a long time.
searchsorted
With a sorted B, we can use searchsorted:
A[B[np.searchsorted(B,A)] != A]
From the linked docs, searchsorted(a,v) finds the indices into a sorted array a such that, if the corresponding elements in v were inserted before those indices, the order of a would be preserved. So, say idx = searchsorted(B,A) and we index into B with it: B[idx] gives a mapped version of B corresponding to every element in A. Comparing this mapped version against A tells us, for every element in A, whether there is a match in B. Finally, we index into A with the resulting mask to select the non-matching elements.
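To make the mapping concrete, here is a small sketch that runs those steps one at a time on the arrays from the question. The intermediate names mapped and mask are only illustrative and do not appear in the answer.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])  # must already be sorted for searchsorted

# Leftmost insertion position in B for every element of A
idx = np.searchsorted(B, A)

# The element of B found at each of those positions
mapped = B[idx]

# True where the element of A has no equal value in B
mask = mapped != A

C = A[mask]
print(idx)     # [0 0 1 2 2 2 2 2 2 2 2 2]
print(mapped)  # [1 1 2 8 8 8 8 8 8 8 8 8]
print(C)       # [3 3 3 4 5 6 7]
```

This only works here because no value in A exceeds the largest value in B; otherwise idx would contain len(B) and B[idx] would fail.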
Generic case (B is not sorted):
If B is not already sorted, which is the prerequisite, sort it and then use the proposed method.
Alternatively, we can use the sorter argument with searchsorted:
sidx = B.argsort()
out = A[B[sidx[np.searchsorted(B,A,sorter=sidx)]] != A]
More generic case (A has values higher than the ones in B):
sidx = B.argsort()
idx = np.searchsorted(B,A,sorter=sidx)
idx[idx==len(B)] = 0
out = A[B[sidx[idx]] != A]
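As a rough sketch, the same steps can be wrapped in a function and tried on an unsorted B together with an A that contains values above max(B). The name remove_present and the empty-B guard are my own additions for illustration, not part of the answer.

```python
import numpy as np

def remove_present(A, B):
    """Return the elements of A that do not occur in B, keeping duplicates.

    Hypothetical helper name; B may be unsorted.
    """
    A = np.asarray(A)
    B = np.asarray(B)
    if B.size == 0:
        # Assumption: with nothing to remove, return A unchanged
        return A.copy()
    sidx = B.argsort()
    idx = np.searchsorted(B, A, sorter=sidx)
    # Positions past the end belong to values larger than max(B);
    # point them at the smallest element of B, which cannot match
    idx[idx == len(B)] = 0
    return A[B[sidx[idx]] != A]

A = np.array([10, 1, 1, 2, 3, 3, 12, 8, 5])
B = np.array([8, 1, 2])
out = remove_present(A, B)
print(out)  # [10  3  3 12  5]
```

The order of A and its duplicates are preserved; only the values 1, 2 and 8 are dropped.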
in1d/isin
We can also use np.in1d, which is pretty straightforward (the linked docs should help clarify): it looks for a match in B for every element in A, and we can then use boolean indexing with an inverted mask to select the non-matching ones:
A[~np.in1d(A,B)]
Same with isin:
A[~np.isin(A,B)]
With the invert flag:
A[np.in1d(A,B,invert=True)]
A[np.isin(A,B,invert=True)]
This handles the generic case, where B is not necessarily sorted.
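A short sketch of the isin variant on inputs where B is unsorted and A contains values above max(B). As far as I know, recent NumPy releases favour np.isin over np.in1d, so the sketch sticks to isin; the variable name keep is only illustrative.

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8, 10, 12, 14])
B = np.array([8, 1, 2])  # deliberately unsorted

# True where the element of A is absent from B
keep = np.isin(A, B, invert=True)

C = A[keep]
print(C)  # [ 3  3  3  4  5  6  7 10 12 14]
```

No sorting or index clamping is needed here, since isin handles both internally.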
I am not very familiar with numpy, but how about using sets:
C = set(A.flat) - set(B.flat)
EDIT: from the comments, sets cannot hold duplicate values, so the set difference above drops the duplicates in A.
So another solution would be to use a lambda expression:
C = np.array(list(filter(lambda x: x not in B, A)))
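The `x not in B` test scans the whole of B for every element of A, which may be slow at the sizes in the question. One possible variation, sketched here as my own assumption rather than something from the answer, keeps the duplicates in A but does the lookup against a Python set built from B:

```python
import numpy as np

A = np.array([1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 8])
B = np.array([1, 2, 8])

# Hash set of B's values: membership tests take constant time on average
b_values = set(B.tolist())

# Still a Python-level loop over A, but duplicates in A are preserved
C = np.array([x for x in A.tolist() if x not in b_values])
print(C)  # [3 3 3 4 5 6 7]
```

This is still a Python-level loop, so the vectorized NumPy approaches are likely to be faster for millions of elements.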
Adding to Divakar's answer above: if the original array A has a wider range than B (that is, it contains values above the maximum of B), the first searchsorted one-liner will give you an 'index out of bounds' error. See:
A = np.array([1,1,2,3,3,3,4,5,6,7,8,8,10,12,14])
B = np.array([1,2,8])
A[B[np.searchsorted(B,A)] != A]
>> IndexError: index 3 is out of bounds for axis 0 with size 3
This happens because np.searchsorted assigns index 3 (one past the last index in B) as the position for inserting into B the elements 10, 12 and 14 from A in this example. Thus you get an IndexError in B[np.searchsorted(B,A)].
To circumvent that, a possible approach is:
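def subset_sorted_array(A,B):
    # Assumes A and B are both sorted, so values above max(B) sit at the end of A
    Aa = A[np.where(A <= np.max(B))]
    Bb = (B[np.searchsorted(B,Aa)] != Aa)
    # np.pad takes the keyword 'mode' (not 'method')
    Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
    return A[Bb]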
Which works as follows:
# Take only the elements in A that would be inserted in B
Aa = A[np.where(A <= np.max(B))]
# Pad the resulting filter with 'Trues' - I split this in two operations for
# easier reading
Bb = (B[np.searchsorted(B,Aa)] != Aa)
Bb = np.pad(Bb,(0,A.shape[0]-Aa.shape[0]), mode='constant', constant_values=True)
# Then you can filter A by Bb
A[Bb]
# For the input arrays above:
>> array([ 3, 3, 3, 4, 5, 6, 7, 10, 12, 14])
Notice this will also work between arrays of strings and other types (for all types for which the comparison operator <= is defined).
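The padding step appends the True values at the end of the mask, so it relies on every value above max(B) sitting at the end of A, which holds when A is sorted. For an unsorted A, one possible variation (a sketch under my own assumptions; the helper name drop_values_sorted_b is hypothetical) clamps the out-of-range indices instead of padding:

```python
import numpy as np

def drop_values_sorted_b(A, B):
    """Drop from A every value found in B, keeping order and duplicates.

    Hypothetical helper; assumes B is sorted and non-empty.
    A may be in any order.
    """
    idx = np.searchsorted(B, A)
    # Values above max(B) get index len(B); clamp them to the last slot,
    # where B holds max(B), which cannot equal such a value
    idx = np.clip(idx, 0, len(B) - 1)
    return A[B[idx] != A]

A = np.array([14, 1, 3, 10, 8, 3, 2, 12])
B = np.array([1, 2, 8])
out = drop_values_sorted_b(A, B)
print(out)  # [14  3 10  3 12]
```

Clamping keeps the mask the same length as A from the start, so no separate padding step is needed.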