Fastest way to filter lists of lists based on a third list?
I have a list A like the following:
A = np.array([[1,2] ,
[2,4] ,
[3,4] ,
[4,5] ,
[6,7]])
and I need to remove all sublists containing any of the elements in a third list B.
So if, for example:
B = [1,2,5]
The expected result would be:
np.array([[3,4] ,
[6,7]])
The length of A gets up to 1,500,000 and B is also often in the tens of thousands of elements, so performance is critical. The length of the sublists of A is always 2.
All approaches presented here are based on NumPy's boolean indexing. The approach is to identify matches (independent of row) and then use a reduction (np.any or np.all) along the rows to see which rows should be eliminated and which should be kept. Finally, this mask is applied to your array A to get only the valid rows. The only real difference between the approaches is how you create the mask.
If the values of B are known in advance you generally use | (the "or" operator) to chain comparisons.
a[~np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)]
I'll go through this step-by-step:
Finding matches:
>>> ((a == 1) | (a == 2) | (a == 5))
array([[ True,  True],
       [ True, False],
       [False, False],
       [False,  True],
       [False, False]], dtype=bool)
Check each row for at least one True:
>>> np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)
array([ True,  True, False,  True, False], dtype=bool)
Invert it:
>>> ~np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)
array([False, False,  True, False,  True], dtype=bool)
Use boolean indexing:
>>> a[~np.any(((a == 1) | (a == 2) | (a == 5)), axis=1)]
array([[3, 4],
       [6, 7]])
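If you want to keep the chained-comparison idea but not type out every value, one possible sketch is to build the comparisons in a loop and combine them with np.logical_or.reduce. This is my own illustration, not part of the timings in this answer, and it still does one full pass over a per element of b, so I would only expect it to be reasonable for a handful of values:

```python
import numpy as np

a = np.array([[1, 2],
              [2, 4],
              [3, 4],
              [4, 5],
              [6, 7]])
b = [1, 2, 5]

# One boolean array per value in b, OR-ed together element-wise.
mask = np.logical_or.reduce([a == value for value in b])

# Keep only the rows that contain no match at all.
result = a[~mask.any(axis=1)]
print(result)
```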
Instead of these (a == 1) | (a == 2) | ... comparisons you could also use np.in1d:
>>> np.in1d(a, [1, 2, 5]).reshape(a.shape)
array([[ True, True],
[ True, False],
[False, False],
[False, True],
[False, False]], dtype=bool)
and then use essentially the same approach as above:
>>> a[~np.any(np.in1d(a, [1, 2, 5]).reshape(a.shape), axis=1)]
array([[3, 4],
[6, 7]])
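A side note for newer NumPy versions: np.isin (available since NumPy 1.13) accepts an N-dimensional first argument and returns a mask of the same shape, so the reshape is not needed. As far as I know np.in1d is deprecated in NumPy 2.0 in favor of np.isin, so this sketch may be the more future-proof spelling:

```python
import numpy as np

a = np.array([[1, 2],
              [2, 4],
              [3, 4],
              [4, 5],
              [6, 7]])
b = [1, 2, 5]

# np.isin keeps the shape of its first argument, no reshape required.
mask = np.isin(a, b)

result = a[~mask.any(axis=1)]
print(result)
```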
In case b is sorted you can also use np.searchsorted to create the mask:
>>> np.searchsorted([1, 2, 5], a, side='left') == np.searchsorted([1, 2, 5], a, side='right')
array([[False, False],
[False, True],
[ True, True],
[ True, False],
[ True, True]], dtype=bool)
This time you'd need to check if all values in each row are True:
>>> b = [1, 2, 5]
>>> a[np.all(np.searchsorted(b, a, side='left') == np.searchsorted(b, a, side='right'), axis=1)]
array([[3, 4],
[6, 7]])
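Why comparing the two searchsorted calls works: for a value that is present in b, side='left' returns the position before the existing entry and side='right' the position after it, so the two differ. For a value that is absent, both return the same insertion point. A tiny illustration (the variable names are just for this example):

```python
import numpy as np

b = np.array([1, 2, 5])      # must be sorted
values = np.array([2, 3])    # 2 is in b, 3 is not

left = np.searchsorted(b, values, side='left')
right = np.searchsorted(b, values, side='right')

# Equal insertion points mean the value does not occur in b.
absent = left == right
print(left, right, absent)
```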
The first approach isn't exactly suitable for arbitrary B so I don't include it in these timings.
import numpy as np

def setapproach(A, B):  # author: Max Chrétien
    B = set(B)
    indices_to_del = [i for i, sublist in enumerate(A) if B & set(sublist)]
    C = np.delete(A, indices_to_del, 0)
    return C

def setapproach2(A, B):  # author: Max Chrétien & Ev. Kounis
    B = set(B)
    return np.array([sublist for sublist in A if not B & set(sublist)])

def isinapproach(a, b):
    return a[~np.any(np.in1d(a, b).reshape(a.shape), axis=1)]

def searchsortedapproach(a, b):
    b.sort()  # sorts b in place
    return a[np.all(np.searchsorted(b, a, side='left') == np.searchsorted(b, a, side='right'), axis=1)]

A = np.random.randint(0, 10000, (100000, 2))
B = np.random.randint(0, 10000, 2000)
%timeit setapproach(A, B)
# 929 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit setapproach2(A, B)
# 1.04 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit isinapproach(A, B)
# 59.1 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit searchsortedapproach(A, B)
# 56.1 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The timings, however, depend on the range of values, on whether B is already sorted, and on the lengths of A and B. But the NumPy approaches seem to be almost 20 times faster than the set solutions. However, the difference is mostly because iterating over NumPy arrays with Python loops is really inefficient, so I'll convert A to a list with tolist first (B is turned into a set anyway):
def setapproach_updated(A, B):
    B = set(B)
    indices_to_del = [i for i, sublist in enumerate(A.tolist()) if B & set(sublist)]
    C = np.delete(A, indices_to_del, 0)
    return C

def setapproach2_updated(A, B):
    B = set(B)
    return np.array([sublist for sublist in A.tolist() if not B & set(sublist)])

That may seem strange but let's redo the timings:
%timeit setapproach_updated(A, B)
# 300 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit setapproach2_updated(A, B)
# 378 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is much faster than the plain loops, just by converting the array with tolist first, but still 5+ times slower than the NumPy approaches.
So remember: when you have to use Python-based approaches on NumPy arrays, check if it is faster to convert the array to a list first!
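To see what changes with tolist, it helps to look at what the loop actually receives. Iterating over a 2D array hands out one small ndarray per row, while tolist builds nested lists of plain Python ints in a single pass. I assume this per-row object overhead is the main cost here, but I haven't profiled it; the sketch below only shows the types and that both variants keep the same rows:

```python
import numpy as np

A = np.array([[1, 2],
              [2, 4],
              [3, 4],
              [4, 5],
              [6, 7]])
B = {1, 2, 5}

# Looping over the array yields an ndarray for every row ...
row_from_array = next(iter(A))
# ... whereas tolist() gives ordinary lists of Python ints.
row_from_list = A.tolist()[0]
print(type(row_from_array), type(row_from_list), type(row_from_list[0]))

# Both loops keep the same rows; only the objects being looped over differ.
kept_via_array = [row.tolist() for row in A if not B & set(row)]
kept_via_list = [row for row in A.tolist() if not B & set(row)]
print(kept_via_array == kept_via_list)
```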
Let's see how that performs on bigger arrays (these are sizes that approximate those mentioned in your question):
A = np.random.randint(0, 10000000, (1500000, 2))
B = np.random.randint(0, 10000000, 50000)
%timeit setapproach_updated(A, B)
# 4.14 s ± 66.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit setapproach2_updated(A, B)
# 6.33 s ± 95.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit isinapproach(A, B)
# 2.39 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit searchsortedapproach(A, B)
# 1.34 s ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The differences got smaller and the searchsorted approach definitely wins.
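Before trusting timings like these it is worth checking that the approaches really return the same rows. The sketch below is not part of the measurements above. The function names are made up for this check, it uses np.isin instead of np.in1d, and it needs np.random.default_rng (NumPy 1.17 or newer). The searchsorted variant sorts a copy of b, because b.sort() would reorder the caller's array in place:

```python
import numpy as np

def isin_filter(a, b):
    return a[~np.isin(a, b).any(axis=1)]

def searchsorted_filter(a, b):
    b = np.sort(b)  # sorted copy, the caller's b stays untouched
    left = np.searchsorted(b, a, side='left')
    right = np.searchsorted(b, a, side='right')
    return a[np.all(left == right, axis=1)]

rng = np.random.default_rng(0)
A = rng.integers(0, 10000, (100000, 2))
B = rng.integers(0, 10000, 2000)
B_before = B.copy()

same = np.array_equal(isin_filter(A, B), searchsorted_filter(A, B))
print(same)
```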
I'm not finished yet! Let me surprise you with numba. It's not a lightweight package, but it's extremely powerful if it supports the types and functions you need:
import numba as nb

@nb.njit  # the magic is this decorator
def numba_approach(A, B):
    Bset = set(B)
    mask = np.ones(A.shape[0], dtype=nb.bool_)
    for idx in range(A.shape[0]):
        for item in A[idx]:
            if item in Bset:
                mask[idx] = False
                break
    return A[mask]

Let's see how that performs:
A = np.random.randint(0, 10000, (100000, 2))
B = np.random.randint(0, 10000, 2000)
numba_approach(A, B) # numba needs a warmup run because it's just-in-time compiling
%timeit numba_approach(A, B)
# 6.12 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# This is 10 times faster than the fastest other approach!
A = np.random.randint(0, 10000000, (1500000, 2))
B = np.random.randint(0, 10000000, 50000)
%timeit numba_approach(A, B)
# 286 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This is still 4 times faster than the fastest other approach!
So, you can make it another order of magnitude faster. Numba doesn't support all Python/NumPy features (and not all of them are faster) but in this case it's enough!
Use set intersection to build a list of the indices of the sublists that contain any element of [1, 2, 5]. Then, with that list of indices to remove, use the np.delete() function built into NumPy.
import numpy as np
A = np.array([[1,2],
[2,4],
[3,4],
[4,5],
[6,7]])
B = set([1, 2, 5])
indices_to_del = [i for i, sublist in enumerate(A) if B & set(sublist)]
C = np.delete(A, indices_to_del, 0)
print(C)
#[[3 4]
# [6 7]]
EDIT
Thanks to @MSeifert I was able to improve my answer.
@Ev.Kounis proposed another similar, but faster solution:
D = np.array([sublist for sublist in A if not B & set(sublist)])