Efficiently check if an element occurs at least n times in a list
How best to write a Python function (check_list) to efficiently test if an element (x) occurs at least n times in a list (l)?
My first thought was:
def check_list(l, x, n):
    return l.count(x) >= n
But this doesn't short-circuit once x has been found n times: it always scans the whole list, so it is always O(len(l)).
A simple approach that does short-circuit would be:
def check_list(l, x, n):
    count = 0
    for item in l:
        if item == x:
            count += 1
            if count == n:
                return True
    return False
I also have a more compact short-circuiting solution with a generator:
def check_list(l, x, n):
    gen = (1 for item in l if item == x)
    return all(next(gen, 0) for i in range(n))
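As a quick sanity check, the three variants above can be run side by side on the same input. This is only a sketch: the names check_count, check_loop and check_gen are made up here so that all three can live in one file, and the sample list is arbitrary. Note that the variants only disagree for n = 0, where the explicit loop returns False.

```python
def check_count(l, x, n):
    # Always scans the whole list.
    return l.count(x) >= n

def check_loop(l, x, n):
    # Stops as soon as the n-th match is seen.
    count = 0
    for item in l:
        if item == x:
            count += 1
            if count == n:
                return True
    return False

def check_gen(l, x, n):
    # Pulls at most n matches from a lazy generator.
    gen = (1 for item in l if item == x)
    return all(next(gen, 0) for _ in range(n))

data = [1, 3, 2, 3, 4, 0, 8, 3, 7, 3, 1, 1, 0]  # 3 occurs four times

for n in (1, 4, 5):
    results = [f(data, 3, n) for f in (check_count, check_loop, check_gen)]
    print(n, results)

# Edge case: n = 0 is trivially satisfied, but the loop version says False.
print(check_count(data, 3, 0), check_loop(data, 3, 0), check_gen(data, 3, 0))
```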
Are there other good solutions? What is the most efficient approach?
Thank you
Instead of incurring extra overhead with the setup of a range object and using all, which has to test the truthiness of each item, you could use itertools.islice to advance the generator n - 1 steps ahead, and then return the next item in the slice if it exists, or a default of False if not:
from itertools import islice
def check_list(lst, x, n):
    gen = (True for i in lst if i == x)
    return next(islice(gen, n - 1, None), False)
Note that, like list.count, itertools.islice also runs at C speed. And this has the extra advantage of handling iterables that are not lists.
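To illustrate the point about non-list iterables, here is a minimal sketch that feeds the islice version an endless generator and plain iterators. The digits generator is a hypothetical source invented for this example, and the function assumes n >= 1 (islice rejects a negative start).

```python
from itertools import count, islice

def check_list(iterable, x, n):
    # One True per match, produced lazily.
    matches = (True for item in iterable if item == x)
    # Skip the first n - 1 matches; the n-th one (if any) is returned.
    return next(islice(matches, n - 1, None), False)

def digits():
    """Hypothetical endless source: 0, 1, ..., 9, 0, 1, ..."""
    for i in count():
        yield i % 10

print(check_list(digits(), 7, 3))          # True (stops at the third 7)
print(check_list(iter("banana"), "a", 3))  # True
print(check_list(iter("banana"), "a", 4))  # False
```

Because the search stops at the n-th match, the call on the endless generator terminates. A value that never occurs in an endless source would of course loop forever.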
Some timing:
In [1]: from itertools import islice
In [2]: from random import randrange
In [3]: lst = [randrange(1,10) for i in range(100000)]
In [5]: %%timeit # using list.index
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 736 µs per loop
In [7]: %%timeit # islice
....: check_list(lst, 5, 1000)
....:
1000 loops, best of 3: 662 µs per loop
In [9]: %%timeit # using list.index
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 7.6 ms per loop
In [11]: %%timeit # islice
....: check_list(lst, 5, 10000)
....:
100 loops, best of 3: 6.7 ms per loop
You could use the second argument of index to find the subsequent indices of occurrences:
def check_list(l, x, n):
    i = 0
    try:
        for _ in range(n):
            # resume the search just past the previous match
            i = l.index(x, i) + 1
        return True
    except ValueError:
        # fewer than n occurrences were found
        return False
print(check_list([1, 3, 2, 3, 4, 0, 8, 3, 7, 3, 1, 1, 0], 3, 4))
About index arguments
The official documentation does not mention the method's second or third argument in its Python Tutorial, section 5, but you can find them in the more comprehensive Python Standard Library, section 4.6:
s.index(x[, i[, j]])
index of the first occurrence of x in s (at or after index i and before index j) (8)
(8) index raises ValueError when x is not found in s. When supported, the additional arguments to the index method allow efficient searching of subsections of the sequence. Passing the extra arguments is roughly equivalent to using s[i:j].index(x), only without copying any data and with the returned index being relative to the start of the sequence rather than the start of the slice.
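The difference between passing a start index and slicing first can be seen in a small example. The list s below is arbitrary sample data chosen for illustration.

```python
s = ["a", "b", "a", "c", "a"]

first = s.index("a")                  # search the whole list
second = s.index("a", first + 1)      # resume after the first match
via_slice = s[first + 1:].index("a")  # same search, but on a copy

print(first)      # 0
print(second)     # 2, relative to the start of s
print(via_slice)  # 1, relative to the start of the slice

# The third argument bounds the search: only index 1 is examined here.
try:
    s.index("a", 1, 2)
except ValueError:
    print("not found between 1 and 2")
```

The slice version returns a position that has to be shifted by the slice offset before it can be reused, which is why the start argument is the more convenient choice inside the loop.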
In comparing this list.index method with the islice(gen) method, the most important factor is the distance between the occurrences to be found. Once that distance is on average 13 or more, list.index has better performance. For lower distances, the fastest method also depends on the number of occurrences to find. The more occurrences to find, the sooner the islice(gen) method outperforms list.index in terms of average distance; this gain fades out when the number of occurrences becomes really large.
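A rough way to explore this trade-off on your own machine is sketched below. The helpers make_list and compare, the list size, and the chosen gaps are all assumptions made for this illustration. The timings depend on the Python version and hardware, so no particular crossover point should be expected.

```python
from itertools import islice
from timeit import timeit

def check_index(l, x, n):
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False

def check_islice(l, x, n):
    gen = (True for item in l if item == x)
    return next(islice(gen, n - 1, None), False)

def make_list(size, gap):
    # Hypothetical test data: a 1 every `gap` positions, 0 elsewhere.
    return [1 if i % gap == 0 else 0 for i in range(size)]

def compare(size, gap, n, repeat=20):
    # Time both methods on the same list; returns (index_time, islice_time).
    lst = make_list(size, gap)
    t_index = timeit(lambda: check_index(lst, 1, n), number=repeat)
    t_islice = timeit(lambda: check_islice(lst, 1, n), number=repeat)
    return t_index, t_islice

if __name__ == "__main__":
    for gap in (2, 5, 13, 50):
        t_index, t_islice = compare(size=20_000, gap=gap, n=200)
        print(f"gap={gap:3d}  list.index: {t_index:.5f}s  islice: {t_islice:.5f}s")
```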
The following graph draws the (approximate) border line at which both methods perform equally well (the X-axis is logarithmic):
Ultimately, short-circuiting is the way to go if you expect a significant number of cases to terminate early. Let's explore the possibilities:
Take the case of the list.index method versus the list.count method (these were the two fastest according to my testing, although YMMV).
With list.index, if the list contains n or more occurrences of x, the method is called n times. While execution is inside list.index it is very fast, allowing much faster iteration than the custom generator. If the occurrences of x are far enough apart, a large speedup will be seen from the lower-level execution of index. If instances of x are close together (shorter list / more common x's), much more of the time will be spent executing the slower Python code that mediates the rest of the function (looping over n and incrementing i).
The benefit of list.count is that it does all of the heavy lifting outside of slow Python execution. It is a much easier function to analyse, as it is simply a case of O(n) time complexity. Because it spends almost none of its time in the Python interpreter, however, it is almost guaranteed to be faster for short lists.
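To make the short-circuiting difference concrete, the sketch below counts how many equality tests each method performs. CountingProbe is a hypothetical helper invented for this illustration, the list contents are arbitrary, and the counts in the comments assume CPython's list.count and list.index.

```python
class CountingProbe:
    """Hypothetical helper: equals `value` and records each == evaluation."""
    def __init__(self, value):
        self.value = value
        self.calls = 0

    def __eq__(self, other):
        self.calls += 1
        return other == self.value

def check_count(l, x, n):
    return l.count(x) >= n

def check_index(l, x, n):
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False

lst = [5] * 10 + [0] * 990  # all matches sit at the front

p_count = CountingProbe(5)
p_index = CountingProbe(5)
print(check_count(lst, p_count, 3), p_count.calls)  # True 1000
print(check_index(lst, p_index, 3), p_index.calls)  # True 3
```

list.count compares against every element regardless of n, while the list.index loop stops after the third match. Whether that saves wall-clock time depends on the per-call overhead discussed above.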
Summary of selection criteria:
Shorter lists favor list.count
Lists of any length that are unlikely to short-circuit favor list.count
Lists that are long and likely to short-circuit favor list.index
I would recommend using Counter from the collections module.
from collections import Counter
import numpy as np
%%time
[k for k, v in Counter(np.random.randint(0, 10000, 10000000)).items() if v > 1100]
#Output:
Wall time: 2.83 s
[1848, 1996, 2461, 4481, 4522, 5844, 7362, 7892, 9671, 9705]
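Applied to the original question, the Counter idea might look like the sketch below. The function names are chosen here for illustration. Note that Counter requires hashable elements and always consumes the whole input, so it does not short-circuit. It pays off mainly when many different values are checked against the same data.

```python
from collections import Counter

def check_list(l, x, n):
    # A Counter returns 0 for missing keys, so absent values are handled.
    return Counter(l)[x] >= n

def elements_at_least(l, n):
    # Every distinct value that occurs at least n times, in first-seen order.
    return [k for k, v in Counter(l).items() if v >= n]

data = [1, 3, 2, 3, 4, 0, 8, 3, 7, 3, 1, 1, 0]
print(check_list(data, 3, 4))      # True
print(check_list(data, 9, 1))      # False
print(elements_at_least(data, 3))  # [1, 3]
```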
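This shows another way of doing it: sort the list, find the index of the first occurrence of x, and check whether the element n - 1 positions further on is the same as the item you want to find.
def check_list(l, x, n):
    _l = sorted(l)
    try:
        index_1 = _l.index(x)  # first occurrence of x in the sorted copy
        return _l[index_1 + n - 1] == x
    except (ValueError, IndexError):
        # ValueError: x is absent; IndexError: fewer than n items remain
        return False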
# assumes l, x and n are already defined
c = 0
for i in l:
    if i == x:
        c += 1
if c >= n:
    print("true")
else:
    print("false")
Another possibility might be:
def check_list(l, x, n):
    return sum([1 for i in l if i == x]) >= n