Fastest way to check if a value exists in a list
What is the fastest way to check if a value exists in a very large list?
7 in a
This is the clearest and fastest way to do it.
You can also consider using a `set`, but constructing that set from your list may take more time than the faster membership testing will save. The only way to be certain is to benchmark well. (This also depends on what operations you require.)
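A minimal benchmarking sketch of that trade-off, using only the standard library. The list size, the absent target value and the repeat count are arbitrary assumptions for illustration, and timings will differ from machine to machine.

```python
import random
import timeit

# Hypothetical data: 200,000 unique integers in random order.
data = list(range(200_000))
random.shuffle(data)
target = -1  # absent value, so the list scan has to visit every element

repeats = 10

# Cost of scanning the list once per lookup.
list_time = timeit.timeit(lambda: target in data, number=repeats)

# One-off cost of building the set, then the cost of looking up in it.
build_time = timeit.timeit(lambda: set(data), number=repeats)
lookup_set = set(data)
set_time = timeit.timeit(lambda: target in lookup_set, number=repeats)

print(f"list lookup: {list_time / repeats:.6f} s")
print(f"set build:   {build_time / repeats:.6f} s")
print(f"set lookup:  {set_time / repeats:.6f} s")
```

If the set is built once and queried many times, the build cost is spread over all lookups; for a single lookup the plain list scan may well be cheaper.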
As stated by others, `in` can be very slow for large lists. Here are some comparisons of the performance of `in`, `set` and `bisect`. Note that the time (in seconds) is on a log scale.
Code for testing:
import random
import bisect
import matplotlib.pyplot as plt
import time

def method_in(a, b, c):
    # Linear scan of list b for every element of a
    start_time = time.time()
    for i, x in enumerate(a):
        if x in b:
            c[i] = 1
    return time.time() - start_time

def method_set_in(a, b, c):
    # Build a set from b once (included in the timing), then do hash lookups
    start_time = time.time()
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = 1
    return time.time() - start_time

def method_bisect(a, b, c):
    # Sort b once (in place), then binary-search it for every element of a
    start_time = time.time()
    b.sort()
    for i, x in enumerate(a):
        index = bisect.bisect_left(b, x)
        if index < len(b):  # the bound is the length of b, the list being searched
            if x == b[index]:
                c[i] = 1
    return time.time() - start_time

def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []
    # adjust range down if runtime is too long or up if there are too many zero entries in any of the time_method lists
    Nls = list(range(10000, 30000, 1000))
    for N in Nls:
        a = list(range(N))
        random.shuffle(a)
        b = list(range(N))
        random.shuffle(b)
        c = [0] * N
        time_method_in.append(method_in(a, b, c))
        time_method_set_in.append(method_set_in(a, b, c))
        time_method_bisect.append(method_bisect(a, b, c))
    plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
    plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
    plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('time (s), log scale', fontsize=18)
    plt.legend(loc='upper left')
    plt.yscale('log')
    plt.show()

profile()
You could put your items into a `set`. Set lookups are very efficient.
Try:
s = set(a)
if 7 in s:
    pass  # do stuff
Edit: In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.
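The pre-sort and binary search idea could look roughly like the sketch below, using the standard `bisect` module. The helper name `sorted_index` is made up for this illustration, and the position it returns refers to the sorted list, not to the original order.

```python
import bisect

def sorted_index(sorted_values, target):
    """Return the position of target in sorted_values, or -1 if it is absent."""
    pos = bisect.bisect_left(sorted_values, target)
    # bisect_left returns an insertion point, so confirm the value is really there
    if pos < len(sorted_values) and sorted_values[pos] == target:
        return pos
    return -1

values = sorted([4, 2, 3, 1, 5, 6])  # sort once, up front
print(sorted_index(values, 5))  # 4
print(sorted_index(values, 7))  # -1
```

Sorting costs O(n log n) once; each lookup afterwards is O(log n), so this only pays off when the same list is searched repeatedly.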
from collections.abc import Container

def check_availability(element, collection: Container) -> bool:
    # `in` works on any container: list, tuple, set, dict, str, ...
    return element in collection
Usage:
check_availability('a', [1,2,3,4,'a','b','c'])
I believe this is the fastest way to know if a chosen value is in an array.
The original question was:
What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?
Thus there are two things to find: whether the item is in the list, and what its index is (if it is in the list).
Towards this, I modified @xslittlegrass's code to compute indexes in all cases, and added an additional method.
Results
Methods are:
1. in: check `x in b`, then call `b.index(x)`
2. try: call `b.index(x)` inside try/except, skipping the `x in b` check
3. set: check `x in set(b)`, then call `b.index(x)`
4. bisect: sort b together with its indexes, then binary-search for x
5. reverse: build a reverse lookup dictionary from b; looking up x gives its index
Results show that method 5 is the fastest.
Interestingly, the try and the set methods are equivalent in time.
Test Code
import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    """Produce a zero-argument function that timeit can call."""
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a, b, c):
    for i, x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a, b, c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1
    return c

def method_set_in(a, b, c):
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a, b, c):
    """Finds indexes using bisection."""
    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key=lambda t: t[0])
    for i, x in enumerate(a):
        index = bisect.bisect_left(bsorted, (x, ))
        c[i] = -1
        if index < len(bsorted):  # the bound is the length of the searched list
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array
    return c

def method_reverse_lookup(a, b, c):
    # Map each value of b to its index once, then do dict lookups
    reverse_lookup = {x: i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = list(range(1000, 20000, 1000))
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]
    for N in Nls:
        a = list(range(N))
        random.shuffle(a)
        b = list(range(N))
        random.shuffle(b)
        c = [0] * N
        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))
    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))
    for i in range(len(time_methods)):
        plt.plot(Nls, time_methods[i], marker=next(markers), color=next(colors), linestyle='-', label=next(labels))
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc='upper left')
    plt.show()

profile()
a = [4,2,3,1,5,6]
index = dict((y,x) for x,y in enumerate(a))
try:
    a_index = index[7]
except KeyError:
    print("Not found")
else:
    print("found")
This will only be a good idea if a doesn't change, so that we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.
Be aware that the `in` operator tests not only equality (`==`) but also identity (`is`). The `in` logic for lists is roughly equivalent to the following (it's actually written in C and not Python though, at least in CPython):
def list_contains(s, target):  # illustrative name, not a real built-in
    for element in s:
        if element is target:
            # fast check for identity implies equality
            return True
        if element == target:
            # slower check for actual equality
            return True
    return False
In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised. For example, `numpy.nan` has the unusual property of not being equal to itself:
>>> import numpy
>>> numpy.nan == numpy.nan
False
>>> numpy.nan is numpy.nan
True
>>> numpy.nan in [numpy.nan]
True
To distinguish between these unusual cases you could use `any()`, like:
>>> lst = [numpy.nan, 1, 2]
>>> any(element == numpy.nan for element in lst)
False
>>> any(element is numpy.nan for element in lst)
True
Note that the `in` logic for lists, written with `any()`, would be:
any(element is target or element == target for element in lst)
However, I should emphasize that this is an edge case, and for the vast majority of cases the `in` operator is highly optimised and exactly what you want (either with a `list` or with a `set`).
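The same identity-before-equality behaviour can be seen without NumPy. The sketch below uses the built-in `float("nan")`; the variable names are only for illustration.

```python
nan = float("nan")
items = [nan, 1, 2]

print(nan == nan)             # False: NaN never compares equal, even to itself
print(nan in items)           # True: the very same object is in the list
print(float("nan") in items)  # False: a different NaN object, and == is False too
```

So `in` finds a NaN only when the list holds that exact object, which is why the identity check matters in this edge case.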
If you only want to check the existence of one element in a list,
7 in list_data
is the fastest solution. Note though that
7 in set_data
is a near-free operation, independent of the size of the set! Creating a set from a large list is 300 to 400 times slower than a single `in` check on the list, so creating a set first only pays off if you need to check for many elements.
Plot created with perfplot:
import perfplot
import numpy as np

def setup(n):
    # note: data is a NumPy array here, not a plain Python list
    data = np.arange(n)
    np.random.shuffle(data)
    return data, set(data)

def list_in(data):
    return 7 in data[0]

def create_set_from_list(data):
    return set(data[0])

def set_in(data):
    return 7 in data[1]

b = perfplot.bench(
    setup=setup,
    kernels=[list_in, set_in, create_set_from_list],
    n_range=[2 ** k for k in range(24)],
    xlabel="len(data)",
    equality_check=None,
)
b.save("out.png")
b.show()
It sounds like your application might gain advantage from the use of a Bloom filter data structure.
In short, a Bloom filter look-up can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower look-up to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the "not found" result much more often than the "found" result, you might see a speed-up by adding a Bloom filter.
For details, Wikipedia provides a good overview of how Bloom filters work, and a web search for "python bloom filter library" will provide at least a couple of useful implementations.
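To make the idea concrete, here is a toy Bloom filter written only with the standard library. It is a sketch for illustration, not a tuned implementation: the class name, the bit-array size, the number of hashes and the use of salted SHA-256 are all assumptions made here, and a dedicated library would be the better choice in practice.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by prefixing the data with a seed byte.
        data = repr(item).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

values = [4, 2, 3, 1, 5, 6]
index_of = {value: i for i, value in enumerate(values)}  # the slower, exact look-up
bloom = BloomFilter()
for value in values:
    bloom.add(value)

def find_index(value):
    if not bloom.might_contain(value):
        return -1                    # cheap rejection, no exact look-up needed
    return index_of.get(value, -1)   # confirm, since false positives are possible

print(find_index(5))  # 4
print(find_index(7))  # -1
```

In pure Python a dict look-up is already very cheap, so a filter like this mainly helps when the exact look-up is expensive (for example on disk or over a network).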
Or use `__contains__`:
sequence.__contains__(value)
This is the method the `in` operator calls, so it is not expected to be any faster than `value in sequence`.
Demo:
>>> l = [1, 2, 3]
>>> l.__contains__(3)
True
This is not the code, but the algorithm for very fast searching.
If your list and the value you are looking for are all numbers, this is pretty straightforward. If strings, look at the bottom:
If you also need the original position of your number, look for it in the second, index column.
If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function which can compare/order strings.
Of course, this needs the investment of the sorted() method, but if you keep reusing the same list for checking, it may be worth it.
Because the question is not always meant as "the fastest technical way", I always suggest the way that is most straightforward and fastest to understand and write: a list comprehension, as a one-liner:
[i for i in list_from_which_to_search if i in list_to_search_in]
I had a `list_to_search_in` with all the items, and wanted to return the indexes of the items in `list_from_which_to_search`.
This returns the matching items in a nice list (use enumerate() if you need their positions instead).
There are other ways to solve this problem. However, list comprehensions are quick enough to run, and quick enough to write, to solve it.
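As a hedged illustration of the difference between items and indexes, the sketch below runs the comprehension on made-up sample data and adds an `enumerate()` variant. Converting `list_to_search_in` to a set first is an addition made here, not part of the answer above; it avoids rescanning the list for every item.

```python
# Made-up sample data for illustration
list_to_search_in = ["a", "b", "c", "d"]
list_from_which_to_search = ["d", "x", "a"]

lookup = set(list_to_search_in)  # hash-based membership tests

# The items that were found
matches = [item for item in list_from_which_to_search if item in lookup]

# Their positions in list_from_which_to_search
indexes = [i for i, item in enumerate(list_from_which_to_search) if item in lookup]

print(matches)  # ['d', 'a']
print(indexes)  # [0, 2]
```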
@Winston Ewert's solution yields a big speed-up for very large lists, but this Stack Overflow answer indicates that the try:/except:/else: construct will be slowed down if the except branch is often reached. An alternative is to take advantage of the `.get()` method of the dict:
a = [4,2,3,1,5,6]
index = dict((y, x) for x, y in enumerate(a))
b = index.get(7, None)
if b is not None:
    "Do something with variable b"
The `.get(key, default)` method is just for the case when you can't guarantee a key will be in the dict. If the key is present, it returns the value (as would `dict[key]`), but when it is not, `.get()` returns your default value (here `None`). You need to make sure in this case that the chosen default can never be a valid value in the dict; `None` is safe here because the values are integer indexes.
present = False
searchItem = 'd'
myList = ['a', 'b', 'c', 'd', 'e']
if searchItem in myList:
    present = True
    print('present = ', present)
else:
    print('present = ', present)
I think this is good.
mylist = [j for j in range(100)]
value = 13  # change as needed
print(value in mylist)
# output: True
If you want to print the value:
mylist = [j for j in range(100)]
value = 13  # change as needed
if value in mylist:
    print(value)
There are probably faster algorithms for handling spatial data (e.g. refactoring to use a k-d tree), but the special case of checking if a vector is in an array is useful:
In this case, I was interested in knowing if an (undirected) edge defined by two points was in a collection of (undirected) edges, such that
(pair in unique_pairs) | (pair[::-1] in unique_pairs) for pair in pairs
where `pair` constitutes two vectors of arbitrary length (i.e. shape `(2,N)`).
If the distance between these vectors is meaningful, then the test can be expressed by a floating point inequality like
test_result = Norm(v1 - v2) < Tol
and "Value exists in List" is simply `any(test_result)`.
Example code and dummy test set generators for integer pairs and R3 vector pairs are below.
# 3rd party
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

# optional
try:
    from tqdm import tqdm
except ModuleNotFoundError:
    def tqdm(X, *args, **kwargs):
        return X
    print('tqdm not found. tqdm is a handy progress bar module.')

def get_float_r3_pairs(size):
    """ generate dummy vector pairs in R3 (i.e. case of spatial data) """
    coordinates = np.random.random(size=(size, 3))
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a, b))
    pairs = np.asarray(pairs)
    return pairs

def get_int_pairs(size):
    """ generate dummy integer pairs (i.e. case of array masking) """
    coordinates = np.random.randint(0, size, size)
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a, b))
    pairs = np.asarray(pairs)
    return pairs

def float_tol_pair_in_pairs(pair: np.ndarray, pairs: np.ndarray) -> np.ndarray:
    """
    True if abs(a0 - b0) <= tol & abs(a1 - b1) <= tol for (ai1, aj2), (bi1, bj2)
    in [(a01, a02), ... (aik, ajl)]
    NB this is expected to be called in iteration so no sanitization is performed.

    Parameters
    ----------
    pair : np.ndarray
        pair of vectors with shape (2, M)
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        (pair in pairs) | (pair[::-1] in pairs).
    """
    m1 = np.sum(abs(LA.norm(pairs - pair, axis=2)) <= (1e-03, 1e-03), axis=1) == 2
    m2 = np.sum(abs(LA.norm(pairs - pair[::-1], axis=2)) <= (1e-03, 1e-03), axis=1) == 2
    return m1 | m2

def get_unique_pairs(pairs: np.ndarray) -> np.ndarray:
    """
    apply float_tol_pair_in_pairs for pair in pairs

    Parameters
    ----------
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        pair if not ((pair in rv) | (pair[::-1] in rv)) for pair in pairs
    """
    pairs = np.asarray(pairs).reshape((len(pairs), 2, -1))
    rv = [pairs[0]]
    for pair in tqdm(pairs[1:], desc='finding unique pairs...'):
        if not any(float_tol_pair_in_pairs(pair, rv)):
            rv.append(pair)
    return np.array(rv)
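For readers who only need the basic tolerance test for single vectors, it can be sketched with the standard library alone. `vector_in_list` and the default tolerance are hypothetical names and values chosen for this illustration; `math.dist` requires Python 3.8 or later.

```python
import math

def vector_in_list(target, vectors, tol=1e-3):
    """True if any vector lies within Euclidean distance tol of target."""
    return any(math.dist(target, v) < tol for v in vectors)

vectors = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
print(vector_in_list((1.0, 2.0, 3.0004), vectors))  # True
print(vector_in_list((1.0, 2.0, 3.1), vectors))     # False
```

This is a linear scan, so for large collections a spatial index such as the k-d tree mentioned above would likely scale better.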
For me it was 0.030 sec (real), 0.026 sec (user), and 0.004 sec (sys).
print("Started")
x = ["a", "b", "c", "d", "e", "f"]
i = 0
while i < len(x):
    if x[i] == "e":  # check before incrementing so index 0 is not skipped
        print("Found")
        break
    i += 1
What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?
I know that all values in the list are unique as in this example.
The first method I tried (3.8 sec in my real code):
a = [4,2,3,1,5,6]
if a.count(7) == 1:
    b = a.index(7)
    "Do something with variable b"
The second method I tried (2x faster: 1.9 sec for my real code):
a = [4,2,3,1,5,6]
try:
    b = a.index(7)
except ValueError:
    "Do nothing"
else:
    "Do something with variable b"
Proposed method from a Stack Overflow user (2.74 sec for my real code):
a = [4,2,3,1,5,6]
if 7 in a:
    a.index(7)
In my real code, the first method takes 3.81 sec and the second method takes 1.88 sec. It's a good improvement, but:
I'm a beginner with Python/scripting. Is there a faster way to do the same things and save more processing time?
More specific explanation for my application:
In the Blender API I can access a list of particles:
particles = [1, 2, 3, 4, etc.]
From there, I can access a particle's location:
particles[x].location = [x,y,z]
And for each particle I test if a neighbour exists by searching each particle location like so:
if [x+1,y,z] in particles.location
    "Find the identity of this neighbour particle in x: the particle's index in the array"
    particles.index([x+1,y,z])
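One possible way to apply the dictionary approach to this neighbour search is sketched below. It does not use the Blender API: `locations` is a made-up stand-in for the particle positions, and integer grid coordinates are assumed so that positions can be used as dictionary keys (floating point positions would need rounding or a tolerance-based search first).

```python
# Hypothetical stand-in for the particle locations (integer grid coordinates)
locations = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (5, 5, 5)]

# Build the reverse lookup once: location -> particle index.
# Tuples are used because lists are not hashable.
index_by_location = {loc: i for i, loc in enumerate(locations)}

def neighbour_index(location, offset=(1, 0, 0)):
    """Return the index of the particle at location + offset, or -1 if none."""
    x, y, z = location
    dx, dy, dz = offset
    return index_by_location.get((x + dx, y + dy, z + dz), -1)

print(neighbour_index((0, 0, 0)))  # 1
print(neighbour_index((5, 5, 5)))  # -1
```

Each neighbour test is then a single dictionary look-up instead of a scan over all particles, provided the locations do not change between look-ups.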