
Fastest way to check if a value exists in a list

What is the fastest way to check if a value exists in a very large list?

7 in a

Clearest and fastest way to do it.

You can also consider using a set, but constructing that set from your list may take more time than the faster membership testing will save. The only way to be certain is to benchmark well. (This also depends on what operations you require.)
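One way to run such a benchmark is with timeit. The sketch below is only illustrative: the list size, the number of lookups, and the helper names are assumptions, and the absolute numbers will differ from machine to machine. The set timing deliberately includes the cost of building the set.

```python
import random
import timeit


def count_hits_list(values, queries):
    # Linear scan of the list for every query: O(len(values)) per lookup.
    return sum(1 for q in queries if q in values)


def count_hits_set(values, queries):
    # Pay the one-off cost of building the set, then do O(1) average lookups.
    lookup = set(values)
    return sum(1 for q in queries if q in lookup)


random.seed(0)
values = list(range(20_000))
random.shuffle(values)
queries = [random.randrange(40_000) for _ in range(200)]

list_time = timeit.timeit(lambda: count_hits_list(values, queries), number=3)
set_time = timeit.timeit(lambda: count_hits_set(values, queries), number=3)
print(f"list: {list_time:.4f}s  set (including construction): {set_time:.4f}s")
```

With a single query the set version may well lose, because building the set touches every element; the more lookups you run against the same data, the more the set pays off.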

As stated by others, in can be very slow for large lists. Here are some comparisons of the performance of in, set, and bisect. Note that the time (in seconds) is on a log scale.

[Figure: running time of in, set, and bisect versus list size; time axis on a log scale]

Code for testing:

import random
import bisect
import matplotlib.pyplot as plt
import math
import time


def method_in(a, b, c):
    start_time = time.time()
    for i, x in enumerate(a):
        if x in b:
            c[i] = 1
    return time.time() - start_time


def method_set_in(a, b, c):
    start_time = time.time()
    s = set(b)
    for i, x in enumerate(a):
        if x in s:
            c[i] = 1
    return time.time() - start_time


def method_bisect(a, b, c):
    start_time = time.time()
    b.sort()
    for i, x in enumerate(a):
        index = bisect.bisect_left(b, x)
        if index < len(b):
            if x == b[index]:
                c[i] = 1
    return time.time() - start_time


def profile():
    time_method_in = []
    time_method_set_in = []
    time_method_bisect = []

    # adjust range down if runtime is too long or up if there are too many zero entries in any of the time_method lists
    Nls = [x for x in range(10000, 30000, 1000)]
    for N in Nls:
        a = [x for x in range(0, N)]
        random.shuffle(a)
        b = [x for x in range(0, N)]
        random.shuffle(b)
        c = [0 for x in range(0, N)]

        time_method_in.append(method_in(a, b, c))
        time_method_set_in.append(method_set_in(a, b, c))
        time_method_bisect.append(method_bisect(a, b, c))

    plt.plot(Nls, time_method_in, marker='o', color='r', linestyle='-', label='in')
    plt.plot(Nls, time_method_set_in, marker='o', color='b', linestyle='-', label='set')
    plt.plot(Nls, time_method_bisect, marker='o', color='g', linestyle='-', label='bisect')
    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc='upper left')
    plt.yscale('log')
    plt.show()


profile()

You could put your items into a set. Set lookups are very efficient.

Try:

s = set(a)
if 7 in s:
    pass  # do stuff

Edit: In a comment you say that you'd like to get the index of the element. Unfortunately, sets have no notion of element position. An alternative is to pre-sort your list and then use binary search every time you need to find an element.
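A minimal sketch of that idea, assuming the values are unique and mutually comparable; the helper names build_sorted_index and find_index are made up for illustration. It sorts (value, original index) pairs once and keeps two parallel lists so that bisect can search the values directly and still report the position in the original list.

```python
import bisect


def build_sorted_index(items):
    # Sort (value, original_index) pairs once, then split them into two parallel lists.
    pairs = sorted((value, i) for i, value in enumerate(items))
    keys = [value for value, _ in pairs]
    positions = [i for _, i in pairs]
    return keys, positions


def find_index(keys, positions, target):
    # Binary search in the sorted keys; return the original index, or -1 if absent.
    pos = bisect.bisect_left(keys, target)
    if pos < len(keys) and keys[pos] == target:
        return positions[pos]
    return -1


a = [4, 2, 3, 1, 5, 6]
keys, positions = build_sorted_index(a)
print(find_index(keys, positions, 3))  # 2
print(find_index(keys, positions, 7))  # -1
```

The sort costs O(n log n) once; each lookup afterwards is O(log n), so this only pays off when the same list is searched many times.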

from typing import Iterable

def check_availability(element, collection: Iterable):
    return element in collection

Usage

check_availability('a', [1,2,3,4,'a','b','c'])

I believe this is the fastest way to know if a chosen value is in an array.

The original question was:

What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

Thus there are two things to find:

  1. is an item in the list, and
  2. what is its index (if it is in the list).

Towards this, I modified @xslittlegrass's code to compute indexes in all cases, and added an additional method.

Results

[Figure: log(time) versus list size for the in, try, set, bisect, and reverse methods]

Methods are:

  1. in: basically if x in b: return b.index(x)
  2. try: try/except on b.index(x) (skips having to check if x in b)
  3. set: basically if x in set(b): return b.index(x)
  4. bisect: sort b together with its indexes, then binary search for x in the sorted b. (Note: this is modified from @xslittlegrass's version, which returns the index in the sorted b rather than in the original b.)
  5. reverse: form a reverse lookup dictionary d for b; then d[x] provides the index of x.

Results show that method 5 is the fastest.

Interestingly, the try and the set methods are equivalent in time.


Test Code

import random
import bisect
import matplotlib.pyplot as plt
import math
import timeit
import itertools

def wrapper(func, *args, **kwargs):
    " Use to produced 0 argument function for call it"
    # Reference https://www.pythoncentral.io/time-a-python-function/
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

def method_in(a,b,c):
    for i,x in enumerate(a):
        if x in b:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_try(a,b,c):
    for i, x in enumerate(a):
        try:
            c[i] = b.index(x)
        except ValueError:
            c[i] = -1
    return c

def method_set_in(a,b,c):
    s = set(b)
    for i,x in enumerate(a):
        if x in s:
            c[i] = b.index(x)
        else:
            c[i] = -1
    return c

def method_bisect(a,b,c):
    " Finds indexes using bisection "

    # Create a sorted b with its index
    bsorted = sorted([(x, i) for i, x in enumerate(b)], key = lambda t: t[0])

    for i,x in enumerate(a):
        index = bisect.bisect_left(bsorted,(x, ))
        c[i] = -1
        if index < len(bsorted):
            if x == bsorted[index][0]:
                c[i] = bsorted[index][1]  # index in the b array

    return c

def method_reverse_lookup(a, b, c):
    reverse_lookup = {x:i for i, x in enumerate(b)}
    for i, x in enumerate(a):
        c[i] = reverse_lookup.get(x, -1)
    return c

def profile():
    Nls = [x for x in range(1000,20000,1000)]
    number_iterations = 10
    methods = [method_in, method_try, method_set_in, method_bisect, method_reverse_lookup]
    time_methods = [[] for _ in range(len(methods))]

    for N in Nls:
        a = [x for x in range(0,N)]
        random.shuffle(a)
        b = [x for x in range(0,N)]
        random.shuffle(b)
        c = [0 for x in range(0,N)]

        for i, func in enumerate(methods):
            wrapped = wrapper(func, a, b, c)
            time_methods[i].append(math.log(timeit.timeit(wrapped, number=number_iterations)))

    markers = itertools.cycle(('o', '+', '.', '>', '2'))
    colors = itertools.cycle(('r', 'b', 'g', 'y', 'c'))
    labels = itertools.cycle(('in', 'try', 'set', 'bisect', 'reverse'))

    for i in range(len(time_methods)):
        plt.plot(Nls,time_methods[i],marker = next(markers),color=next(colors),linestyle='-',label=next(labels))

    plt.xlabel('list size', fontsize=18)
    plt.ylabel('log(time)', fontsize=18)
    plt.legend(loc = 'upper left')
    plt.show()

profile()

a = [4, 2, 3, 1, 5, 6]

index = dict((y, x) for x, y in enumerate(a))
try:
    a_index = index[7]
except KeyError:
    print("Not found")
else:
    print("found")

This will only be a good idea if a doesn't change, so that we can do the dict() part once and then use it repeatedly. If a does change, please provide more detail on what you are doing.
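One detail worth knowing if a could ever contain duplicates (the question says it does not): building the dictionary from enumerate(a) keeps the last position of each value, whereas a.index() returns the first. The sketch below uses a hypothetical helper, first_index_map, that keeps the first position instead.

```python
def first_index_map(items):
    # Map each value to the position of its first occurrence, matching list.index().
    index = {}
    for position, value in enumerate(items):
        index.setdefault(value, position)
    return index


a = [4, 2, 3, 2, 5]
index = first_index_map(a)
print(index[2])      # 1, same as a.index(2)
print(index.get(7))  # None, because 7 is not in a
```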

Be aware that the in operator tests not only equality (==) but also identity (is). The in logic for lists is roughly equivalent to the following (it's actually written in C and not Python though, at least in CPython):

for element in s:
    if element is target:
        # fast check for identity implies equality
        return True
    if element == target:
        # slower check for actual equality
        return True
return False

In most circumstances this detail is irrelevant, but in some circumstances it might leave a Python novice surprised. For example, numpy.NAN has the unusual property of not being equal to itself:

>>> import numpy
>>> numpy.NAN == numpy.NAN
False
>>> numpy.NAN is numpy.NAN
True
>>> numpy.NAN in [numpy.NAN]
True

To distinguish between these unusual cases you could use any() like:

>>> lst = [numpy.NAN, 1 , 2]
>>> any(element == numpy.NAN for element in lst)
False
>>> any(element is numpy.NAN for element in lst)
True 

Note that the in logic for lists, written with any(), would be:

any(element is target or element == target for element in lst)

However, I should emphasize that this is an edge case, and for the vast majority of cases the in operator is highly optimised and exactly what you want (either with a list or with a set).
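The same behaviour can be reproduced without NumPy. In the sketch below (variable names are illustrative), a NaN object is found by identity, a second, distinct NaN object is not found, and math.isnan gives a test that does not depend on which object happens to be stored.

```python
import math

nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object
lst = [nan_a, 1, 2]

print(nan_a in lst)  # True: the identity check succeeds
print(nan_b in lst)  # False: different object, and NaN != NaN
print(any(isinstance(x, float) and math.isnan(x) for x in lst))  # True
```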

If you only want to check the existence of one element in a list,

7 in list_data

is the fastest solution. Note though that

7 in set_data

is a near-free operation, independent of the size of the set! Creating a set from a large list is 300 to 400 times slower than in, so creating a set first only pays off if you need to check for many elements.

[Figure: runtime versus len(data) for list_in, set_in, and create_set_from_list]

Plot created with perfplot:

import perfplot
import numpy as np


def setup(n):
    data = np.arange(n)
    np.random.shuffle(data)
    return data, set(data)


def list_in(data):
    return 7 in data[0]


def create_set_from_list(data):
    return set(data[0])


def set_in(data):
    return 7 in data[1]


b = perfplot.bench(
    setup=setup,
    kernels=[list_in, set_in, create_set_from_list],
    n_range=[2 ** k for k in range(24)],
    xlabel="len(data)",
    equality_check=None,
)
b.save("out.png")
b.show()

It sounds like your application might gain an advantage from the use of a Bloom filter data structure.

In short, a Bloom filter lookup can tell you very quickly if a value is DEFINITELY NOT present in a set. Otherwise, you can do a slower lookup to get the index of a value that POSSIBLY MIGHT BE in the list. So if your application tends to get the "not found" result much more often than the "found" result, you might see a speed-up by adding a Bloom filter.

For details, Wikipedia provides a good overview of how Bloom filters work, and a web search for "python bloom filter library" will provide at least a couple of useful implementations.
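To make the idea concrete, here is a toy Bloom filter written from scratch. It is a sketch rather than a substitute for a tuned library: the class name, the bit-array size, the number of hashes, and the use of repr() plus salted SHA-256 to derive bit positions are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        # repr() is a simplification: equal values with different reprs would not match.
        data = repr(item).encode("utf-8")
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


a = [4, 2, 3, 1, 5, 6]
bloom = BloomFilter()
for value in a:
    bloom.add(value)


def find(value):
    # Only pay for the slow list search when the filter cannot rule the value out.
    if not bloom.might_contain(value):
        return -1
    try:
        return a.index(value)
    except ValueError:  # false positive from the filter
        return -1


print(find(3))  # 2
print(find(7))  # -1
```

In pure Python the hashing is itself fairly expensive, so a plain set or dict is usually the better choice unless memory is the constraint; the sketch is meant to show the mechanism, not to win a benchmark.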

Or use __contains__:

sequence.__contains__(value)

Demo:

>>> l = [1, 2, 3]
>>> l.__contains__(3)
True
>>> 

This is not code, but an algorithm for very fast searching.

If your list and the value you are looking for are all numbers, this is pretty straightforward. If they are strings, see the note at the bottom.

  • Let "n" be the length of your list.
  • Optional step: if you need the index of the element, add a second column to the list with the current index of each element (0 to n-1); see below.
  • Sort your list or a copy of it (.sort()).
  • Loop through:
    • Compare your number to the n/2th element of the list.
      • If larger, loop again between indexes n/2 and n.
      • If smaller, loop again between indexes 0 and n/2.
      • If the same: you found it.
  • Keep narrowing the list until you have found it or only have 2 numbers left (below and above the one you are looking for).
  • This will find any element in at most about 20 steps for a list of 1,000,000 (log2(n), to be precise).

If you also need the original position of your number, look for it in the second (index) column.

If your list is not made of numbers, the method still works and will be fastest, but you may need to define a function that can compare/order strings.

Of course, this requires the up-front cost of sorting (sorted()), but if you keep reusing the same list for checking, it may be worth it.
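The steps above can be sketched as a plain loop. This is an illustrative implementation with a made-up function name, and it also counts the comparisons so the step bound can be checked: for a list of 1,000,000 items the loop runs at most 20 times, because 2**20 is the first power of two above 1,000,000.

```python
def binary_search(sorted_items, target):
    """Return (position, steps); position is -1 when target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return -1, steps


data = list(range(1_000_000))  # already sorted
position, steps = binary_search(data, 999_999)
print(position, steps)
```

In practice the standard-library bisect module does the same halving in C and is the more idiomatic choice.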

Because the question is not always meant to be understood as asking for the technically fastest way, I always suggest the most straightforward way to understand and write: a list comprehension, a one-liner.

[i for i in list_from_which_to_search if i in list_to_search_in]

I had a list_to_search_in with all the items, and wanted to return the indexes of the items in list_from_which_to_search.

This returns the matching items of list_from_which_to_search in a nice list.

There are other ways to solve this problem; however, list comprehensions are fast enough, and quick enough to write, to do the job.
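If positions rather than the items themselves are needed, enumerate can be added to the same one-liner. This is a sketch using the variable names from the answer with made-up sample data; converting list_to_search_in to a set first is an optional change that avoids a linear scan per item.

```python
list_from_which_to_search = ["a", "x", "c", "y"]
list_to_search_in = ["a", "b", "c", "d"]

# Items that are present (the original one-liner).
found_items = [i for i in list_from_which_to_search if i in list_to_search_in]

# Positions, within list_from_which_to_search, of the items that are present.
lookup = set(list_to_search_in)  # optional: makes each membership test O(1) on average
found_positions = [n for n, i in enumerate(list_from_which_to_search) if i in lookup]

print(found_items)      # ['a', 'c']
print(found_positions)  # [0, 2]
```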

@Winston Ewert's solution yields a big speed-up for very large lists, but this Stack Overflow answer indicates that the try:/except:/else: construct will be slowed down if the except branch is often reached. An alternative is to take advantage of the .get() method of the dict:

a = [4,2,3,1,5,6]

index = dict((y, x) for x, y in enumerate(a))

b = index.get(7, None)
if b is not None:
    "Do something with variable b"

The .get(key, default) method is just for the case when you can't guarantee a key will be in the dict. If the key is present, it returns the value (as would dict[key]), but when it is not, .get() returns your default value (here None). You need to make sure in this case that the chosen default cannot be mistaken for a real result (here, an index into a).
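When None could itself be a legitimate stored value, a private sentinel object is a common alternative default. In the sketch below the names _MISSING and position_of are illustrative and not part of the original answer.

```python
_MISSING = object()  # unique sentinel that cannot collide with any stored value

a = [4, 2, 3, 1, 5, 6]
index = {value: position for position, value in enumerate(a)}


def position_of(value):
    # Return the position of value in a, or -1 when it is absent.
    found = index.get(value, _MISSING)
    return -1 if found is _MISSING else found


print(position_of(3))  # 2
print(position_of(7))  # -1
```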

present = False
searchItem = 'd'
myList = ['a', 'b', 'c', 'd', 'e']
if searchItem in myList:
   present = True
   print('present = ', present)
else:
   print('present = ', present)

I think this is good:

mylist = [j for j in range(100)]
value = 13  # change as needed
print(value in mylist)
# output: True

If you want to print the value:

mylist = [j for j in range(100)]
value = 13  # change as needed
if value in mylist:
    print(value)

Edge case for spatial data

There are probably faster algorithms for handling spatial data (e.g. refactoring to use a k-d tree), but the special case of checking whether a vector is in an array is useful:

  • If you have spatial data (i.e. Cartesian coordinates)
  • If you have integer masks (i.e. array filtering)

In this case, I was interested in knowing if an (undirected) edge defined by two points was in a collection of (undirected) edges, such that

(pair in unique_pairs) | (pair[::-1] in unique_pairs) for pair in pairs

where pair constitutes two vectors of arbitrary length (i.e. shape (2,N)).

If the distance between these vectors is meaningful, then the test can be expressed by a floating-point inequality like

test_result = Norm(v1 - v2) < Tol

and "Value exists in List" is simply any(test_result).
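A minimal pure-Python sketch of that tolerance test, using math.dist (Python 3.8+) as the norm; the function name, the sample points, and the tolerance are assumptions chosen for illustration.

```python
import math


def vector_in_list(target, vectors, tol=1e-3):
    # True if any vector lies within tol (Euclidean distance) of target.
    return any(math.dist(target, v) < tol for v in vectors)


points = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (0.5, 0.5, 0.5)]
print(vector_in_list((1.0, 2.0, 3.0004), points))  # True: within 1e-3 of (1, 2, 3)
print(vector_in_list((1.0, 2.0, 3.1), points))     # False
```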

Example code and dummy test set generators for integer pairs and R3 vector pairs are below.

# 3rd party
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt

# optional
try:
    from tqdm import tqdm
except ModuleNotFoundError:
    def tqdm(X, *args, **kwargs):
        return X
    print('tqdm not found. tqdm is a handy progress bar module.')
    

def get_float_r3_pairs(size):
    """ generate dummy vector pairs in R3  (i.e. case of spatial data) """
    coordinates = np.random.random(size=(size, 3))
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs
    
        
def get_int_pairs(size):
    """ generate dummy integer pairs (i.e. case of array masking) """
    coordinates = np.random.randint(0, size, size)
    pairs = []
    for b in coordinates:
        for a in coordinates:
            pairs.append((a,b))
    pairs = np.asarray(pairs)
    return pairs


def float_tol_pair_in_pairs(pair:np.ndarray, pairs:np.ndarray) -> np.ndarray:
    """
    True if abs(a0 - b0) <= tol & abs(a1 - b1) <= tol for (ai1, aj2), (bi1, bj2)
    in [(a01, a02), ... (aik, ajl)]
    
    NB this is expected to be called in iteration so no sanitization is performed.

    Parameters
    ----------
    pair : np.ndarray
        pair of vectors with shape (2, M)
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        (pair in pairs) | (pair[::-1] in pairs).
    """
    m1 = np.sum( abs(LA.norm(pairs - pair, axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    m2 = np.sum( abs(LA.norm(pairs - pair[::-1], axis=2)) <= (1e-03, 1e-03), axis=1 ) == 2
    return m1 | m2


def get_unique_pairs(pairs:np.ndarray) -> np.ndarray:
    """
    apply float_tol_pair_in_pairs for pair in pairs
    
    Parameters
    ----------
    pairs : np.ndarray
        collection of vector pairs with shape (N, 2, M)

    Returns
    -------
    np.ndarray
        pair if not ((pair in rv) | (pair[::-1] in rv)) for pair in pairs

    """
    pairs = np.asarray(pairs).reshape((len(pairs), 2, -1))
    rv = [pairs[0]]
    for pair in tqdm(pairs[1:], desc='finding unique pairs...'):
        if not any(float_tol_pair_in_pairs(pair, rv)):
            rv.append(pair)
    return np.array(rv)

Timing results

For me it was 0.030 sec (real), 0.026 sec (user), and 0.004 sec (sys).

try:
    print("Started")
    x = ["a", "b", "c", "d", "e", "f"]

    i = 0

    while i < len(x):
        # check the current element before advancing, so index 0 is not skipped
        if x[i] == "e":
            print("Found")
        i += 1
except IndexError:
    pass

What is the fastest way to know if a value exists in a list (a list with millions of values in it) and what its index is?

I know that all values in the list are unique as in this example.

The first method I try is (3.8 sec in my real code):

a = [4,2,3,1,5,6]

if a.count(7) == 1:
    b=a.index(7)
    "Do something with variable b"

The second method I try is (2x faster: 1.9 sec for my real code):

a = [4,2,3,1,5,6]

try:
    b=a.index(7)
except ValueError:
    "Do nothing"
else:
    "Do something with variable b"

Method proposed by a Stack Overflow user (2.74 sec for my real code):

a = [4,2,3,1,5,6]
if 7 in a:
    a.index(7)

In my real code, the first method takes 3.81 sec and the second method takes 1.88 sec. It's a good improvement, but:

I'm a beginner with Python/scripting. Is there a faster way to do the same thing and save more processing time?

More specific explanation for my application:

In the Blender API I can access a list of particles:

particles = [1, 2, 3, 4, etc.]

From there, I can access a particle's location:

particles[x].location = [x,y,z]

And for each particle I test if a neighbour exists by searching each particle location like so:

if [x+1,y,z] in particles.location:
    # Find the identity of this neighbour particle, i.e. the particle's index in the array
    particles.index([x+1,y,z])
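For the neighbour search described above, one possible approach is to build a dictionary from location tuples to particle indexes once and then look neighbours up in it. This is a sketch and not Blender API code: the locations list stands in for the particle locations, and it assumes coordinates that compare exactly (for example integer grid positions); float coordinates would need rounding or a tolerance-based search instead.

```python
# Stand-in data: in Blender these would come from particles[i].location.
locations = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [2, 0, 0]]

# Lists are not hashable, so convert each location to a tuple before using it as a key.
index_by_location = {tuple(loc): i for i, loc in enumerate(locations)}


def right_neighbour(i):
    # Index of the particle at (x + 1, y, z), or None if there is none.
    x, y, z = locations[i]
    return index_by_location.get((x + 1, y, z))


print(right_neighbour(0))  # 1
print(right_neighbour(3))  # None
```

Building the dictionary is a single pass over the particles, after which each neighbour test is a constant-time lookup on average instead of a scan of the whole list.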
