How does Python handle checking 'if object in list'

I'm wondering because I need a function that is disgustingly fast at checking whether a word is in a dictionary list. I'm considering leaving the dictionary as one large string and running a regex against it instead. This needs to be absurdly fast. So I just need a basic overview of how Python handles checking whether a string is in a list of strings, and whether it is beyond-reasonably fast.

If you want a blazingly fast membership test, then a list is the wrong data structure. Take a look at the implementation of list_contains in listobject.c, line 437. It iterates over the list in order, comparing the item with each element in turn. The later the item appears in the list, the longer it takes to find, and if the item is missing, the whole list must be scanned.
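The scan described above can be sketched in pure Python. This is only an illustration of the behaviour, not CPython's actual C code; `list_contains_sketch` and the comparison counter are hypothetical additions made for the example.

```python
def list_contains_sketch(items, target):
    """Roughly mimic `target in items` for a list: a left-to-right scan.

    Returns (found, comparisons) so the cost of the scan is visible.
    """
    comparisons = 0
    for element in items:
        comparisons += 1
        # Identity is checked first, then equality.
        if element is target or element == target:
            return True, comparisons
    return False, comparisons


words = ["apple", "banana", "cherry", "date"]
found_early = list_contains_sketch(words, "apple")   # first element
found_late = list_contains_sketch(words, "date")     # last element
missing = list_contains_sketch(words, "zebra")       # scans everything
```

The number of comparisons grows with the position of the item, and a missing item costs a full pass over the list.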

Use a set instead. Sets are implemented internally as a hash table, so looking up an object involves computing its hash and then probing a few table entries (usually just one). For the particular case of looking up a string, see set_lookkey_string in setobject.c, line 156.
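One way to observe the difference is to count how often Python calls `__hash__` and `__eq__` during a lookup. The `CountingKey` class below is a hypothetical helper written for this illustration; it is not part of any library.

```python
class CountingKey:
    """Hypothetical wrapper that counts hash and equality calls."""

    hash_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.eq_calls += 1
        return isinstance(other, CountingKey) and self.value == other.value


keys = [CountingKey(i) for i in range(1000)]
as_list = list(keys)
as_set = set(keys)          # each key is hashed once while building the set

probe = CountingKey(999)    # equal to the last element, but a distinct object

# Membership test against the list: one equality call per element scanned.
CountingKey.hash_calls = CountingKey.eq_calls = 0
in_list = probe in as_list
list_eq_calls = CountingKey.eq_calls

# Membership test against the set: one hash, then very few equality calls.
CountingKey.hash_calls = CountingKey.eq_calls = 0
in_set = probe in as_set
set_hash_calls = CountingKey.hash_calls
set_eq_calls = CountingKey.eq_calls
```

The list needs an equality call for every element up to the match, while the set hashes the probe once and compares it only against entries that land in the same slot.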

A set of strings has O(1) lookup time on average: effectively constant regardless of the size of the set. Making a set from your list of strings is easy:

my_set = set(my_list)
if my_word in my_set:
    print("it's there!")
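A rough way to check the claim on your own machine is to time both lookups with `timeit`. The word list here is synthetic and its size is an arbitrary assumption; a real dictionary would be loaded from a file, and absolute timings will vary.

```python
import timeit

# Hypothetical word list standing in for a real dictionary.
words_list = ["word%d" % i for i in range(20000)]
words_set = set(words_list)

# A missing word is the worst case for the list: every element is checked.
missing_word = "not-a-word"

list_time = timeit.timeit(lambda: missing_word in words_list, number=200)
set_time = timeit.timeit(lambda: missing_word in words_set, number=200)

print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

Building the set costs one pass over the list, so the conversion pays off when the same collection is queried many times.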

If you need really fast checking, use a set:

words = set(words_list)
if "hello" in words:
    print("hello found!")

A set is faster because it uses a hash-based lookup instead of a direct linear search.

According to this site, x in s is O(n) when s is a list. Therefore, in the worst case, it checks every entry.

At any rate, do not use a regex. Sets or lists are a much more intuitive way to represent the data, and a regex will not perform better than O(n).
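For comparison, here is a minimal sketch of what the "one big string plus regex" idea might look like. The helper name `in_big_string` and the newline-separated layout are assumptions made for this example; the point is that the pattern must be escaped and anchored to be correct, and the search still walks the string.

```python
import re

words_list = ["cat", "catalog", "dog"]

# Assumed layout: one word per line in a single large string.
big_string = "\n".join(words_list)


def in_big_string(word):
    # Escape the word and anchor it to line boundaries, otherwise
    # "cata" would match inside "catalog".
    pattern = r"^%s$" % re.escape(word)
    return re.search(pattern, big_string, flags=re.MULTILINE) is not None


words_set = set(words_list)

regex_hit = in_big_string("cat")
regex_partial = in_big_string("cata")
set_hit = "cat" in words_set
set_partial = "cata" in words_set
```

Both approaches give the same answers here, but the set version needs no escaping or anchoring and does not scan the data.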

If you're using a regular list, consider a set instead.

If you want to implement your own fine-tuned membership test for your container object, override __contains__.
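A minimal sketch of such a container follows. `WordBook` is a hypothetical class invented for this example; it simply delegates the `in` operator to an internal set.

```python
class WordBook:
    """Hypothetical container that answers `in` through an internal set."""

    def __init__(self, words):
        self._words = set(words)

    def __contains__(self, word):
        # Called by `word in book`; the result is interpreted as a bool.
        return word in self._words


book = WordBook(["hello", "world"])
has_hello = "hello" in book
has_bye = "bye" in book
```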

You probably want to use a set if you're worried about time. A set is much like a list, but it checks for membership based on hashing.

Use a set. If you need case-insensitive checking, store the words in the set lowercased. Then, when checking whether a word is in the set, lowercase it before testing membership.

The general rule is: normalize entries when building the set, and normalize an item before checking it against the set. Another example of normalization is collapsing consecutive whitespace characters into a single space and stripping leading and trailing whitespace.
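The rule can be sketched as a single helper applied on both sides. The name `normalize` and the sample phrases are assumptions for illustration; adjust the normalization steps to whatever your data needs.

```python
def normalize(text):
    """Lowercase, strip the ends, and collapse runs of whitespace."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining with " " collapses it.
    return " ".join(text.split()).lower()


# Normalize entries while building the set...
phrases = {normalize(p) for p in ["New  York", "  Los Angeles "]}

# ...and normalize each item before checking it.
found = normalize("new   YORK") in phrases
not_found = normalize("Chicago") in phrases
```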

Running a regex against your word list is a very bad idea; it scales very badly. Using dict(), set() or frozenset() will scale a lot better:

s = set(['one', 'two', 'three'])
'two' in s     # True

b = 'four'
b in s         # False

s.add('four')
b in s         # True
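The other two types mentioned above behave similarly for membership tests. The following is a small sketch with made-up sample data: a frozenset is an immutable set, and `in` on a dict checks its keys.

```python
fixed_words = frozenset(["one", "two", "three"])
definitions = {"one": 1, "two": 2}   # hypothetical word-to-value mapping

has_two = "two" in fixed_words       # hash-based lookup, like set
has_one_key = "one" in definitions   # `in` on a dict tests the keys

# A frozenset cannot be modified after creation: it has no add() method.
try:
    fixed_words.add("four")
except AttributeError:
    can_add = False
else:
    can_add = True
```

A frozenset suits a word list that never changes, and a dict is useful when each word also needs an associated value.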
