Python: How to efficiently check whether an item is in a list?

Question

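I have a list of strings (words like this), and while parsing a text I need to check whether a word belongs to the group of words in my current list.

However, my input is very large (about 600 million lines), and according to the Python documentation, checking whether an element belongs to a list is an O(n) operation.

My code is something like this: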

words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)

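Since this takes too much time (days, actually), I want to improve the part where most of the time is spent. I looked at Python's collections module and, more precisely, at deque. However, it only gives O(1) access to the head and the tail of the list, not to the middle.

Does anyone have an idea how to do this in a better way?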

Answer 1

You might consider a trie, a DAWG, or a database. There are several Python implementations of each.
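To make the trie suggestion concrete, here is a minimal sketch built from nested dicts. The names `build_trie`, `trie_contains`, and the `_END` marker are hypothetical, chosen for illustration and not taken from any particular library. For plain exact-match membership a set is usually simpler; a trie mainly pays off when prefix queries are also needed.

```python
_END = object()  # hypothetical sentinel marking the end of a complete word


def build_trie(words):
    """Build a trie as nested dicts: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def trie_contains(root, word):
    """Return True only if `word` was inserted as a complete word."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return _END in node


trie = build_trie(["cat", "car", "dog"])
print(trie_contains(trie, "car"))  # True
print(trie_contains(trie, "ca"))   # False: a prefix only, not a stored word
```

Lookup cost here depends on the length of the word, not on how many words are stored.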

Here are some relative timings of a set versus a list for you to consider:

import timeit
import random

with open('/usr/share/dict/words','r') as di:  # UNIX 250k unique word list 
    all_words_set={line.strip() for line in di}

all_words_list=list(all_words_set)    # slightly faster if this list is sorted...      

test_list=[random.choice(all_words_list) for i in range(10000)] 
test_set=set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set: 
           count+=1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list: 
           count+=1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set: 
           count+=1
    return count    

# print() calls so the script also runs on Python 3
print("list:", timeit.Timer(list_f).timeit(1), "secs")
print("set:", timeit.Timer(set_f).timeit(1), "secs")
print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")

Prints:

list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs

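That is, matching a set of 10,000 words against a set of 250,000 words is about 17,085 times faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is about 28,392 times faster than using an unsorted list alone.

For membership testing, a list is O(n), while sets and dicts are O(1) on average for lookups.

Conclusion: use better data structures for 600 million lines of text!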
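Applied to the loop in the question, that conclusion could look like the sketch below. It assumes `line` has already been split into words, as the question's loop implies. The contents of `my_list` and `line` are made-up sample data, and `my_set` is a name introduced here for illustration.

```python
my_list = ["apple", "banana", "cherry"]  # sample data standing in for the real word list
my_set = set(my_list)                    # build once, outside the per-line loop

line = ["I", "like", "banana", "and", "cherry", "pie"]  # one line, already split into words

words_in_line = []
for word in line:
    if word in my_set:  # average O(1) hash lookup instead of an O(n) list scan
        words_in_line.append(word)

print(words_in_line)  # ['banana', 'cherry']
```

The important detail is that the set is built a single time; rebuilding it for every line would cost O(n) each time and erase the gain.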

Answer 2

This uses a list comprehension:

words_in_line = [word for word in line if word in my_list]

This is more efficient than the code you posted, though how much it helps with your huge dataset is hard to know.

Answer 3

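I'm not clear on why you chose a list in the first place, but here are some alternatives:

Using a set() is probably a good idea. It is very fast, though unordered, and sometimes that is exactly what's needed.

If you need things ordered and also need arbitrary lookups, you could use a tree of some sort: http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/

If set membership testing with a small number of false positives here or there is acceptable, you might look into a bloom filter: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/

Depending on what you're doing, a trie might also be very good.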
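To illustrate the bloom filter idea mentioned above, here is a minimal sketch. It is not the API of the module linked above; the `BloomFilter` class, its method names, and the sizes (`num_bits=1024`, `num_hashes=3`) are assumptions made up for this example and are not tuned for real data.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter; sizes are arbitrary, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means definitely absent; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for w in ["cat", "dog", "bird"]:
    bf.add(w)

print("cat" in bf)    # True: added items are always reported as present
print("zebra" in bf)  # very likely False, but a false positive is possible
```

The trade-off is memory: the filter stores only bits, never the words themselves, which is why it can report false positives but never false negatives.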

Answer 4

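There are two improvements you can make here.

1. Back your word list with a hash table. This gives you O(1) performance when you check whether a word is present in the word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
2. Use a more appropriate structure for your collection of matching words.
   - If you need to store all of the matches in memory at the same time, use a deque (collections.deque), since its append performance is superior to that of lists.
   - If you don't need all of the matches in memory at once, consider using a generator. A generator iterates over the matched values according to the logic you specify, but it only holds part of the resulting list in memory at a time. It may offer better performance if you are experiencing I/O bottlenecks.

Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all of those words in memory at once).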

from itertools import chain

d = {'a', 'b', 'c'}  # Load our dictionary
# The file is read lazily inside the loop below, so it must stay open until
# the loop finishes; the with block closes it afterwards
with open('c:\\input.txt', 'r') as f:
    # Build a generator to get the words in the file
    all_words_generator = chain.from_iterable(line.split() for line in f)
    # Build a generator to filter out the non-dictionary words
    matching_words_generator = (word for word in all_words_generator if word in d)
    for matched_word in matching_words_generator:
        # Do something with matched_word
        print(matched_word)

Contents of input.txt:

a b dog cat
c dog poop
maybe b cat
dog

Output:

a
b
c
b
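The example above takes the generator route. For the other option in the list, keeping every match in memory, a sketch with `collections.deque` might look like this. The sample `lines` mirror the input.txt contents above and stand in for a real file; `matches` is a name introduced here for illustration.

```python
from collections import deque

d = {'a', 'b', 'c'}  # the dictionary, as a set for fast membership tests
lines = ["a b dog cat", "c dog poop", "maybe b cat", "dog"]  # stands in for the file

matches = deque()
for line in lines:
    for word in line.split():
        if word in d:
            matches.append(word)  # appending to a deque is O(1)

print(list(matches))  # ['a', 'b', 'c', 'b']
```

With 600 million lines of input, holding every match this way may need a lot of memory, which is the reason the answer prefers the generator.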


Answer 1
14 Accepted 2012-06-08 00:18:14

Answer 2
1 2012-06-08 00:02:35

Answer 3
1 2012-06-08 00:58:01

Answer 4
0 2012-06-08 00:47:15
