Finding difference between two very large lists
I have two very large lists: one is 331991 elements long, let's call it a, and the other is 99171 elements long, call it b. I want to compare a to b and return a list of the elements in a that are not in b. This also needs to be as efficient as possible and keep the elements in the order in which they appear; that is probably a given, but I thought I might as well mention it.
It can be done in O(m + n) time, where m is the length of b and n is the length of a:
exclude = set(b) # O(m)
new_list = [x for x in a if x not in exclude] # O(n)
The key here is that sets have constant-time containment tests (on average). Perhaps you could consider having b be a set to begin with.
See also: List Comprehension
Using your example:
>>> a = ['a','b','c','d','e']
>>> b = ['a','b','c','f','g']
>>>
>>> exclude = set(b)
>>> new_list = [x for x in a if x not in exclude]
>>>
>>> new_list
['d', 'e']
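Since the question asks for the result in order of appearance, here is a minimal sketch of how the set-plus-comprehension approach differs from a plain set difference. The function name `difference_in_order` and the sample data are made up for illustration, and the elements are assumed to be hashable.

```python
def difference_in_order(a, b):
    """Return the elements of a that are not in b, keeping a's order and duplicates."""
    exclude = set(b)  # built once, O(len(b))
    return [x for x in a if x not in exclude]  # one average O(1) lookup per element of a


a = ['d', 'a', 'e', 'd', 'b']
b = ['a', 'b', 'c']

print(difference_in_order(a, b))  # ['d', 'e', 'd']

# A plain set difference gives the same members, but loses both the
# original order and the repeated 'd'; sorted() is used here only to
# make the printed output deterministic.
print(sorted(set(a) - set(b)))  # ['d', 'e']
```

If duplicates in a should be collapsed, that has to be done as a separate step; the comprehension keeps them as they are.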
Let us assume:
book = ["once", "upon", "time", ...., "end", "of", "very", "long", "story"]
dct = ["alfa", "anaconda", .., "zeta-jones"]
And you want to remove from the book list all the items that are present in dct.
Quick solution:
short_story = [word for word in book if word not in dct]
Speeding up searches in dct: turn dct into a set, which has faster lookups:
dct = set(dct)
short_story = [word for word in book if word not in dct]
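To get a feel for the difference, here is a rough timing sketch comparing list lookups with set lookups. The data is synthetic and the sizes are arbitrary assumptions, far smaller than the lists in the question; the absolute numbers will vary by machine.

```python
import timeit

# Synthetic stand-ins for the book and the dictionary (assumed data).
book = ["word%d" % i for i in range(1000)]
dct_list = ["word%d" % i for i in range(0, 2000, 2)]
dct_set = set(dct_list)


def filter_with_list():
    # Each "not in" scans the list, so this is O(len(book) * len(dct_list)).
    return [word for word in book if word not in dct_list]


def filter_with_set():
    # Each "not in" is a hash lookup, so this is O(len(book)) on average.
    return [word for word in book if word not in dct_set]


# Both variants must produce the same result; only the speed differs.
assert filter_with_list() == filter_with_set()

list_seconds = timeit.timeit(filter_with_list, number=3)
set_seconds = timeit.timeit(filter_with_set, number=3)
print("list lookups: %.4f s" % list_seconds)
print("set lookups:  %.4f s" % set_seconds)
```

The gap grows with the size of the dictionary, because the cost of a list lookup grows with it while the cost of a set lookup stays roughly flat.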
In case the book is very long and does not fit into memory, you may process it word by word. For this, we may use a generator:
def story_words(fname):
    """fname is name of text file with a story"""
    with open(fname) as f:
        for line in f:
            for word in line.split():
                yield word

# print out shortened story
for word in story_words("alibaba.txt"):
    if word not in dct:
        print(word)
And in case your dictionary is also far too large, you would have to give up speed and iterate over the content of the dictionary as well. But I will skip this for now.
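The answer leaves that case out. Purely as an illustration, one possible sketch is to read the dictionary in chunks and make one pass over the book per chunk. This assumes the book itself fits in memory and that the dictionary can be iterated once (for example from a file); `iter_chunks`, `filter_with_chunked_dictionary`, and the default `chunk_size` are made-up names and values, not something from the original answer.

```python
from itertools import islice


def iter_chunks(iterable, size):
    """Yield successive sets of at most `size` items from iterable."""
    it = iter(iterable)
    while True:
        chunk = set(islice(it, size))
        if not chunk:
            return
        yield chunk


def filter_with_chunked_dictionary(book, dictionary_words, chunk_size=100000):
    """Return the words of book not found in dictionary_words, in book order.

    Only one chunk of the dictionary is held in memory at a time, at the
    price of one pass over the book for every chunk.
    """
    keep = [True] * len(book)
    for chunk in iter_chunks(dictionary_words, chunk_size):
        for i, word in enumerate(book):
            if keep[i] and word in chunk:
                keep[i] = False
    return [word for word, kept in zip(book, keep) if kept]


book = ["once", "upon", "a", "time"]
dictionary_words = iter(["a", "upon", "zeta"])
print(filter_with_chunked_dictionary(book, dictionary_words, chunk_size=2))
# ['once', 'time']
```

The running time becomes roughly the length of the book times the number of chunks, which is the speed being given up in exchange for bounded memory.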
Here's one way: convert b to a set, then keep only the elements of a that are not present in it:
from itertools import filterfalse  # named ifilterfalse in Python 2
a = ['a','b','c','d','e']
b = ['a','b','c']
c = list(filterfalse(set(b).__contains__, a))
# ['d', 'e']