
Finding the difference between two very large lists

I have two very large lists: one, call it a, is 331991 elements long, and the other, call it b, is 99171 elements long. I want to compare a to b and return a list of the elements in a that are not in b. This needs to be as efficient as possible, and the results should stay in the order in which they appear in a; that is probably a given, but I thought I may as well mention it.

It can be done in O(m + n) time on average, where m is the length of b and n is the length of a:

exclude = set(b)  # O(m)

new_list = [x for x in a if x not in exclude]  # O(n)

The key here is that sets have constant-time membership tests on average. Perhaps you could consider having b be a set to begin with.

See also: List Comprehension


Using your example:

>>> a = ['a','b','c','d','e']
>>> b = ['a','b','c','f','g']
>>> 
>>> exclude = set(b)
>>> new_list = [x for x in a if x not in exclude]
>>> 
>>> new_list
['d', 'e']
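
As a rough sketch of the same idea packaged for reuse, the helper below (the name `items_not_in` is made up here, not taken from the answer) builds the set once and keeps the elements of a in their original order, duplicates included. It assumes the elements are hashable.

```python
def items_not_in(a, b):
    """Return the elements of a that do not occur in b, in a's order."""
    exclude = set(b)  # built once, O(len(b)); elements must be hashable
    return [x for x in a if x not in exclude]  # O(len(a)) on average


a = ['d', 'a', 'e', 'b', 'd']
b = ['a', 'b', 'c', 'f', 'g']
result = items_not_in(a, b)
print(result)  # ['d', 'e', 'd']
```

Note that a plain set difference, set(a) - set(b), would lose both the order and the repeated 'd', which is why the list comprehension is used for the final pass.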

Let us assume:

book = ["once", "upon", "time", ...., "end", "of", "very", "long", "story"]
dct = ["alfa", "anaconda", .., "zeta-jones"]

And you want to remove from the book list all the items that are present in dct.

Quick solution:

short_story = [word for word in book if word not in dct]

Speeding up searches in dct: turn dct into a set, which has faster lookups:

dct = set(dct)
short_story = [word for word in book if word not in dct]

If the book is very long and does not fit into memory, you can process it word by word. For this, we can use a generator:

def story_words(fname):
    """fname is the name of a text file with a story"""
    with open(fname) as f:
        for line in f:
            for word in line.split():
                yield word

# print out the shortened story
for word in story_words("alibaba.txt"):
    if word not in dct:
        print(word)
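
To try the generator approach without a real story file, here is a minimal sketch that writes a tiny temporary file and filters it lazily. The sample text, the contents of dct, and the helper name `filtered_words` are made up for illustration.

```python
import os
import tempfile


def story_words(fname):
    """Yield the words of a text file one at a time."""
    with open(fname) as f:
        for line in f:
            for word in line.split():
                yield word


def filtered_words(fname, excluded):
    """Lazily yield the words of fname that are not in the set excluded."""
    return (word for word in story_words(fname) if word not in excluded)


# Made-up sample data, only for demonstration.
dct = {"upon", "time"}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("once upon a time\nthe end\n")
    path = tmp.name

try:
    short_story = list(filtered_words(path, dct))
finally:
    os.remove(path)  # clean up the temporary file

print(short_story)  # ['once', 'a', 'the', 'end']
```

Only one line of the file is held in memory at a time; the list() call is there just to show the result and would be dropped for a genuinely huge book.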

And if your dictionary were also far too large to fit in memory, you would have to give up speed and iterate over the contents of the dictionary as well. But I will skip that for now.
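
One possible reading of that last remark, sketched under assumptions the answer does not state: the book's words fit in memory, the dictionary is available as an iterable that is too large to hold in a single set, and it is consumed in fixed-size chunks. The helper name `remove_words_chunked` and the `chunk_size` parameter are hypothetical.

```python
from itertools import islice


def remove_words_chunked(book, dct_iter, chunk_size=100000):
    """Drop from book every word found in dct_iter, reading dct_iter in chunks."""
    remaining = list(book)
    dct_iter = iter(dct_iter)
    while True:
        # Hold at most chunk_size dictionary words in memory at once.
        chunk = set(islice(dct_iter, chunk_size))
        if not chunk:
            break
        # One pass over the surviving words per chunk; order is preserved.
        remaining = [word for word in remaining if word not in chunk]
    return remaining


book = ["once", "upon", "a", "time", "the", "end"]
dct = iter(["a", "end", "time", "zeta"])
print(remove_words_chunked(book, dct, chunk_size=2))  # ['once', 'upon', 'the']
```

The trade-off is the one the answer hints at: the book is rescanned once per chunk, so the work grows with the number of chunks instead of staying at a single pass.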

Here's one way: convert b to a set, then filter out of a the elements that are present in it:

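from itertools import filterfalse  # Python 3; in Python 2 this is ifilterfalse

a = ['a','b','c','d','e']
b = ['a','b','c']
# keep the items of a for which set(b).__contains__ returns False
c = list(filterfalse(set(b).__contains__, a))
# c == ['d', 'e']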
