Finding the difference between two very large lists

Question

I have two very large lists: one is 331991 elements long, let's call it a, and the other is 99171 elements long, call it b. I want to compare a to b and return a list of the elements in a that are not in b. This needs to be as efficient as possible, and the result should keep the elements in the order they appear in a; that is probably a given, but I thought I may as well mention it.

Answer 1

It can be done in O(m + n) time, where m is the length of b and n is the length of a:

exclude = set(b)  # O(m)

new_list = [x for x in a if x not in exclude]  # O(n)

The key here is that sets have constant-time membership tests (on average). Perhaps you could consider having b be a set to begin with.

See also: List Comprehension

Using your example:

>>> a = ['a','b','c','d','e']
>>> b = ['a','b','c','f','g']
>>> 
>>> exclude = set(b)
>>> new_list = [x for x in a if x not in exclude]
>>> 
>>> new_list
['d', 'e']
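
The same idea can be wrapped in a small helper. This is only a minimal sketch: the name `ordered_difference` is hypothetical (it does not come from the answer above), and it assumes the elements of b are hashable. Elements of a that occur more than once and are absent from b are all kept, in their original positions.

```python
def ordered_difference(a, b):
    """Return the elements of a that are not in b, in their original order."""
    exclude = set(b)  # built once; membership tests are O(1) on average
    return [x for x in a if x not in exclude]


a = ['a', 'b', 'c', 'd', 'e', 'd']
b = ['a', 'b', 'c', 'f', 'g']
result = ordered_difference(a, b)
print(result)
```

Building the set once, outside the comprehension, is what keeps the whole operation linear; calling `set(b)` inside the comprehension would rebuild it for every element of a.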

Answer 2

Let us assume:

book = ["once", "upon", "time", ...., "end", "of", "very", "long", "story"]
dct = ["alfa", "anaconda", .., "zeta-jones"]

And you want to remove from the book list all the items that are present in dct.

Quick solution:

short_story = [word for word in book if word not in dct]

Speeding up searches in dct: turn dct into a set, which has faster lookups:

dct = set(dct)
short_story = [word for word in book if word not in dct]

In case the book is very long and does not fit into memory, you may process it word by word. For this, we may use a generator:

def story_words(fname):
  """fname is name of text file with a story"""
  with open(fname) as f:
    for line in f:
      # split each line on whitespace and yield one word at a time
      for word in line.split():
        yield word

# print out shortened story
for word in story_words("alibaba.txt"):
  if word not in dct:
    print(word)

And in case your dictionary is also far too large, you would have to give up speed and iterate over the content of the dictionary as well. But this I skip for now.
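
One possible shape for that last case, which the answer itself does not work out, is to load the dictionary in chunks and filter the surviving words once per chunk. The sketch below is an assumption-laden illustration: the function name `filter_with_chunked_exclusions` and the `chunk_size` parameter are hypothetical, and it assumes the words being filtered still fit in memory even though the dictionary does not.

```python
from itertools import islice


def filter_with_chunked_exclusions(items, exclusions, chunk_size=100000):
    """Drop from items anything found in exclusions, one chunk at a time.

    Only chunk_size exclusion entries are held in a set at once; the price
    is one pass over the surviving items per chunk.
    """
    survivors = list(items)
    exclusions = iter(exclusions)
    while True:
        # take the next chunk_size entries from the (possibly huge) dictionary
        chunk = set(islice(exclusions, chunk_size))
        if not chunk:
            break
        survivors = [x for x in survivors if x not in chunk]
    return survivors


book = ["once", "upon", "a", "time", "the", "end"]
dct = ["a", "the", "zeta"]
short_story = filter_with_chunked_exclusions(book, dct, chunk_size=2)
print(short_story)
```

The order of the surviving words is preserved, but the running time grows with the number of chunks, which is the speed trade-off mentioned above.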

Answer 3

Here's one way: convert b to a set, then keep only the elements of a that are not present in it:

from itertools import ifilterfalse  # Python 2 name; in Python 3 it is itertools.filterfalse

a = ['a','b','c','d','e']
b = ['a','b','c']
c = list(ifilterfalse(set(b).__contains__, a))
# ['d', 'e']
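
The snippet above uses the Python 2 name `ifilterfalse`. As a sketch of the same approach on Python 3, where the function is called `itertools.filterfalse`, the code might look like this (the variable `exclude` is introduced here only for readability):

```python
from itertools import filterfalse  # Python 3 name for ifilterfalse

a = ['a', 'b', 'c', 'd', 'e']
b = ['a', 'b', 'c']
exclude = set(b)
# filterfalse keeps the items for which the predicate returns a false value,
# i.e. the items of a that are not in exclude
c = list(filterfalse(exclude.__contains__, a))
print(c)
```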

3 answers

Answer 1: score 7, accepted, 2013-10-09 21:59:18

Answer 2: score 1, 2013-10-09 22:02:30

Answer 3: score 0, 2013-10-09 21:59:41
