
Remove/list duplicates by comparing multiple lists in Python

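I know that removing/listing duplicates in a list has been asked before. My problem is getting it to compare multiple lists at the same time.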

lst = [item1, item2, item3, item4, item5]
a = [1,2,1,5,1]
b = [2,0,2,5,2]
c = [0,1,0,1,5]

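If these are my lists, I want to compare them index by index, the way the zip function would pair them up. I want to check which indices hold duplicates in list a (0, 2 and 4) and whether those same indices are also duplicates in the other lists. For example, indices 0, 2 and 4 are duplicates in list b as well, but in list c only 0 and 2 are duplicates. So I only want the items at indices 0 and 2 of lst in the result list: [item1, item3]

How would I adapt this definition to do that?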

def list_duplicates(seq):
  seen = set()
  seen_add = seen.add
  # adds all elements it doesn't know yet to seen and all other to seen_twice
  seen_twice = set( x for x in seq if x in seen or seen_add(x) )
  # turn the set into a list (as requested)
  return list( seen_twice )

a = [1,2,3,2,1,5,6,5,5,5]
list_duplicates(a) # yields [1, 2, 5]

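You are trying to determine which common indices hold duplicated values across several lists, rather than tracking the duplicated values themselves. That means that, in addition to tracking which items are duplicated in a given seq, we also need to track the indices at which those items were found. This is easy to add to the existing method: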

from collections import defaultdict

def list_duplicates(seq):
    seen = set()
    seen_twice = set()
    seen_indices = defaultdict(list)  # To keep track of seen indices
    for index, x in enumerate(seq):  # Can't use a comprehension now, too much logic in there.
        seen_indices[x].append(index)
        if x in seen:
            seen_twice.add(x)
        else:
            seen.add(x)
    print(seen_indices)
    return list( seen_twice )

if __name__ == "__main__":
    a = [1,2,3,2,1,5,6,5,5,5]   
    duped_items = list_duplicates(a)
    print(duped_items)

Output:

defaultdict(<type 'list'>, {1: [0, 4], 2: [1, 3], 3: [2], 5: [5, 7, 8, 9], 6: [6]})
[1, 2, 5]

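So now, in addition to the duplicated values themselves, we also track the indices at which each value was seen.

The next step is to apply this across multiple lists. We can take advantage of the fact that, after iterating over one list, we have already eliminated the indices that do not point to duplicated values, so in each subsequent list we only need to iterate over the indices that are still known to be duplicates. This requires reworking the logic slightly so that it iterates over the "possibly duplicated indices" rather than over the whole list: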
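As a side note, the index bookkeeping can be returned instead of printed. The following is only a minimal sketch of that idea; the name duplicate_indices is hypothetical and not part of the original answer, and the printed key order assumes Python 3.7+ (insertion-ordered dicts).

```python
from collections import defaultdict


def duplicate_indices(seq):
    """Map each value that occurs more than once in seq to its indices."""
    positions = defaultdict(list)
    for index, value in enumerate(seq):
        positions[value].append(index)
    # Keep only the values that were seen at least twice.
    return {value: idxs for value, idxs in positions.items() if len(idxs) > 1}


a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
print(duplicate_indices(a))
# {1: [0, 4], 2: [1, 3], 5: [5, 7, 8, 9]}
```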

def list_duplicates2(*seqs):
    val_range = range(0, len(seqs[0]))  # At first, all indices could be duplicates.
    for seq in seqs:
        # Set up is the same as before.
        seen_items = set()
        seen_twice = set()
        seen_indices = defaultdict(list)
        for index in val_range:  # Iterate over the possibly duplicated indices, not the whole sequence
            val = seq[index]
            seen_indices[val].append(index)
            if val in seen_items:
                seen_twice.add(val)
            else:
                seen_items.add(val)
        # Now that we've gone over the current val_range, we can create a
        # new val_range for the next iteration by only including the indices
        # in seq which contained values that we found at least twice in the
        # current val_range.
        val_range = [duped_index for seen_val in seen_twice for duped_index in seen_indices[seen_val]]
        print("new val_range is %s" % val_range)
    return val_range

if __name__ == "__main__":
    a = [1,2,1,5,1]
    b = [2,0,2,5,2]
    c = [0,1,0,1,5]
    duped_indices = list_duplicates2(a, b, c)
    print("duped_indices is %s" % duped_indices)

Output:

new val_range is [0, 2, 4]
new val_range is [0, 2, 4]
new val_range is [0, 2]
duped_indices is [0, 2]

Which is exactly what you wanted.
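The question asks for the matching items from lst rather than the indices, so one more lookup is needed. Below is a condensed sketch of the same narrowing idea, written for Python 3 and without the debug prints. The function name shared_duplicate_indices and the placeholder strings in lst are assumptions made for illustration, not part of the original post.

```python
from collections import defaultdict


def shared_duplicate_indices(*seqs):
    """Return the sorted indices that stay duplicated through every sequence."""
    candidates = list(range(len(seqs[0])))  # at first, every index is a candidate
    for seq in seqs:
        positions = defaultdict(list)
        for index in candidates:
            positions[seq[index]].append(index)
        # Keep only indices whose value occurred at least twice among the candidates.
        candidates = sorted(
            i for idxs in positions.values() if len(idxs) > 1 for i in idxs
        )
    return candidates


lst = ["item1", "item2", "item3", "item4", "item5"]  # placeholder strings
a = [1, 2, 1, 5, 1]
b = [2, 0, 2, 5, 2]
c = [0, 1, 0, 1, 5]

indices = shared_duplicate_indices(a, b, c)
print(indices)                    # [0, 2]
print([lst[i] for i in indices])  # ['item1', 'item3']
```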

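Search for duplicates in this list:

l = [(a[i], b[i], c[i]) for i in range(len(a))]  # tuples, because list_duplicates puts its elements in a set and lists are not hashable

For your example, it produces the following list:

[(1, 2, 0), (2, 0, 1), (1, 2, 0), (5, 5, 1), (1, 2, 5)]

Then:

dupes = list_duplicates(l)  # compute once instead of on every iteration
result = [lst[i] for (i, x) in enumerate(l) if x in dupes]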
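The same row-wise comparison can also be sketched with zip and collections.Counter, which avoids the helper function altogether. This is only an illustration: the placeholder strings in lst are an assumption, and the approach treats an index as a duplicate only when its whole (a, b, c) combination occurs more than once.

```python
from collections import Counter

lst = ["item1", "item2", "item3", "item4", "item5"]  # placeholder strings
a = [1, 2, 1, 5, 1]
b = [2, 0, 2, 5, 2]
c = [0, 1, 0, 1, 5]

# Pair up the values found at each index; tuples are hashable, so Counter can count them.
rows = list(zip(a, b, c))
counts = Counter(rows)

# Keep the items whose (a, b, c) combination occurs more than once.
result = [item for item, row in zip(lst, rows) if counts[row] > 1]
print(result)  # ['item1', 'item3']
```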

