
Efficiently remove duplicates, order-agnostic, from list of lists

The following list has some duplicated sublists, with elements in different order:

l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
    ['there', 'hi'],
    ['jumps', 'dog', 'over','lazy', 'the'],
]

How can I remove duplicates, retaining the first instance seen, to get:

l1 = [
    ['The', 'quick', 'brown', 'fox'],
    ['hi', 'there'],
    ['jumps', 'over', 'the', 'lazy', 'dog'],
]

I tried:

[list(i) for i in set(map(tuple, l1))]

Nevertheless, I do not know if this is the fastest way of doing it for large lists, and my attempt is not working as desired. Any idea of how to remove them efficiently?
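
For reference, a quick check of that attempt (assuming the l1 above; output order may differ, since sets are unordered) shows that it only collapses exact-order duplicates, so the reordered sublists survive:

# mapping each sublist to a tuple only matches exact element orderings,
# so none of the five sublists above are treated as duplicates
attempt = [list(i) for i in set(map(tuple, l1))]
print(len(attempt))  # 5 -- the order-agnostic duplicates are still there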

This one is a little tricky. You want to key a dict off of frozen counters, but counters are not hashable in Python. For a small degradation in the asymptotic complexity, you could use sorted tuples as a substitute for frozen counters:

seen = set()
result = []
for x in l1:
    key = tuple(sorted(x))
    if key not in seen:
        result.append(x)
        seen.add(key)

The same idea in a one-liner would look like this:

[*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
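
A quick sanity check (assuming the l1 and the result list built above) that the loop and the one-liner agree, keeping the first instance of each sublist:

# both forms keep the first-seen ordering of each set of words
one_liner = [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]
assert one_liner == result
print(one_liner)
# [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]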

I did a quick benchmark, comparing the various answers:

l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]

from collections import Counter

def method1():
    """manually construct set, keyed on sorted tuple"""
    seen = set()
    result = []
    for x in l1:
        key = tuple(sorted(x))
        if key not in seen:
            result.append(x)
            seen.add(key)
    return result

def method2():
    """frozenset-of-Counter"""
    return list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())

def method3():
    """wim"""
    return [*{tuple(sorted(k)): k for k in reversed(l1)}.values()][::-1]

from timeit import timeit

print(timeit(lambda: method1(), number=1000))
print(timeit(lambda: method2(), number=1000))
print(timeit(lambda: method3(), number=1000))

Prints:

0.0025010189856402576
0.016385524009820074
0.0026451340527273715

@wim's answer is inefficient, since it sorts each sublist as a way to uniquely identify its set of item counts, which costs O(n log n) in time complexity per sublist.

To achieve the same in linear time complexity, you can instead use a frozenset of item counts built with the collections.Counter class. Since a dict comprehension retains the last value for a duplicate key, and since you want to retain the first value for a duplicate key, you have to construct the dict over the list in reverse order, and reverse it again after the list of de-duplicated sublists has been constructed:

from collections import Counter
list({frozenset(Counter(lst).items()): lst for lst in reversed(l1)}.values())[::-1]

This returns:

[['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog']]
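
The Counter matters when a sublist can contain the same word more than once; a plain frozenset of the words would drop the counts and wrongly merge such sublists. A minimal sketch with made-up lists:

from collections import Counter

# hypothetical sublists that differ only in how many times 'hi' appears
a = ['hi', 'hi', 'there']
b = ['hi', 'there']

print(frozenset(a) == frozenset(b))  # True -- counts are lost, so these would be merged
print(frozenset(Counter(a).items()) == frozenset(Counter(b).items()))  # False -- counts preserved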

This:

l1 = [['The', 'quick', 'brown', 'fox'], ['hi', 'there'], ['jumps', 'over', 'the', 'lazy', 'dog'], ['there', 'hi'], ['jumps', 'dog', 'over','lazy', 'the']]
s = {tuple(item) for item in map(sorted, l1)}
l2 = [list(item) for item in s]

l2 gives the list with reverse duplicates removed. Compare with: Pythonic way of removing reversed duplicates in list
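
Note that with this approach the sublists come back as their sorted versions and in arbitrary set order, so a run might print (ordering may vary):

print(l2)
# e.g. [['The', 'brown', 'fox', 'quick'], ['dog', 'jumps', 'lazy', 'over', 'the'], ['hi', 'there']]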
