
Need to remove duplicates from a nested list preserving the order

I have a list like the one below:

L = [[1,2],[1,3],[1,4],[2,1],[2,5],[3,1],[3,2]]

The output should be

[[1,2],[1,3],[1,4],[2,5],[3,2]]

Please note that the order of the pairs, and of the elements within each pair, has to be preserved. In other words, I cannot sort the list, because the element order needs to be preserved within the pairs as well. For example, I need the last element [3,2] to be kept as-is; if I sorted and removed duplicates, it would become [2,3], which I do not want. Also, when I say I need to remove duplicates, [1,2] and [2,1] are considered duplicates of each other, and I want to keep [1,2].

The unique_everseen function in the itertools recipes in the docs does exactly what you need:

>>> lst = [[1,2],[1,3],[1,4],[2,1],[2,5],[3,1],[3,2]]
>>> list(unique_everseen(lst, key=frozenset))
[[1, 2], [1, 3], [1, 4], [2, 5], [3, 2]]

The basic idea is that it keeps a set of all values seen so far, and skips any value that's already in the set.

The key function works the same way as in sort, max, etc., as explained in the Sorting HOWTO. You want two lists that contain the same values in a different order to match, so we compare the set of each list's values instead of the lists themselves. (The reason we need frozenset rather than set is that set is mutable and therefore not hashable, so it can't be stored in a set.)
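
To see concretely why frozenset works as the key here, and why a plain set would not, here is a quick interactive check:

>>> frozenset([1, 2]) == frozenset([2, 1])
True
>>> seen = set()
>>> seen.add(frozenset([1, 2]))    # frozensets are hashable, so this works
>>> seen.add({2, 1})               # plain sets are not hashable
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'set'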


If you had more than 2 elements in your sublists, the question would be ambiguous. If you have, say, [1, 1, 2] and [1, 2, 2] , do you want them to be considered duplicates, or not?

  • If yes: Then you're treating them as sets, so use key=frozenset.
  • If no: Then you're treating them as multisets. The nicest Python implementation of multisets is collections.Counter, but there's no FrozenCounter (and building one just for this purpose is probably overkill). You can simulate one in a few ways:
    • key=lambda sublist: frozenset(Counter(sublist).items())
    • key=lambda sublist: sorted(Counter(sublist).items())
    • key=lambda sublist: tuple(sorted(sublist))

Since your initial thought was to sort the sublists—which was only unacceptable because you need to end up with the original value, not the sorted value—I think the last of these options is most likely to be the one you'd want, but that's still really just a guess.
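
For example, here is how those keys treat [1, 1, 2] and [1, 2, 2]: the frozenset key collapses them into duplicates, while the multiset-style keys keep them distinct. A quick check you can run yourself:

>>> from collections import Counter
>>> frozenset([1, 1, 2]) == frozenset([1, 2, 2])
True
>>> sorted(Counter([1, 1, 2]).items()) == sorted(Counter([1, 2, 2]).items())
False
>>> tuple(sorted([1, 1, 2])) == tuple(sorted([1, 2, 2]))
False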


You can just copy and paste the function from the docs into your code:

from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

… or install the third-party library more_itertools and use its unique_everseen from there. Alternatively, the third-party library toolz has an equivalent function named unique.
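
If you go the third-party route, the call looks essentially the same. A sketch, assuming more_itertools (PyPI package more-itertools) or toolz is installed:

>>> from more_itertools import unique_everseen
>>> list(unique_everseen([[1,2],[1,3],[1,4],[2,1],[2,5],[3,1],[3,2]], key=frozenset))
[[1, 2], [1, 3], [1, 4], [2, 5], [3, 2]]
>>> from toolz import unique
>>> list(unique([[1,2],[1,3],[1,4],[2,1],[2,5],[3,1],[3,2]], key=frozenset))
[[1, 2], [1, 3], [1, 4], [2, 5], [3, 2]]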

This works similarly to removing duplicates from a regular list while preserving order, but we need to do something about the sublists not being hashable and about the order of elements within each sublist being irrelevant to the comparison.

We can use frozen sets to solve both these issues at once.

>>> lst = [[1,2],[1,3],[1,4],[2,1],[2,5],[3,1],[3,2]] 
>>> seen = set()
>>> result = []
>>> 
>>> for x in lst:
...     s = frozenset(x)
...     if s not in seen:
...         result.append(x)
...         seen.add(s)
... 
>>> result
[[1, 2], [1, 3], [1, 4], [2, 5], [3, 2]]
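
If you need this in more than one place, the same loop can be wrapped in a small function; dedupe_pairs below is just an illustrative name:

def dedupe_pairs(pairs):
    """Drop sublists that repeat an earlier one (ignoring element order), keeping the first occurrence."""
    seen = set()
    result = []
    for x in pairs:
        s = frozenset(x)    # order-insensitive, hashable key
        if s not in seen:
            seen.add(s)
            result.append(x)
    return result

>>> dedupe_pairs([[1,2],[1,3],[1,4],[2,1],[2,5],[3,1],[3,2]])
[[1, 2], [1, 3], [1, 4], [2, 5], [3, 2]]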
