
How to remove duplicates, by sublist item subset, in a list of lists in Python?

I have a list of lists in Python defined like this: [[2, 3, 5], [3, 3, 1], [2, 3, 8]]. I want to delete the duplicate entries, where by "duplicate" I mean that the first two elements of two sublists match. For example, the first and third sublists both have 2 and 3 as their first and second elements, so I count the third as a duplicate. After removing it, I want the final list to be [[2, 3, 5], [3, 3, 1]]. Currently, I have something like this:

arr = [[2, 3, 5], [3, 3, 1], [2, 3, 8]]

first = [item[0] for item in arr]
second = [item[1] for item in arr]
zipped = zip(first, second)

This produces tuples holding the first two entries of each list (in Python 3, zip returns an iterator, so wrap it in list() to get an actual list). Now I could find the indices of the duplicate entries and remove those indices from the original list. But are there shorter ways to do what I want? If not, what is the best way to get the duplicate indices here?

Solution

You can use sets to accomplish this:

arr = [[2, 3, 5], [3, 3, 1], [2, 3, 8]]

used = set()
[used.add(tuple(x[:2])) or x for x in arr if tuple(x[:2]) not in used]

returns

[[2, 3, 5], [3, 3, 1]]

Notes

  1. The first expression, used.add(tuple(x[:2])) or x, is only evaluated when the if condition holds, that is, when the first two elements of the current sublist are not yet in used. Check out the docs on list comprehensions for more info.
  2. set.add always returns None, so used.add(tuple(x[:2])) or x always evaluates to x.
  3. We need to convert the first two elements of a sublist to a hashable type (e.g. a tuple), since a list is not hashable and cannot be stored in a set.
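
Notes 2 and 3 can be checked directly in a short sketch. The variable names (result, kept, raised) are only illustrative and are not part of the answer above:

```python
used = set()

# set.add mutates the set in place and returns None.
result = used.add((2, 3))
print(result)  # None

# None is falsy, so `None or row` evaluates to row itself.
row = [3, 3, 1]
kept = used.add((3, 3)) or row
print(kept is row)  # True

# A list is not hashable, so it cannot be added to a set directly.
raised = False
try:
    used.add([2, 3])
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

This is why the comprehension stores tuple(x[:2]) in used rather than the slice x[:2] itself.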

Finally, as @wim brings up, if you're not familiar with this pattern it can be difficult to understand, and in Python "Readability counts." So if you're writing code that will be shared, consider changing this to an explicit for loop or using another approach.

You can use collections.OrderedDict for an order-preserving de-dupe:

>>> from collections import OrderedDict
>>> L = [[2, 3, 5], [3, 3, 1], [2, 3, 8]]
>>> d = OrderedDict(((x[0], x[1]), x) for x in reversed(L))
>>> print(*d.values())
[2, 3, 5] [3, 3, 1]

To keep the last occurrence instead of the first, just remove reversed:

>>> OrderedDict(((x[0], x[1]), x) for x in L).values()
odict_values([[2, 3, 8], [3, 3, 1]])
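
On Python 3.7 and later a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict. The helper name dedupe_keep_first below is hypothetical, not from the answer. It relies on dict.setdefault, which only stores a value if the key is not already present, so the first occurrence wins and the result follows the order in which keys were first seen:

```python
def dedupe_keep_first(rows):
    """Keep the first row for each (row[0], row[1]) key, in original order."""
    by_key = {}
    for row in rows:
        # setdefault leaves an existing entry untouched, so later
        # rows with the same key are ignored.
        by_key.setdefault(tuple(row[:2]), row)
    return list(by_key.values())


L = [[2, 3, 5], [3, 3, 1], [2, 3, 8]]
print(dedupe_keep_first(L))  # [[2, 3, 5], [3, 3, 1]]
```

Unlike the reversed variant, this does not depend on where the duplicates happen to sit in the input: reversed(L) inserts keys starting from the end of the list, so for inputs without a late duplicate of the first key the values can come out in reverse order.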

Or use a plain old for-loop:

def dedupe(iterable):
    seen = set()
    for x in iterable:
        first, second, *rest = x
        if (first, second) not in seen:
            seen.add((first, second))
            yield x
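
Since dedupe is a generator, it has to be consumed, for example with list(). As a possible generalization, the key can be passed in as a function. The name dedupe_by and its key parameter are assumptions made for this sketch, not part of the answer:

```python
def dedupe_by(iterable, key):
    """Yield items whose key(item) has not been seen before."""
    seen = set()
    for x in iterable:
        k = key(x)  # must be hashable, e.g. a tuple
        if k not in seen:
            seen.add(k)
            yield x


arr = [[2, 3, 5], [3, 3, 1], [2, 3, 8]]
result = list(dedupe_by(arr, key=lambda row: tuple(row[:2])))
print(result)  # [[2, 3, 5], [3, 3, 1]]
```

The first occurrence of each key is kept and the original order is preserved, because items are yielded as they are encountered.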

