
Remove duplicates in each list of a list of lists

I have a list of lists:

a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
     [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
     [1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
     [5.0, 5.0, 5.0], 
     [1.0]
    ]

What I need to do is remove the duplicates within each inner list while keeping the original order of the elements, such as:

a = [[1.0],
     [2.0, 3.0, 4.0],
     [3.0, 5.0],
     [1.0, 4.0, 5.0],
     [5.0], 
     [1.0]
    ]

If order is important, you can just compare to the set of items seen so far:

a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
     [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
     [1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
     [5.0, 5.0, 5.0], 
     [1.0]]

for index, lst in enumerate(a):
    seen = set()
    a[index] = [i for i in lst if i not in seen and seen.add(i) is None]

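Here i is added to seen as a side effect, relying on Python's short-circuit evaluation of and: seen.add(i) is called only when the first check (i not in seen) evaluates to True. Since set.add always returns None, the "is None" test is always true and serves only to keep the whole condition truthy.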

Attribution: I saw this technique yesterday from @timgeb .
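For readers who find the side-effect trick hard to follow, the same idea can be written as an explicit loop. This is only a sketch; the helper name dedupe_keep_order and the shortened sample data are made up for illustration and are not part of the original answer.

```python
def dedupe_keep_order(items):
    """Return a new list without duplicates, keeping first occurrences in order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            # First time we meet this value: remember it and keep it.
            seen.add(item)
            result.append(item)
    return result


# Shortened sample in the same shape as the question's data.
sample = [[1.0, 1.0, 2.0], [3.0, 5.0, 3.0], [5.0, 5.0]]
deduped = [dedupe_keep_order(lst) for lst in sample]
print(deduped)  # [[1.0, 2.0], [3.0, 5.0], [5.0]]
```

Unlike the in-place loop above, this version builds new lists and leaves the input untouched.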

If you have access to OrderedDict (Python 2.7 onward), abusing it is a good way to do this:

import collections
import pprint

a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
     [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
     [1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
     [5.0, 5.0, 5.0], 
     [1.0]
    ]

b = [list(collections.OrderedDict.fromkeys(i)) for i in a]


pprint.pprint(b, width=40)

Outputs:

[[1.0],
 [2.0, 3.0, 4.0],
 [3.0, 5.0],
 [1.0, 4.0, 5.0],
 [5.0],
 [1.0]]
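On Python 3.7 and later, a plain dict is guaranteed to preserve insertion order, so the same approach should work without importing collections. A minimal sketch, assuming Python 3.7+ and using shortened sample data:

```python
# Assumes Python 3.7+, where dict keeps keys in insertion order.
a = [[2.0, 2.0, 3.0, 3.0, 4.0], [1.0, 4.0, 4.0, 5.0], [5.0, 5.0]]

# dict.fromkeys keeps the first occurrence of each key, in order.
b = [list(dict.fromkeys(lst)) for lst in a]
print(b)  # [[2.0, 3.0, 4.0], [1.0, 4.0, 5.0], [5.0]]
```

On older interpreters, OrderedDict.fromkeys as shown above remains the safe choice.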

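This works when the values in each list already appear in ascending order, because it sorts the unique values rather than keeping their original order.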

a = [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
 [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0],
 [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0],
 [1.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0],
 [5.0, 5.0, 5.0], 
 [1.0]
]

for i in range(len(a)):
    # set() drops the duplicates; sorted() returns them as an ascending list
    a[i] = sorted(set(a[i]))

print(a)

OUTPUT:

[[1.0], [2.0, 3.0, 4.0], [3.0, 5.0], [1.0, 4.0, 5.0], [5.0], [1.0]]
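To see where sorting a set and preserving the original order diverge, here is a small sketch with an input that is not in ascending order. The list is invented for illustration.

```python
from collections import OrderedDict

lst = [3.0, 1.0, 3.0, 2.0, 1.0]

# Sorting the set orders by value, not by first appearance.
by_value = sorted(set(lst))
print(by_value)  # [1.0, 2.0, 3.0]

# OrderedDict.fromkeys keeps each value at its first position.
by_first_seen = list(OrderedDict.fromkeys(lst))
print(by_first_seen)  # [3.0, 1.0, 2.0]
```

For the question's data the two results happen to coincide, since every inner list is already ascending.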

Inspired by DOSHI, here's another way. It is probably the best choice when there are only a few distinct elements (so sorted needs only a few index lookups); otherwise a method that remembers insertion order may be better:

b = [sorted(set(i), key=i.index) for i in a]

So, just to compare the methods, a seen set versus sorting a set by each value's first index in the original list:

>>> import timeit
>>> setup = 'l = [1,2,3,4,1,2,3,4,1,2,3,4]*100'
>>> timeit.repeat('sorted(set(l), key=l.index)', setup)
[23.231241687943111, 23.302754517266294, 23.29650511717773]
>>> timeit.repeat('seen = set(); [i for i in l if i not in seen and seen.add(i) is None]', setup)
[49.855933579601697, 50.171151882997947, 51.024657420945005]

Here we see that for a longer list with only a few distinct values near its start, the membership test that Jon's method runs for every element becomes relatively costly. Since l.index finds each value's first position almost immediately in this case, sorting the set by index is much more efficient.

However, after appending a few new values to the end of the list, we see that Jon's method bears almost no extra cost, whereas mine becomes about four times slower, because l.index has to scan nearly the whole list to find each of them:

>>> setup = 'l = [1,2,3,4,1,2,3,4,1,2,3,4]*100 + [8,7,6,5]'
>>> timeit.repeat('sorted(set(l), key=l.index)', setup)
[93.221347206941573, 93.013769266020972, 92.64512197257136]
>>> timeit.repeat('seen = set(); [i for i in l if i not in seen and seen.add(i) is None]', setup)
[51.042504915545578, 51.059295348750311, 50.979311841569142]

I think I'd prefer Jon's method with a seen set, given the bad lookup times for the index.
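The comparison can be rerun as a standalone script along these lines. This is a rough sketch: the function names by_index and by_seen are made up here, the repetition count is reduced to keep the run short, and wrapping the snippets in functions adds call overhead, so absolute timings will not match the figures quoted above.

```python
import timeit


def by_index(l):
    # Sort the unique values by the position of their first occurrence.
    return sorted(set(l), key=l.index)


def by_seen(l):
    # Keep an element only the first time it is encountered.
    seen = set()
    return [i for i in l if i not in seen and seen.add(i) is None]


short_tail = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4] * 100
long_tail = short_tail + [8, 7, 6, 5]

for name, data in [("short_tail", short_tail), ("long_tail", long_tail)]:
    # Both approaches must agree before their timings are worth comparing.
    assert by_index(data) == by_seen(data)
    t_index = min(timeit.repeat(lambda: by_index(data), number=1000, repeat=3))
    t_seen = min(timeit.repeat(lambda: by_seen(data), number=1000, repeat=3))
    print("%s: index=%.4fs seen=%.4fs" % (name, t_index, t_seen))
```

Taking the minimum of the repeats is a common way to reduce noise from other processes on the machine.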
