How can I remove duplicates in a list, keep the original order of the items and remember the first index of any item in the list?
For example, removing the duplicates from [1, 1, 2, 3] yields [1, 2, 3], but I need to remember the indices [0, 2, 3].
I am using Python 2.7.
I'd tackle this a little differently and use an OrderedDict, together with the fact that a list's index method returns the lowest index of an item.
>>> from collections import OrderedDict
>>> lst = [1, 1, 2, 3]
>>> d = OrderedDict((x, lst.index(x)) for x in lst)
>>> d
OrderedDict([(1, 0), (2, 2), (3, 3)])
If you need the list (with its duplicates removed) and the indices separately, you can simply issue:
>>> d.keys()
[1, 2, 3]
>>> d.values()
[0, 2, 3]
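Note that lst.index(x) rescans the list from the start for every element, so the one-liner above does work proportional to the square of the list length. A minimal sketch of a single-pass variant that still produces an OrderedDict is shown here; the helper name first_indices is hypothetical and not part of the original answer:

```python
from collections import OrderedDict

def first_indices(items):
    """Map each distinct item to the index of its first occurrence,
    preserving the order in which items first appear."""
    first = OrderedDict()
    for i, x in enumerate(items):
        # setdefault stores i only when x is not already a key,
        # so later duplicates never overwrite the first index.
        first.setdefault(x, i)
    return first

d = first_indices([1, 1, 2, 3])
unique = list(d.keys())     # [1, 2, 3]
indices = list(d.values())  # [0, 2, 3]
```

The list(...) calls are redundant on Python 2.7, where keys() and values() already return lists, but they keep the sketch working on Python 3 as well.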
Use enumerate to keep track of the index and a set to keep track of the elements already seen:
l = [1, 1, 2, 3]
inds = []
seen = set()
for i, ele in enumerate(l):
    # record the index only the first time an element appears
    if ele not in seen:
        inds.append(i)
        seen.add(ele)
If you want both the index and the element:
inds = []
seen = set()
for i, ele in enumerate(l):
    if ele not in seen:
        # store (index, element) pairs
        inds.append((i, ele))
        seen.add(ele)
Or if you want both in separate lists:
l = [1, 1, 2, 3]
inds, unq = [], []
seen = set()
for i, ele in enumerate(l):
    if ele not in seen:
        inds.append(i)    # first index of the element
        unq.append(ele)   # the element itself, in original order
        seen.add(ele)
Using a set is by far the fastest approach, as these timings show:
In [13]: l = [randint(1,10000) for _ in range(10000)]
In [14]: %%timeit
inds = []
seen = set()
for i, ele in enumerate(l):
    if ele not in seen:
        inds.append((i,ele))
        seen.add(ele)
....:
100 loops, best of 3: 3.08 ms per loop
In [15]: timeit OrderedDict((x, l.index(x)) for x in l)
1 loops, best of 3: 442 ms per loop
In [16]: l = [randint(1,10000) for _ in range(100000)]
In [17]: timeit OrderedDict((x, l.index(x)) for x in l)
1 loops, best of 3: 10.3 s per loop
In [18]: %%timeit
inds = []
seen = set()
for i, ele in enumerate(l):
    if ele not in seen:
        inds.append((i,ele))
        seen.add(ele)
....:
10 loops, best of 3: 22.6 ms per loop
So for 100k elements it is 10.3 seconds vs 22.6 ms. If you try anything larger with fewer dupes, like [randint(1,100000) for _ in range(100000)], you will have time to read a book. Creating two lists is marginally slower but still orders of magnitude faster than using list.index.
If you want to get a value at a time you can use a generator function:
def yield_un(l):
    seen = set()
    for i, ele in enumerate(l):
        if ele not in seen:
            # lazily yield (index, element) for each first occurrence
            yield (i, ele)
            seen.add(ele)
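A short usage sketch for the generator, repeated here so the snippet runs on its own. It shows pulling one pair at a time with next, and splitting all pairs into indices and values with zip; the variable names are only illustrative.

```python
def yield_un(l):
    seen = set()
    for i, ele in enumerate(l):
        if ele not in seen:
            yield (i, ele)
            seen.add(ele)

gen = yield_un([1, 1, 2, 3])
first = next(gen)   # (0, 1): the first unique element and its index
rest = list(gen)    # [(2, 2), (3, 3)]: whatever has not been consumed yet

# Split the pairs into two tuples in one go.
# Assumes the input has at least one element; on an empty list
# the unpacking below would raise ValueError.
inds, unq = zip(*yield_un([1, 1, 2, 3]))
# inds == (0, 2, 3), unq == (1, 2, 3)
```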