
Remove unique values from a list and keep only duplicates

I'm looking to run over a list of ids and return a list of any ids that occurred more than once. This was what I set up that is working:

singles = list(ids)        # work on a copy so ids itself is untouched
duplicates = []
while len(singles) > 0:
    elem = singles.pop()
    if elem in singles:    # linear scan over the remaining elements
        duplicates.append(elem)

But the ids list is likely to get quite long, and I realistically don't want a while loop predicated on an expensive len call if I can avoid it. (I could go the inelegant route and call len once, then just decrement it every iteration but I'd rather avoid that if I could).

The smart way to do this is to use a data structure that makes it easy and efficient, like Counter:

>>> import random
>>> from collections import Counter
>>> ids = [random.randrange(100) for _ in range(200)]
>>> counts = Counter(ids)
>>> dupids = [id for id in ids if counts[id] > 1]

Building the Counter takes O(N) time, as opposed to O(N log N) time for sorting, or O(N^2) for counting each element from scratch every time.
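If you want each duplicated id reported only once rather than once per occurrence, you can read it straight off the Counter; a minimal sketch continuing the session above (the variable name is illustrative):

>>> dup_ids_once = [id for id, count in counts.items() if count > 1]

This iterates over distinct ids only, so the overall cost is still O(N).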


As a side note:

But the ids list is likely to get quite long, and I realistically don't want a while loop predicated on an expensive len call if I can avoid it.

len is not expensive. It's constant time, and (at least on built-in types like list) it's about as fast as a function call can possibly get in Python, short of doing nothing at all.

The part of your code that's expensive is the elem in singles test inside the loop: for every element, you compare it against potentially every other element, which makes the whole loop quadratic.
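To illustrate, here is a minimal sketch of a single-pass variant that swaps the list membership test for set membership (average O(1) per lookup), making the whole scan linear; the variable names are illustrative:

ids = [1, 2, 3, 2, 3, 5]       # example input
seen = set()
duplicates = set()
for elem in ids:
    if elem in seen:
        duplicates.add(elem)   # seen before: it's a duplicate
    else:
        seen.add(elem)         # first sighting: remember it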

You could do it like this (note that ids.count rescans the whole list for every element, so this is also quadratic on long lists):

>>> ids = [1,2,3,2,3,5]
>>> set(i for i in ids if ids.count(i) > 1)
{2, 3}

I presume this will work faster:

occasions = {}
for id in ids:
    try:
        occasions[id] += 1
    except KeyError:
        occasions[id] = 1   # first occurrence counts as one
result = [id for id in ids if occasions[id] > 1]
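For comparison, the same counting pass can be written without the try/except using dict.get; a minimal sketch:

occasions = {}
for id in ids:
    occasions[id] = occasions.get(id, 0) + 1   # defaults to 0 if id hasn't been seen yet
result = [id for id in ids if occasions[id] > 1]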

Or use itertools.groupby:

>>> l=[1,1,2,2,2,3]
>>> from itertools import groupby
>>> print([key for key,group in groupby(l) if len(list(group)) > 1])
[1, 2]

For each group, check whether it contains more than one element: if it does, keep the key, otherwise skip it. Note that groupby only groups runs of equal adjacent elements, so sort the list first if it isn't already sorted, as in the sketch below.
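If the input isn't already sorted, sorting it first makes the same one-liner work; a sketch continuing the same session:

>>> unsorted = [3, 1, 2, 1, 2, 2]
>>> [key for key, group in groupby(sorted(unsorted)) if len(list(group)) > 1]
[1, 2]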

Or use pandas:

>>> import pandas as pd
>>> s=pd.Series(l)
>>> s[s.duplicated()].unique().tolist()
[1, 2]

It's fast on large inputs, because duplicated runs in pandas' compiled code rather than in a Python-level loop.
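A closely related idiom uses value_counts on the same Series; a minimal sketch (note that the result is ordered by frequency, not by first appearance):

>>> vc = s.value_counts()
>>> vc[vc > 1].index.tolist()
[2, 1]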

Documentation:

https://pandas.pydata.org/pandas-docs/stable/10min.html


https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.duplicated.html#pandas.Series.duplicated
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html#pandas.Series.unique

If you don't care about the order in which these ids are retrieved, an efficient approach consists of a sorting step (which is O(N log N)) followed by a single pass that keeps ids that are followed by themselves (which is O(N)). So this approach is O(N log N) overall.
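A minimal sketch of that sort-then-scan idea (the function name is illustrative):

def duplicate_ids(ids):
    ordered = sorted(ids)          # O(N log N): equal ids become adjacent
    dups = []
    for prev, cur in zip(ordered, ordered[1:]):   # O(N) single pass
        if prev == cur and (not dups or dups[-1] != cur):
            dups.append(cur)       # record each duplicated id once
    return dups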
