
Python: Remove duplicates for a specific item from list

I have a list of items and want to remove the duplicates of one particular item while keeping the duplicates of all the others. For example, I start with the following list:

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]

I want to remove any duplicates of 0 but keep the duplicates of 1 and 9. My current solution is the following:

mylist = [i for i in mylist if i != 0]
mylist.append(0)  # lists have append(), not add()

Is there a nice way of keeping one occurrence of 0 besides the following?

for i in mylist:
    if mylist.count(0) > 1:
        mylist.remove(0)

The second approach takes more than double the time for this example.

Clarification:

  • For now I don't care about the order of the items, since I sort the list after it has been created and cleaned, but that might change later.

  • For now I only need to remove duplicates of one specific item (0 in my example).

The solution:

[0] + [i for i in mylist if i]

looks good enough, except when 0 is not in mylist, in which case you're wrongly adding 0.

Besides, concatenating two lists like this isn't great performance-wise, because it builds an extra list. I'd do:

newlist = [i for i in mylist if i]
if len(newlist) != len(mylist):  # 0 was removed, add it back
    newlist.append(0)

(or use filter: newlist = list(filter(None, mylist)), which could be slightly faster because in CPython the loop runs in C instead of in Python bytecode)

Appending at the end of a list is very efficient (the list object over-allocates, so most of the time no memory is copied). The length test is O(1) and avoids the O(n) membership test 0 in mylist.
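One caveat: the `if i` filter drops every falsy value (0, 0.0, False, None, empty strings), so it only fits the case where 0 is the item being deduplicated. A minimal sketch of the same length-test idea for an arbitrary item follows; the function name `keep_one_at_end` is hypothetical, not from the answer above.

```python
def keep_one_at_end(values, item):
    """Drop every occurrence of item, then append one back if any was dropped.

    Hypothetical helper: generalizes the length-test trick to any item.
    """
    kept = [v for v in values if v != item]
    if len(kept) != len(values):  # at least one occurrence was removed
        kept.append(item)
    return kept


mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]
print(keep_one_at_end(mylist, 0))  # [4, 1, 2, 6, 1, 9, 8, 9, 0]
print(keep_one_at_end([1, 2], 0))  # [1, 2]
```

As in the answer, the surviving item ends up at the end of the list, so this only suits the case where order does not matter.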

If performance is an issue and you are happy to use a 3rd party library, use numpy.

The Python standard library is great for many things. Computation on numeric arrays is not one of them.

import numpy as np

mylist = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9])

mylist = np.delete(mylist, np.where(mylist == 0)[0][1:])

# array([4, 1, 2, 6, 1, 0, 9, 8, 9])

Here the first argument of np.delete is the input array. The second argument finds the indices of all occurrences of 0 and slices off the first one with [1:], so every 0 except the first is deleted.

Performance benchmarking

Tested on Python 3.6.2 / NumPy 1.13.1. Performance will be system- and array-specific.

%timeit jp(myarr.copy())         # 183 µs
%timeit vui(mylist.copy())       # 393 µs
%timeit original(mylist.copy())  # 1.85 s

import numpy as np

myarr = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9] * 1000)
mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9] * 1000

def jp(myarr):
    return np.delete(myarr, np.where(myarr == 0)[0][1:])

def vui(mylist):
    return [0] + list(filter(None, mylist))

def original(mylist):
    for i in mylist:
        if mylist.count(0) > 1:
            mylist.remove(0)

    return mylist
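`%timeit` is an IPython magic. Outside IPython, a rough equivalent can be sketched with the standard `timeit` module. The function names below are hypothetical stand-ins for `vui` and `original`, the input is smaller than in the benchmark above, and no timings are claimed here since they depend on the machine.

```python
import timeit

data = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9] * 100


def filter_then_prepend(values):
    # Same idea as vui() above: drop falsy values, put a single 0 in front.
    return [0] + list(filter(None, values))


def count_and_remove(values):
    # Same idea as original() above; count() and remove() each scan the list.
    for _ in values:
        if values.count(0) > 1:
            values.remove(0)
    return values


runs = 5
for func in (filter_then_prepend, count_and_remove):
    # Pass a fresh copy each run because count_and_remove mutates its input.
    seconds = timeit.timeit(lambda: func(data.copy()), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```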

It sounds like a better data structure for you to use would be collections.Counter (which is in the standard library):

import collections

counts = collections.Counter(mylist)
counts[0] = 1
mylist = list(counts.elements())
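Two things to keep in mind with this approach: `elements()` groups equal values together, so the original positions are lost, and `counts[0] = 1` adds a 0 even when the list had none. A small sketch with a guard for the second point; `dedup_with_counter` is a hypothetical name.

```python
import collections


def dedup_with_counter(values, item):
    """Hypothetical helper: cap the count of item at 1 without adding it."""
    counts = collections.Counter(values)
    if counts[item] > 1:  # a missing key reads as 0 and is not inserted
        counts[item] = 1
    return list(counts.elements())


# On Python 3.7+ elements() follows first-occurrence order of the keys.
print(dedup_with_counter([4, 1, 2, 6, 1, 0, 9, 8, 0, 9], 0))
# [4, 1, 1, 2, 6, 0, 9, 9, 8]
print(dedup_with_counter([1, 2], 0))  # [1, 2]
```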

Here is a generator-based approach with O(n) complexity that also preserves the order of the original list:

In [62]: def remove_dup(lst, item):
    ...:     temp = [item]
    ...:     for i in lst:
    ...:         if i != item:
    ...:             yield i
    ...:         elif temp:  # i == item here; temp is non-empty only the first time
    ...:             yield temp.pop()
    ...:

In [63]: list(remove_dup(mylist, 0))
Out[63]: [4, 1, 2, 6, 1, 0, 9, 8, 9]
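The one-element `temp` list acts as a "not seen yet" flag. The same generator can be sketched with an explicit boolean, which some readers may find easier to follow; `keep_first` is a hypothetical name.

```python
def keep_first(values, item):
    """Yield values in order, emitting item only the first time it appears."""
    seen = False  # becomes True once item has been yielded
    for v in values:
        if v != item:
            yield v
        elif not seen:
            seen = True
            yield v


mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]
print(list(keep_first(mylist, 0)))  # [4, 1, 2, 6, 1, 0, 9, 8, 9]
```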

Also, if you are dealing with larger lists, you can use the following vectorized approach with NumPy:

In [80]: arr = np.array([4, 1, 2, 6, 1, 0, 9, 8, 0, 9])

In [81]: mask = arr == 0

In [82]: first_ind = np.where(mask)[0][0]

In [83]: mask[first_ind] = False

In [84]: arr[~mask]
Out[84]: array([4, 1, 2, 6, 1, 0, 9, 8, 9])

Slicing should do it. As a reminder of the notation:

a[start:end] # items start through end-1
a[start:]    # items start through the rest of the list
a[:end]      # items from the beginning through end-1
a[:]         # a copy of the whole list

Input:

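mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9, 0, 0, 9, 2, 2]
pos = mylist.index(0)  # position of the first 0
# keep everything up to and including the first 0, then drop later 0s
nl = mylist[:pos + 1] + [i for i in mylist[pos + 1:] if i != 0]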

print(nl)

Output: [4, 1, 2, 6, 1, 0, 9, 8, 9, 9, 2, 2]
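`list.index` raises `ValueError` when the item is not in the list, so the snippet above assumes at least one 0 is present. A hedged sketch that wraps the same slicing idea in a function and handles the missing case; `keep_first_by_index` is a hypothetical name.

```python
def keep_first_by_index(values, item):
    """Keep the first occurrence of item and drop all later ones."""
    try:
        pos = values.index(item)
    except ValueError:
        return list(values)  # item absent: nothing to deduplicate
    head = values[:pos + 1]  # up to and including the first occurrence
    tail = [v for v in values[pos + 1:] if v != item]
    return head + tail


mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9, 0, 0, 9, 2, 2]
print(keep_first_by_index(mylist, 0))  # [4, 1, 2, 6, 1, 0, 9, 8, 9, 9, 2, 2]
```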

You can use this:

desired_value = 0
mylist = [i for i in mylist if i != desired_value] + [desired_value]

You can change the desired value, or make it a list of values like this:

desired_value = [0, 6]
mylist = [i for i in mylist if i not in desired_value] + desired_value
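Note that `+ desired_value` appends every listed value, including ones that never appeared in mylist. A minimal sketch that only re-adds values that were actually present, assuming the items are hashable; `dedup_to_end` is a hypothetical name.

```python
def dedup_to_end(values, targets):
    """Remove all targets from values, then append one copy of each found."""
    target_set = set(targets)  # assumption: items are hashable
    kept = [v for v in values if v not in target_set]
    found = target_set.intersection(values)  # targets that really occur
    return kept + [t for t in targets if t in found]


mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]
print(dedup_to_end(mylist, [0, 6]))  # [4, 1, 2, 1, 9, 8, 9, 0, 6]
print(dedup_to_end(mylist, [0, 7]))  # [4, 1, 2, 6, 1, 9, 8, 9, 0]
```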

Maybe you can use filter:

[0] + list(filter(lambda x: x != 0, mylist))

You can use an itertools.count counter, which returns 0, 1, ... each time next() is called on it:

from itertools import count

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]

counter = count()

# next(counter) will be called each time i == 0
# it will return 0 the first time, so only the first time
# will 'not next(counter)' be True
out = [i for i in mylist if i != 0 or not next(counter)]
print(out)

# [4, 1, 2, 6, 1, 0, 9, 8, 9]

The order is kept, and it can be easily modified to deduplicate an arbitrary number of values:

from itertools import count

mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]

items_to_dedup = {1, 0}
counter = {item: count() for item in items_to_dedup}

out = [i for i in mylist if i not in items_to_dedup or not next(counter[i])]
print(out)

# [4, 1, 2, 6, 0, 9, 8, 9]

Here's a one-liner for it, where m is the number that should occur only once; the order is kept:

[x for i, x in enumerate(mylist) if mylist.index(x) == i or x != m]

Result

[4, 1, 2, 6, 1, 0, 9, 8, 9]
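The one-liner calls `mylist.index(x)` for every element, and each call scans the list, so it is quadratic in the worst case. A sketch of a variant that looks up the first position of m only once; the variable name `first` is illustrative.

```python
mylist = [4, 1, 2, 6, 1, 0, 9, 8, 0, 9]
m = 0

# Find the first position of m once; -1 means m does not occur at all.
first = mylist.index(m) if m in mylist else -1

# Keep every element that is not m, plus the single m at position `first`.
out = [x for i, x in enumerate(mylist) if x != m or i == first]
print(out)  # [4, 1, 2, 6, 1, 0, 9, 8, 9]
```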
