I wrote a function to remove "duplicates" from a list of lists.
The elements of my list are:
[ip, email, phone number].
I would like to remove the sublists that have the same EMAIL and PHONE NUMBER; I don't really care about the IP address.
The solution that I currently use is:
def remove_duplicate_email_phone(data):
    for i in range(len(data)):
        for j in reversed(range(i + 1, len(data))):
            if data[i][1] == data[j][1] and data[i][2] == data[j][2]:
                data.pop(j)
    return data
I would like to optimize this. It took more than 30 minutes to get the result.
Your approach does a full scan for each and every element in the list, making it take O(N**2) (quadratic) time. The list.pop(index) call is also expensive, because everything following index has to be shifted down by one, adding extra linear work for every removal.
Use a set and add (email, phone) tuples to it to check whether you have already seen that pair; testing containment against a set takes O(1) constant time, so you can clean out the duplicates in O(N) total time:
def remove_duplicate_email_phone(data):
    seen = set()
    cleaned = []
    for ip, email, phone in data:
        if (email, phone) in seen:
            continue
        cleaned.append([ip, email, phone])
        seen.add((email, phone))
    return cleaned
This produces a new list; the old list is left untouched.
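For example, calling it on a few made-up rows (the IPs, addresses and numbers below are purely illustrative) might look like this:

data = [
    ["10.0.0.1", "a@b.com", "555-0100"],
    ["10.0.0.2", "a@b.com", "555-0100"],  # same email/phone, different IP
    ["10.0.0.3", "c@d.com", "555-0200"],
]

print(remove_duplicate_email_phone(data))
# [['10.0.0.1', 'a@b.com', '555-0100'], ['10.0.0.3', 'c@d.com', '555-0200']]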
Another solution might be to use groupby; note that the data has to be sorted on the same key first so that duplicates end up adjacent.

from itertools import groupby
from operator import itemgetter

deduped = []
data.sort(key=itemgetter(1, 2))
for k, v in groupby(data, key=itemgetter(1, 2)):
    deduped.append(list(v)[0])
or using a list comprehension:
deduped = [next(v) for k, v in groupby(data, key=itemgetter(1,2))]
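Putting that together, a minimal sketch of the groupby route on the same illustrative rows as above:

from itertools import groupby
from operator import itemgetter

data = [
    ["10.0.0.1", "a@b.com", "555-0100"],
    ["10.0.0.2", "a@b.com", "555-0100"],
    ["10.0.0.3", "c@d.com", "555-0200"],
]

data.sort(key=itemgetter(1, 2))  # sort on (email, phone) so duplicates are adjacent
deduped = [next(v) for k, v in groupby(data, key=itemgetter(1, 2))]
print(deduped)
# [['10.0.0.1', 'a@b.com', '555-0100'], ['10.0.0.3', 'c@d.com', '555-0200']]

Because of the sort this is O(N log N), and unlike the set-based version it does not preserve the original order of the rows.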
Another approach could be to use a Counter; note that this keeps only the rows whose (email, phone) pair occurs exactly once, which is not the same as keeping the first row of each group:

from collections import Counter

data = [(1, "a@b.com", 1234), (2, "a@b.com", 1234), (3, "c@d.com", 5678)]
counts = Counter(i[1:] for i in data)
print([i for i in data if counts[i[1:]] == 1])  # rows with a unique (email, phone) pair
# [(3, 'c@d.com', 5678)]
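If the goal is simply to see which (email, phone) pairs occur more than once, Counter.most_common can report that directly; a minimal sketch using the same illustrative data:

from collections import Counter

data = [(1, "a@b.com", 1234), (2, "a@b.com", 1234), (3, "c@d.com", 5678)]
counts = Counter(row[1:] for row in data)

# pairs that appear more than once, most frequent first
print([pair for pair, n in counts.most_common() if n > 1])
# [('a@b.com', 1234)]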