
Remove duplicates from a list of lists based on a subset of each list

I wrote a function to remove "duplicates" from a list of lists.

The elements of my list are:

[ip, email, phone number].

I would like to remove the sublists that have the same EMAIL and PHONE NUMBER; I don't really care about the IP address.

The solution that I currently use is:

def remove_duplicate_email_phone(data):
    for i in range(len(data)):
        for j in reversed(range(i+1,len(data))):
            if data[i][1] == data[j][1] and data[i][2] == data[j][2] :
                data.pop(j)
    return data

I would like to optimize this. It took more than 30 minutes to get the result.

Your approach does a full scan for each and every element in the list, making it take O(N**2) (quadratic) time. The list.pop(index) call is also expensive, as everything following index has to be shifted down, adding up to O(N) extra work per removal on top of the quadratic scan.

Use a set and add (email, phonenumber) tuples to it to check if you have already seen that pair; testing containment against a set takes O(1) constant time, so you can clean out the duplicates in O(N) total time:

def remove_duplicate_email_phone(data):
    seen = set()
    cleaned = []
    for ip, email, phone in data:
        if (email, phone) in seen:
            continue
        cleaned.append([ip, email, phone])
        seen.add((email, phone))
    return cleaned

This produces a new list; the old list is left untouched.
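A quick sanity check of the set-based function on a small sample (the rows below are made up for illustration):

```python
def remove_duplicate_email_phone(data):
    seen = set()
    cleaned = []
    for ip, email, phone in data:
        if (email, phone) in seen:
            continue
        cleaned.append([ip, email, phone])
        seen.add((email, phone))
    return cleaned

# Made-up sample rows: [ip, email, phone]
rows = [
    ["1.2.3.4", "a@b.com", "1234"],
    ["5.6.7.8", "a@b.com", "1234"],  # same email/phone, different IP -> dropped
    ["9.9.9.9", "c@d.com", "5678"],
]
print(remove_duplicate_email_phone(rows))
# [['1.2.3.4', 'a@b.com', '1234'], ['9.9.9.9', 'c@d.com', '5678']]
```

Only the first row of each (email, phone) pair survives, so the original ordering is preserved.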

Another solution might be to use groupby.

from itertools import groupby
from operator import itemgetter

deduped = []

data.sort(key=itemgetter(1,2))
for k, v in groupby(data, key=itemgetter(1, 2)):
    deduped.append(list(v)[0])

or using a list comprehension:

deduped = [next(v) for k, v in groupby(data, key=itemgetter(1,2))]
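Putting the groupby version together on a small made-up sample. Note that groupby only merges adjacent rows with equal keys, so the sort on (email, phone) is required first, and sorting makes this approach O(N log N) overall:

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample rows: (ip, email, phone)
data = [
    ("1.2.3.4", "a@b.com", "1234"),
    ("5.6.7.8", "a@b.com", "1234"),
    ("9.9.9.9", "c@d.com", "5678"),
]

# groupby only merges *adjacent* equal keys, so sort on (email, phone) first.
data.sort(key=itemgetter(1, 2))
deduped = [next(v) for k, v in groupby(data, key=itemgetter(1, 2))]
print(deduped)
# [('1.2.3.4', 'a@b.com', '1234'), ('9.9.9.9', 'c@d.com', '5678')]
```

Because Python's sort is stable, rows with the same (email, phone) keep their relative order, so next(v) still picks the first occurrence of each pair. Unlike the set-based version, this mutates data by sorting it.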

Another approach could be to use a Counter. Note that this filters the data down to rows whose (email, phone) pair occurs exactly once; unlike the approaches above, it does not keep one copy of each duplicated pair.

from collections import Counter

data = [(1, "a@b.com", 1234), (1, "a@b.com", 1234), (2, "c@d.com", 5678)]
counts = Counter(i[1:] for i in data)

print([i for i in data if counts[i[1:]] == 1])  # keep rows whose (email, phone) occurs exactly once
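If the goal is instead to keep one copy of each (email, phone) pair, as in the earlier approaches, a plain dict keyed on the pair also works in one pass. This is a sketch of that variant (sample rows made up); because later rows overwrite earlier ones, it keeps the last occurrence of each pair rather than the first:

```python
# Made-up sample rows: (ip, email, phone)
data = [
    (1, "a@b.com", 1234),
    (1, "a@b.com", 1234),
    (2, "c@d.com", 5678),
]

# row[1:] is the (email, phone) key; later rows overwrite earlier ones,
# so this keeps the *last* occurrence of each pair.
deduped = list({row[1:]: row for row in data}.values())
print(deduped)
# [(1, 'a@b.com', 1234), (2, 'c@d.com', 5678)]
```

Dicts preserve insertion order in Python 3.7+, so the result still follows the order in which each pair first appeared.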

Statement: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM