Python: Efficiently and elegantly remove all duplicates from a large list

Question

I have a list of xy coordinates stored as a list of lists:

print(xy[0:10])

[[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

Only the first 10 are shown here, but the list contains about 100,000 coordinate pairs.

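I want to remove every duplicated sub-list from this list, and do it efficiently. As an easier-to-follow example, I want to write a function remove_dupes that produces the following result: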

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
b = remove_dupes(a)
print(b)
# [[5, 6], [8, 9]]

Note that preserving the order is important.

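However, because the list is so large, I found that using the .count() method while iterating over the list is very time-consuming. I also tried various tricks with set() and numpy's unique function.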

This is the fastest version I could come up with:

import numpy as np
from collections import Counter

# xy = [[x1,y1], [x2,y2], ... [xn, yn]]

def remove_dupes(xy):

    xy = [tuple(p) for p in xy] # Tuple-ize the coordinate list for hashing

    p_hash = [hash(p) for p in xy] # Hash each coordinate pair to a single value

    counts = Counter(p_hash) # Get counts (dictionary) for each unique hash value

    p_remove = [key for (key, value) in counts.items() if value > 1] # Store keys with count > 1

    p_hash = np.array(p_hash) # Cast from list to numpy array

    remove = np.zeros(len(xy), dtype=bool) # Initialize storage (plain bool works across NumPy versions)

    for p in p_remove: # Loop through each non-unique hash and set all the indices where it appears to True // Most time-consuming portion
        remove[p_hash == p] = True

    xy = np.array(xy) # Cast back to numpy array for indexing

    xy = xy[~remove, :]  # Keep only the non-duplicates

    return xy

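For about 100,000 values this takes about 2 seconds (longer if there are more duplicated pairs, triples, and so on). What bothers me is that functions like numpy.unique() return counts and indices in a fraction of a second, yet I can't figure out how to combine their outputs to solve this problem. I have gone through a couple dozen similar Stackexchange posts but found nothing that is both efficient and elegant. Does anyone have a suggestion for a more elegant approach than the one I've presented here?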
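For reference, one way the outputs of numpy.unique() might be combined is sketched below. It is only a sketch: the function name remove_dupes_np is made up for illustration, and it assumes the points form a rectangular numeric array. The inverse index maps each row to its unique row, so looking up the count through it gives a per-row occurrence count.

```python
import numpy as np

def remove_dupes_np(xy):
    """Keep only rows that occur exactly once, preserving the original order."""
    arr = np.asarray(xy)
    # inverse[i] is the index of row i within the unique rows;
    # counts[j] is how often unique row j occurs.
    _, inverse, counts = np.unique(
        arr, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against shape differences between NumPy versions
    keep = counts[inverse] == 1    # True for rows whose value appears only once
    return arr[keep]

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
print(remove_dupes_np(a).tolist())  # [[5, 6], [8, 9]]
```

The result is a numpy array, so call .tolist() if a list of lists is needed.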

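Edit:

I received two answers that give the correct result (and preserve order). RafaelC provided a Pandas option and DYZ provided a Counter option. I'm not very familiar with how to time code properly, but I ran each one for 100 iterations (on the same data), with the following results (using time.time()):

Pandas: 13.02 seconds

Counter: 28.15 seconds

I'm not sure why Pandas is faster. One difference is that the Pandas solution returns tuples (which is fine), so I tried the Counter solution without converting back to lists, and it still took 25 seconds.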
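On the timing question, the standard-library timeit module is usually a more reliable tool than calling time.time() by hand. A minimal sketch follows; the function candidate and the random test data are hypothetical stand-ins, not the data from the question, so swap in whichever solution and list you want to measure.

```python
import random
import timeit
from collections import Counter

def candidate(points):
    """Stand-in for the function being timed: keep points that occur exactly once."""
    as_tuples = [tuple(p) for p in points]
    counts = Counter(as_tuples)
    return [t for t in as_tuples if counts[t] == 1]

# Hypothetical data: 100,000 integer pairs from a small range, so many pairs repeat.
random.seed(0)
data = [[random.randrange(300), random.randrange(300)] for _ in range(100_000)]

# Each entry of `timings` is the total time for `number` calls.
# The minimum over the repeats is the estimate least affected by system noise.
number = 3
timings = timeit.repeat(lambda: candidate(data), number=number, repeat=3)
best_per_call = min(timings) / number
print(f"best of 3 runs: {best_per_call * 1000:.1f} ms per call")
```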

Answer 1

I would use pandas:

import pandas as pd

s = pd.Series(list(map(tuple, l)))  # l is the list of [x, y] pairs
s[~s.duplicated(keep=False)].tolist()

This takes

211 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

for 100,000 entries, so roughly a 10x speedup.

Answer 2

Consider using a Counter:

from collections import Counter

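First, convert your lists to tuples, because tuples are immutable and therefore hashable. Then count the tuples and select only those that occur exactly once. This gives the set of non-duplicates: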

nodups = {k for k,cnt in Counter(map(tuple, a)).items() if cnt == 1}

Now, since order matters, filter the original list against the set of non-duplicates:

[list(k) for k in map(tuple, a) if k in nodups]
#[[5, 6], [8, 9]]
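Put together, the two steps above could be wrapped into the remove_dupes function the question asks for. This is just a sketch of one way to package them; the parameter name points is made up for illustration, and the tuples are built once and reused rather than mapped twice.

```python
from collections import Counter

def remove_dupes(points):
    """Return the points that occur exactly once, in their original order."""
    as_tuples = [tuple(p) for p in points]  # tuples are hashable, lists are not
    # Step 1: the set of points that occur exactly once.
    nodups = {k for k, cnt in Counter(as_tuples).items() if cnt == 1}
    # Step 2: filter the original sequence, which keeps the original order.
    return [list(k) for k in as_tuples if k in nodups]

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
print(remove_dupes(a))  # [[5, 6], [8, 9]]
```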

Answer 3

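In Python 3.6+, dictionaries preserve insertion order (an implementation detail of CPython 3.6, guaranteed by the language from 3.7 on), so DYZ's Counter solution can be greatly improved: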

[list(k) for k, c in Counter(map(tuple, a)).items() if c == 1]

On my machine, this is faster than the pandas solution.

RafaelC's pandas solution can also be sped up considerably. The key is to switch from a Series to a DataFrame:

def remove_dupes(a):
    s = pd.DataFrame(a)
    return s[~s.duplicated(keep=False)].values.tolist()

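On my machine, this is almost twice as fast as the original pandas solution. The key to the speedup is avoiding the preparation work done outside pandas ( list(map(tuple, l)) ).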

Answer 4

I have an efficient solution that uses only the standard library:

import itertools
xy = [[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

xy.sort() # sort in place so that equal points become adjacent
sorted_data = [key for key, _ in itertools.groupby(xy)] # keep one point per group of equal points

Note: I tested two approaches, numpy and itertools. NumPy took 13 seconds on data of length 10,000,000, while itertools took 1 second on data of the same length.
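Note that, as written, the sort-and-groupby code keeps one copy of each point and returns the points in sorted order, whereas the question asks to drop every repeated point and keep the original order. A sketch of how groupby might be adapted to that requirement is below; the function name remove_dupes_groupby is made up for illustration, and the original index is carried along so the order can be restored after grouping.

```python
from itertools import groupby

def remove_dupes_groupby(points):
    """Drop every point that occurs more than once; keep the original order."""
    # Pair each point with its position, then sort so equal points are adjacent.
    indexed = sorted((tuple(p), i) for i, p in enumerate(points))
    singles = []
    for _, group in groupby(indexed, key=lambda pair: pair[0]):
        members = list(group)
        if len(members) == 1:  # the point appears exactly once
            singles.append(members[0])
    singles.sort(key=lambda pair: pair[1])  # restore the original order
    return [list(point) for point, _ in singles]

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
print(remove_dupes_groupby(a))             # [[5, 6], [8, 9]]
print([k for k, _ in groupby(sorted(a))])  # [[1, 2], [3, 4], [5, 6], [8, 9]]
```

The second print shows what the plain sort-and-groupby approach returns for the same input.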
