
How can I efficiently remove duplicates from a large list in Python?

I need to remove every duplicate item from a list of more than 100 million items. I tried converting the list to a set and back again, but that is far too slow and memory-intensive. Are there any other effective ways to achieve this?
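For reference, the set-based approach described above is presumably something like the following minimal sketch (a hypothetical reconstruction; the poster's actual code is not shown). It explains the memory pressure: the full set must live in memory alongside the original list.

    def dedupe_with_set(items):
        # Builds a set of all 100M+ items in memory at once,
        # then materializes a new list; order is not preserved.
        return list(set(items))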

If you're willing to sort your list, then this is fairly trivial. Sort it first, then take the unique items. This is the same approach as sort | uniq in the shell, and it can be quite memory-efficient (an external sort can spill to disk; Python's built-in sort, of course, works in memory).
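A minimal sketch of the sort-then-unique idea, assuming the data fits in memory for Python's built-in sort (an external/on-disk sort would be needed otherwise):

    from itertools import groupby

    def sorted_unique(items):
        # Sorting makes equal items adjacent; groupby then yields one
        # (value, group) pair per run of equal values, like `sort | uniq`.
        items.sort()
        return [value for value, _group in groupby(items)]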

Itertools Recipes

import operator
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBcCAD', str.lower) --> A B c A D
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))
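Combined with the sorting approach above, usage might look like the following sketch (assuming the list has already been sorted so that duplicates are adjacent):

    data = [3, 1, 2, 3, 2, 1]
    data.sort()                            # duplicates become adjacent
    deduped = list(unique_justseen(data))  # [1, 2, 3]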

Is there a reason you care that this is slow? If you need to perform this operation often, then something is wrong with the way you are handling your data.
