
How can I efficiently remove duplicates from a large list in Python?

I need to remove every duplicate item from a list of more than 100 million items. I tried converting the list to a set and back again, but that is far too slow and memory-intensive. Are there any other effective ways to achieve this?
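For reference, the set-based approach described above is presumably something like the following minimal sketch (a hypothetical reconstruction; the poster's actual code is not shown). It explains the memory pressure: the full set must live in memory alongside the original list.

    def dedupe_with_set(items):
        # Builds a set of all 100M+ items in memory at once,
        # then materializes a new list; order is not preserved.
        return list(set(items))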

If you're willing to sort your list, then this is fairly trivial. Sort it first, then take the unique items. This is the same approach as sort | uniq in the shell, and it can be quite memory-efficient (an external sort can spill to disk; Python's built-in sort, of course, works in memory).
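A minimal sketch of the sort-then-unique idea, assuming the data fits in memory for Python's built-in sort (an external/on-disk sort would be needed otherwise):

    from itertools import groupby

    def sorted_unique(items):
        # Sorting makes equal items adjacent; groupby then yields one
        # (value, group) pair per run of equal values, like `sort | uniq`.
        items.sort()
        return [value for value, _group in groupby(items)]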

Itertools Recipes

import operator
from itertools import groupby

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBcCAD', str.lower) --> A B c A D
    return map(next, map(operator.itemgetter(1), groupby(iterable, key)))
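Combined with the sorting approach above, usage might look like the following sketch (assuming the list has already been sorted so that duplicates are adjacent):

    data = [3, 1, 2, 3, 2, 1]
    data.sort()                            # duplicates become adjacent
    deduped = list(unique_justseen(data))  # [1, 2, 3]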

Is there a reason you care that this is slow? If you need to perform this operation often, then something is wrong with the way you are handling your data.
