
How to remove duplicates in a huge Python list

I have a huge Python list, about 100 MB in size, containing strings and integers. Some of the strings appear in duplicate and triplicate. I have tried to remove the duplicates with this code:

from collections import OrderedDict

duplicates = [.......large size list of 100 MB....]

# fromkeys keeps only the first occurrence of each element
remove = OrderedDict.fromkeys(duplicates).keys()

print(remove)

This works fine with small lists, but with this large list it has been running for a whole day and is still not done. Any suggestions on how this can be done in minutes, or at least in fewer hours? I have tried installing CUDA on Ubuntu to work this out, but I keep getting errors: see here

Not sure if this is efficient enough, but one simple way to solve it is to cast your list to a set. Note that this does not preserve the original order, and on Python 3 a mixed list of strings and integers cannot be sorted back into a single sequence:

def unique(objects):
    # set() removes duplicates in O(n) average time, but does not preserve order
    return list(set(objects))
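
If the original order matters, a minimal sketch along the same lines is to use dict.fromkeys, which keeps the first occurrence of each element and preserves insertion order on Python 3.7+ (the unique_ordered name and the sample list here are just illustrative):

    def unique_ordered(objects):
        # dict.fromkeys keeps the first occurrence of each key, and dicts
        # preserve insertion order on Python 3.7+, so the order survives
        return list(dict.fromkeys(objects))

    print(unique_ordered(["b", 3, "a", "b", 3, "a"]))  # ['b', 3, 'a']

Either way, hash-based deduplication of a 100 MB list should take seconds rather than hours on typical hardware, so if the run still takes a whole day, the bottleneck may be printing the huge result rather than the deduplication itself.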
