
Find unique values across two very large lists in the most memory-efficient way

I have a very large list of lists (containing 13 sublists of ~41 million entries each, i.e. ~500 million entries in total, each one a short string). I need to take that list and find the union of two of the sublists, i.e. find all unique elements across them and save them into a new list in the most memory-efficient way. Ordering is not essential. One way would be:

c = a[1] + a[2]   # concatenation builds a temporary ~82-million-entry list
c = set(c)

But is that the most memory-efficient way? An added complication is that some of the entries in a[1] or a[2] may contain more than one element (i.e. look something like a[1] = [['val1'], ['val2','val3'], ...]). How would I best deal with that so that val2 and val3 show up as separate entries in the final result?
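A minimal sketch of one way to handle the nested entries, assuming each entry is either a single string or a list/tuple of strings (flattened here is a hypothetical helper, not a library function):

from itertools import chain

def flattened(seq):
    # yield each string, splitting one level of nesting
    for item in seq:
        if isinstance(item, (list, tuple)):
            yield from item
        else:
            yield item

c = set(chain(flattened(a[1]), flattened(a[2])))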

I'm not 100% sure this is the most memory-efficient way to do it, but I find it the simplest:

l3 = set(l1)    # build a set from the first list
l3.update(l2)   # add the second list's elements in place
l3 = list(l3)   # back to a list, if a list is really needed
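If the original lists are not needed afterwards, a variant that consumes them while building the set can keep peak memory lower; a sketch, assuming l1 and l2 may be emptied in the process:

l3 = set()
while l1:
    l3.add(l1.pop())   # pop() from the end is O(1) and shrinks l1 as we go
while l2:
    l3.add(l2.pop())
l3 = list(l3)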

This should not allocate more memory than necessary, although each membership test is a linear scan of l3, so it is quadratic in time:

l3 = []
for i in l1:
  if i not in l3:   # linear scan of l3 for every element
    l3.append(i)
for i in l2:
  if i not in l3:
    l3.append(i)

Sets are more efficient than numpy here for short strings. With these data:

import random

N = 13
M = 100000   # the real data has M ~ 41 million
# example data: N lists of M random 4-letter strings
ll = [["".join(random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ') for _ in range(4))
       for _ in range(M)] for _ in range(N)]

just do:

sets = [set(l) for l in ll]   # one set per sublist
# union of every pair of sublists, as lists
res = [[list(sets[i] | sets[j]) for j in range(i + 1, N)] for i in range(N)]

For M = 41,000,000 it will take a few minutes to produce the result, provided memory is not a problem.
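For just the two sublists from the original question there is no need to build all N sets first; set.union accepts any iterable, so the second sublist never needs its own set:

c = list(set(ll[1]).union(ll[2]))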
