I have a very large list of lists (13 lists of ~41 million entries each, i.e. ~500 million entries in total, each one a short string). I need to take that list and find the union of two of the sublists, i.e. find all unique elements across them, and save them into a new list in the most memory-efficient way. Ordering is not essential. One way would be:
c = a[1] + a[2]
c = set(c)
But is that the most memory-efficient way? An added complication is the fact that some of the entries in a[1] or a[2] may contain more than one element (i.e. look something like a[1] = [['val1'], ['val2','val3'], ...]). How would I best deal with that so that val2 and val3 show up as separate entries in the final result?
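For concreteness, here is a toy version of the nested case (with hypothetical values) and the set of unique entries I'm after:

```python
a1 = [['val1'], ['val2', 'val3']]  # each entry is itself a (possibly multi-element) list
a2 = [['val3'], ['val4']]

# the result I want: every inner value as a separate entry, duplicates removed
wanted = {'val1', 'val2', 'val3', 'val4'}
```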
I wouldn't be 100% sure that this is the most memory-efficient way to do it, but I'd find it simplest:
l3 = set(l1)
l3.update(l2)
l3 = list(l3)
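To also cover the nested-entry case from the question, the same set/update approach works if each sublist is flattened first. A sketch, where flatten is a helper I'm introducing here (not part of the question's code):

```python
def flatten(seq):
    """Yield the items of seq, expanding any entry that is itself a list."""
    for item in seq:
        if isinstance(item, list):
            yield from item
        else:
            yield item

l1 = ['val1', ['val2', 'val3']]  # mixed plain strings and sub-lists
l2 = [['val3'], 'val4']

l3 = set(flatten(l1))   # the generator avoids building a concatenated copy
l3.update(flatten(l2))
l3 = list(l3)
```

Because flatten is a generator, no intermediate concatenated list is ever materialized; only the final set and list are allocated.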
This should not allocate more memory than necessary:

l3 = []
for i in l1:
    if i not in l3:
        l3.append(i)
for i in l2:
    if i not in l3:
        l3.append(i)
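One caveat with the loop above: i not in l3 scans the whole list each time, so the total work is quadratic in the number of entries. A sketch of the same idea with an auxiliary set for O(1) average-time membership tests (it keeps insertion order, at the cost of one extra set):

```python
from itertools import chain

l1 = ['a', 'b', 'c']
l2 = ['b', 'd']

seen = set()
l3 = []
for i in chain(l1, l2):   # chain avoids building the temporary list l1 + l2
    if i not in seen:     # O(1) average, vs O(len(l3)) for a list scan
        seen.add(i)
        l3.append(i)
```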
Sets are more efficient than numpy here for short strings. With this data:
from random import randint

N = 13
M = 100000
# example data: N lists of M random 4-letter strings
ll = [["".join('ABCDEFGHIJKLMNOPQRSTUVWXYZ'[randint(0, 25)]
               for k in range(4)) for l in range(M)] for h in range(N)]
just do:
sets = [set(l) for l in ll]
res = [[list(sets[i] | sets[j]) for j in range(i + 1, N)] for i in range(N)]
It will take a few minutes to get the result for M = 41,000,000, if memory size is not a problem.
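A quick sanity check of the pairwise-union idea on toy data (tiny N and M instead of the sizes above): res[i][j - i - 1] holds the union of sublists i and j.

```python
ll = [['ab', 'cd'], ['cd', 'ef'], ['gh']]  # toy stand-in for the real data
N = len(ll)

sets = [set(l) for l in ll]
res = [[list(sets[i] | sets[j]) for j in range(i + 1, N)] for i in range(N)]

# res[0][0] is the union of ll[0] and ll[1]
```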