
Find unique values across two very large lists in the most memory-efficient way

I have a very large list of lists (containing 13 sublists of ~41 million entries each, i.e. ~500 million entries in total, each one a short string). I need to take that list and find the union of two of the sublists, i.e. find all unique elements across them and save them into a new list in the most memory-efficient way. Ordering is not essential. One way would be:

c = a[1] + a[2]   # concatenation builds a temporary ~82-million-entry list
c = set(c)

But is that the most memory-efficient way? An added complication is that some of the entries in a[1] or a[2] may contain more than one element (i.e. look something like a[1] = [['val1'], ['val2','val3'], ...]). How would I best deal with that so that val2 and val3 show up as separate entries in the final result?
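A minimal sketch of one way to handle the nested entries, assuming each entry is either a single string or a list/tuple of strings (flattened here is a hypothetical helper, not a library function):

from itertools import chain

def flattened(seq):
    # yield each string, splitting one level of nesting
    for item in seq:
        if isinstance(item, (list, tuple)):
            yield from item
        else:
            yield item

c = set(chain(flattened(a[1]), flattened(a[2])))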

I'm not 100% sure this is the most memory-efficient way to do it, but I find it the simplest:

l3 = set(l1)    # build a set from the first list
l3.update(l2)   # add the second list's elements in place
l3 = list(l3)   # back to a list, if a list is really needed
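If the original lists are not needed afterwards, a variant that consumes them while building the set can keep peak memory lower; a sketch, assuming l1 and l2 may be emptied in the process:

l3 = set()
while l1:
    l3.add(l1.pop())   # pop() from the end is O(1) and shrinks l1 as we go
while l2:
    l3.add(l2.pop())
l3 = list(l3)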

This should not allocate more memory than necessary, although each membership test is a linear scan of l3, so it is quadratic in time:

l3 = []
for i in l1:
  if i not in l3:   # linear scan of l3 for every element
    l3.append(i)
for i in l2:
  if i not in l3:
    l3.append(i)

Sets are more efficient than numpy here for short strings. With these data:

import random

N = 13
M = 100000   # the real data has M ~ 41 million
# example data: N lists of M random 4-letter strings
ll = [["".join(random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ') for _ in range(4))
       for _ in range(M)] for _ in range(N)]

just do:

sets = [set(l) for l in ll]   # one set per sublist
# union of every pair of sublists, as lists
res = [[list(sets[i] | sets[j]) for j in range(i + 1, N)] for i in range(N)]

For M = 41,000,000 it will take a few minutes to produce the result, provided memory is not a problem.
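For just the two sublists from the original question there is no need to build all N sets first; set.union accepts any iterable, so the second sublist never needs its own set:

c = list(set(ll[1]).union(ll[2]))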
