
The most efficient way to compare two huge arrays of objects

I want to compare two huge arrays. I'm reading them in batches (10 objects at a time from each array). Once both arrays have been read completely, I want three results: the intersection of the two arrays, the objects that exist only in the first array, and the objects that exist only in the second array. What is the best practice for doing that?

A small-scale example:

let arr1 = [ obj1, obj2, obj3, obj4, obj5, obj6, obj7];

let arr2 = [ obj7, obj2, obj5, obj1, obj9, obj8];

Then I will read the two arrays in batches (two elements at a time):

First loop:

-> obj2 is mutual

-> obj1 exists in arr1 only

-> obj7 exists in arr2 only

The issue is that this is not the final result. I only get the correct result after looping over both arrays completely, which is:

The mutual objects are obj1,obj2,obj5,obj7

Objects in arr1 only are obj3,obj4,obj6

Objects in arr2 only are obj8,obj9

Note: I have to read the arrays in batches because they are too big.

To compare your arrays efficiently, you need to impose some organization on them first. This is true whether or not the arrays are too big to fit into memory.

Conventionally, there are two choices: either sort the objects in each array and compare them in order, or hash the objects in each array and compare them with a hash map.

Each method has techniques for handling data too big to fit in memory. For sorting, there are "external" sorting algorithms that are not limited by memory size, after which the two sorted inputs can be compared by streaming through them in order. For hashing, you can partition the data (according to hash) into bins small enough to process in memory.
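To illustrate the sorting route, here is a minimal sketch of the streaming comparison step. It assumes both inputs are already sorted in ascending order (for example by an external sort), contain no duplicates, and that items can be compared with `==` and `<`. The function name `compare_sorted` and the tag strings are hypothetical, chosen only for this example.

```python
def compare_sorted(first, second):
    """Yield (tag, item) pairs from two sorted, duplicate-free iterables.

    tag is "both", "first_only" or "second_only". Only one item from
    each input is held in memory at a time.
    """
    end = object()  # sentinel marking an exhausted input
    it_a, it_b = iter(first), iter(second)
    x, y = next(it_a, end), next(it_b, end)
    while x is not end and y is not end:
        if x == y:
            yield ("both", x)
            x, y = next(it_a, end), next(it_b, end)
        elif x < y:
            yield ("first_only", x)
            x = next(it_a, end)
        else:
            yield ("second_only", y)
            y = next(it_b, end)
    # Whatever remains in either input has no counterpart in the other.
    while x is not end:
        yield ("first_only", x)
        x = next(it_a, end)
    while y is not end:
        yield ("second_only", y)
        y = next(it_b, end)


# The question's example, with strings standing in for the objects.
arr1 = sorted(["obj1", "obj2", "obj3", "obj4", "obj5", "obj6", "obj7"])
arr2 = sorted(["obj7", "obj2", "obj5", "obj1", "obj9", "obj8"])
for tag, item in compare_sorted(arr1, arr2):
    print(tag, item)
```

Because the function accepts any iterables, the two arguments could just as well be generators that read the sorted data from disk in batches.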


As an example, consider this Python-like pseudocode for hash-binning your data items:

# split data into bins
files = []
for i in 0 .. N-1:
    files.push_back(open_for_write("{filename}_bin{i}"))
for item in read_items(open_for_read(filename)):
    bin = item.hash() mod N
    write_item(item, files[bin])
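The binning pseudocode above could look roughly like this in real Python. This is only a sketch under some assumptions: each item is represented by one line of text (for example a unique key or a serialized record), and the names `stable_bin` and `partition_into_bins` are hypothetical. It uses `hashlib` rather than the built-in `hash()`, because Python randomizes string hashes per process, and both inputs must land in the same bins even if they are partitioned in separate runs.

```python
import hashlib


def stable_bin(key: str, n_bins: int) -> int:
    """Map a key to a bin index that is identical across processes."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bins


def partition_into_bins(lines, prefix: str, n_bins: int):
    """Write each item to the file "{prefix}_bin{i}" chosen by its hash.

    `lines` may be any iterable of strings, so it can stream from disk.
    Returns the list of bin file paths.
    """
    paths = [f"{prefix}_bin{i}" for i in range(n_bins)]
    files = [open(path, "w", encoding="utf-8") for path in paths]
    try:
        for line in lines:
            key = line.rstrip("\n")
            files[stable_bin(key, n_bins)].write(key + "\n")
    finally:
        for f in files:
            f.close()
    return paths
```

Equal items always hash to the same bin index, so an item from the first input can only ever match items in the same-numbered bin of the second input.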

You can do this for both your input files, then process them by bin:

# compare by bin
outfile = open_for_write(out_filename)
for i in 0 .. N-1:
    items = new_set()
    for item in read_items(open_for_read("{in_filename_1}_bin{i}")):
        items.insert(item)
    for item in read_items(open_for_read("{in_filename_2}_bin{i}")):
        if item in items:
            write_item(item, outfile)
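The pseudocode above writes out only the items common to both inputs, while the question also asks for the items unique to each side. A possible way to get all three results for one pair of bins is sketched below. It assumes the bin files hold one item per line and that a single bin from each input fits in memory; the function name `compare_bin` is hypothetical.

```python
def compare_bin(path_1: str, path_2: str):
    """Compare one bin of the first input with the matching bin of the second.

    Returns three sets: (both, first_only, second_only).
    """
    # Load the first input's bin into a set for constant-time lookups.
    with open(path_1, encoding="utf-8") as f:
        first = {line.rstrip("\n") for line in f}

    both, second_only = set(), set()
    # Stream the second input's bin and classify each item.
    with open(path_2, encoding="utf-8") as f:
        for line in f:
            key = line.rstrip("\n")
            if key in first:
                both.add(key)
            else:
                second_only.add(key)

    # Anything from the first bin that never matched is unique to it.
    first_only = first - both
    return both, first_only, second_only
```

Calling this once per bin index and appending each of the three sets to its own output file gives the complete intersection and both differences, since matching items can only occur in same-numbered bins.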
