
Efficient way to find differences between two large sets in Java

In my case, I need to compare two large HashSets to find the differences using removeAll. To do that, I have to bring all the data from different data sources into memory and then do the comparison. This causes an out-of-memory issue when each HashSet may contain over 3 million records. Are there any approaches or libraries that consume less memory but achieve the same result?

Note that if the data is sorted, you can compute the differences while streaming the data in a single pass, using very little extra memory:

i <- 0
j <- 0
while i < list1.size() and j < list2.size():
    if list1[i] == list2[j]:        // in both lists: not a difference
        i <- i+1
        j <- j+1
    else if list1[i] < list2[j]:    // list1[i] is definitely not in list2
        yield list1[i]
        i <- i+1
    else:                           // list2[j] is definitely not in list1
        yield list2[j]
        j <- j+1
yield all elements in list1 from i to list1.size(), if any
yield all elements in list2 from j to list2.size(), if any
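
As a concrete Java rendering of this merge pass, here is a minimal sketch; the Iterator<Long> inputs and the Consumer callbacks are assumptions made for the example, not part of the original answer:

import java.util.Iterator;
import java.util.function.Consumer;

// Single merge-style pass over two sorted, duplicate-free streams.
// Elements present in only one stream are reported via the callbacks.
static void sortedDiff(Iterator<Long> it1, Iterator<Long> it2,
                       Consumer<Long> onlyInFirst, Consumer<Long> onlyInSecond) {
    Long a = it1.hasNext() ? it1.next() : null;
    Long b = it2.hasNext() ? it2.next() : null;
    while (a != null && b != null) {
        int cmp = a.compareTo(b);
        if (cmp == 0) {           // in both streams: not a difference
            a = it1.hasNext() ? it1.next() : null;
            b = it2.hasNext() ? it2.next() : null;
        } else if (cmp < 0) {     // a is definitely not in the second stream
            onlyInFirst.accept(a);
            a = it1.hasNext() ? it1.next() : null;
        } else {                  // b is definitely not in the first stream
            onlyInSecond.accept(b);
            b = it2.hasNext() ? it2.next() : null;
        }
    }
    while (a != null) { onlyInFirst.accept(a);  a = it1.hasNext() ? it1.next() : null; }
    while (b != null) { onlyInSecond.accept(b); b = it2.hasNext() ? it2.next() : null; }
}

Only the two stream cursors are held in memory, so this scales to arbitrarily large sources, provided both are sorted the same way.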

Another alternative using hashing requires loading only one list (assuming here the data are sets, as mentioned in the question, so no dupe handling is needed):

load list1 as hash1
for each x in list2:
    if x is in hash1:
         hash1.remove(x)
    else:
         yield x
yield all remaining elements in hash1
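
A minimal Java sketch of this second approach follows; MyRecord (the record type used in the last answer below) and the Consumer callback are stand-ins for the example, not code from the original answer:

import java.util.Iterator;
import java.util.Set;
import java.util.function.Consumer;

// list1 is fully loaded as a hash set; list2 is only streamed.
static void hashDiff(Set<MyRecord> hash1, Iterator<MyRecord> list2,
                     Consumer<MyRecord> yield) {
    while (list2.hasNext()) {
        MyRecord x = list2.next();
        if (!hash1.remove(x)) {   // x is not in list1: part of the difference
            yield.accept(x);
        }
    }
    hash1.forEach(yield);         // whatever remains was only in list1
}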

Note that you can also split the data and apply this second approach iteratively if even one list does not fit in memory on its own.

What you need, going by your description, is a hash join, as used in databases: see "What is the difference between a hash join and a merge join (Oracle RDBMS)?"

In short, to reduce memory consumption you could partition your data by hash value. Very basic example: take one hash frame, i.e. the records whose hashes fall between some values h1 and h2, from both sets and compare them. Then compare the objects with hashes between h2 and h3, and so on. These boundaries h1, h2, ..., hN might be easy to find just by computing

h[i] = Integer.MIN_VALUE + i * (((long) Integer.MAX_VALUE - Integer.MIN_VALUE + 1) / N);

or not - it depends on the data and on the hash function you have.

This solution requires O(DB_SIZE / N) memory and O(DB_SIZE * N) record-fetching operations. So with N = 4 it would scan each source 4 times and reduce memory consumption by a factor of 4.
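
A hedged sketch of such a partitioned comparison in Java; readFirstTable(), readSecondTable() and report() are hypothetical helpers (the first two in the spirit of the answer below), and the hash range is split into N slices:

import java.util.HashSet;
import java.util.Set;

// Compare the two sources in N passes, holding in memory only the
// records whose hashCode falls in the current slice [lo, hi).
static void comparePartitioned(int N) {
    long span = (long) Integer.MAX_VALUE - Integer.MIN_VALUE + 1; // 2^32
    for (int pass = 0; pass < N; pass++) {
        long lo = Integer.MIN_VALUE + pass * (span / N);
        long hi = (pass == N - 1) ? Integer.MAX_VALUE + 1L : lo + span / N;
        Set<MyRecord> partition = new HashSet<>();
        for (MyRecord r : readFirstTable()) {
            int h = r.hashCode();
            if (h >= lo && h < hi) partition.add(r);
        }
        for (MyRecord r : readSecondTable()) {
            int h = r.hashCode();
            if (h >= lo && h < hi && !partition.remove(r)) {
                report(r);        // present only in the second source
            }
        }
        for (MyRecord r : partition) {
            report(r);            // leftovers were only in the first source
        }
    }
}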

You can first filter out records whose MyRecord.hashCode() values do not occur in both tables, using only a Set<Integer>, and then compare the surviving records in a Set<MyRecord>:

// Determine common hashCodes:

Set<Integer> hashCodes = new HashSet<>();
for (MyRecord record : readFirstTable()) {
    hashCodes.add(record.hashCode());
}

Set<Integer> commonHashCodes = new HashSet<>();
for (MyRecord record : readSecondTable()) {
    int hashCode = record.hashCode();
    if (hashCodes.remove(hashCode)) {
        commonHashCodes.add(hashCode);
    }
}
hashCodes = null; // no longer needed; let it be garbage collected

// Determine common records:

Set<MyRecord> records = new HashSet<>();
for (MyRecord record : readFirstTable()) {
    if (commonHashCodes.contains(record.hashCode())) {
        records.add(record);
    }
}
Set<MyRecord> commonRecords = new HashSet<>();
for (MyRecord record : readSecondTable()) {
    if (records.remove(record)) {
        commonRecords.add(record);
    }
}
commonHashCodes = null;
records = null;
return commonRecords; // the differences are then the records of each table not in this set
