
Efficient way to find differences between two large sets in Java

In my case, I need to compare two large HashSets to find the differences using removeAll. To do that, I have to bring all the data from different data sources into memory and then do the comparison. This causes an out-of-memory issue when each HashSet may contain over 3 million records. Are there any approaches or libraries that consume less memory but achieve the same result?

Note that if the data is sorted, you can compute the differences while streaming the data in a single pass, using very little extra memory:

i <- 0
j <- 0
while i < list1.size() and j < list2.size():
    if list1[i] == list2[j]:        // in both lists: not a difference
        i <- i+1
        j <- j+1
    else if list1[i] < list2[j]:    // list1[i] is definitely not in list2
        yield list1[i]
        i <- i+1
    else:                           // list2[j] is definitely not in list1
        yield list2[j]
        j <- j+1
yield all elements in list1 from i to list1.size(), if any
yield all elements in list2 from j to list2.size(), if any
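
As a concrete Java rendering of this merge pass, here is a minimal sketch; the Iterator<Long> inputs and the Consumer callbacks are assumptions made for the example, not part of the original answer:

import java.util.Iterator;
import java.util.function.Consumer;

// Single merge-style pass over two sorted, duplicate-free streams.
// Elements present in only one stream are reported via the callbacks.
static void sortedDiff(Iterator<Long> it1, Iterator<Long> it2,
                       Consumer<Long> onlyInFirst, Consumer<Long> onlyInSecond) {
    Long a = it1.hasNext() ? it1.next() : null;
    Long b = it2.hasNext() ? it2.next() : null;
    while (a != null && b != null) {
        int cmp = a.compareTo(b);
        if (cmp == 0) {           // in both streams: not a difference
            a = it1.hasNext() ? it1.next() : null;
            b = it2.hasNext() ? it2.next() : null;
        } else if (cmp < 0) {     // a is definitely not in the second stream
            onlyInFirst.accept(a);
            a = it1.hasNext() ? it1.next() : null;
        } else {                  // b is definitely not in the first stream
            onlyInSecond.accept(b);
            b = it2.hasNext() ? it2.next() : null;
        }
    }
    while (a != null) { onlyInFirst.accept(a);  a = it1.hasNext() ? it1.next() : null; }
    while (b != null) { onlyInSecond.accept(b); b = it2.hasNext() ? it2.next() : null; }
}

Only the two stream cursors are held in memory, so this scales to arbitrarily large sources, provided both are sorted the same way.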

Another alternative using hashing requires loading only one list (assuming here the data are sets, as mentioned in the question, so no dupe handling is needed):

load list1 as hash1
for each x in list2:
    if x is in hash1:
         hash1.remove(x)
    else:
         yield x
yield all remaining elements in hash1
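
A minimal Java sketch of this second approach follows; MyRecord (the record type used in the last answer below) and the Consumer callback are stand-ins for the example, not code from the original answer:

import java.util.Iterator;
import java.util.Set;
import java.util.function.Consumer;

// list1 is fully loaded as a hash set; list2 is only streamed.
static void hashDiff(Set<MyRecord> hash1, Iterator<MyRecord> list2,
                     Consumer<MyRecord> yield) {
    while (list2.hasNext()) {
        MyRecord x = list2.next();
        if (!hash1.remove(x)) {   // x is not in list1: part of the difference
            yield.accept(x);
        }
    }
    hash1.forEach(yield);         // whatever remains was only in list1
}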

Note that you can also split the data and apply this second approach iteratively if even one list does not fit in memory on its own.

What you need, going by your description, is a hash join, as used in databases: see "What is the difference between a hash join and a merge join (Oracle RDBMS)?"

In short, to reduce memory consumption you could partition your data by hash value. Very basic example: take one hash frame, i.e. the records whose hashes fall between some values h1 and h2, from both sets and compare them. Then compare the objects with hashes between h2 and h3, and so on. These boundaries h1, h2, ..., hN might be easy to find just by computing

h[i] = Integer.MIN_VALUE + i * (((long) Integer.MAX_VALUE - Integer.MIN_VALUE + 1) / N);

or not - it depends on the data and on the hash function you have.

This solution requires O(DB_SIZE / N) memory and O(DB_SIZE * N) record-fetching operations. So with N = 4 it would scan each source 4 times and reduce memory consumption by a factor of 4.
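
A hedged sketch of such a partitioned comparison in Java; readFirstTable(), readSecondTable() and report() are hypothetical helpers (the first two in the spirit of the answer below), and the hash range is split into N slices:

import java.util.HashSet;
import java.util.Set;

// Compare the two sources in N passes, holding in memory only the
// records whose hashCode falls in the current slice [lo, hi).
static void comparePartitioned(int N) {
    long span = (long) Integer.MAX_VALUE - Integer.MIN_VALUE + 1; // 2^32
    for (int pass = 0; pass < N; pass++) {
        long lo = Integer.MIN_VALUE + pass * (span / N);
        long hi = (pass == N - 1) ? Integer.MAX_VALUE + 1L : lo + span / N;
        Set<MyRecord> partition = new HashSet<>();
        for (MyRecord r : readFirstTable()) {
            int h = r.hashCode();
            if (h >= lo && h < hi) partition.add(r);
        }
        for (MyRecord r : readSecondTable()) {
            int h = r.hashCode();
            if (h >= lo && h < hi && !partition.remove(r)) {
                report(r);        // present only in the second source
            }
        }
        for (MyRecord r : partition) {
            report(r);            // leftovers were only in the first source
        }
    }
}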

You can first filter out records whose MyRecord.hashCode() values do not occur in both tables, using only a Set<Integer>, and then compare the surviving records in a Set<MyRecord>:

// Determine common hashCodes:

Set<Integer> hashCodes = new HashSet<>();
for (MyRecord record : readFirstTable()) {
    hashCodes.add(record.hashCode());
}

Set<Integer> commonHashCodes = new HashSet<>();
for (MyRecord record : readSecondTable()) {
    int hashCode = record.hashCode();
    if (hashCodes.remove(hashCode)) {
        commonHashCodes.add(hashCode);
    }
}
hashCodes = null; // no longer needed; let it be garbage collected

// Determine common records:

Set<MyRecord> records = new HashSet<>();
for (MyRecord record : readFirstTable()) {
    if (commonHashCodes.contains(record.hashCode())) {
        records.add(record);
    }
}
Set<MyRecord> commonRecords = new HashSet<>();
for (MyRecord record : readSecondTable()) {
    if (records.remove(record)) {
        commonRecords.add(record);
    }
}
commonHashCodes = null;
records = null;
return commonRecords; // the differences are then the records of each table not in this set
