What is the fastest way to find orphans between two large (size ~900K) Vectors of Strings in Java?

Question

I'm currently working on a Java program that is required to handle large amounts of data. I have two Vectors...

        Vector collectionA = new Vector();
        Vector collectionB = new Vector();

...and both of them will contain around 900,000 elements during processing.

I need to find all items in collectionB that are not contained in collectionA. Right now, this is how I'm doing it:

        for (int i=0;i<collectionA.size();i++) {
            if(!collectionB.contains(collectionA.elementAt(i))){
                // do stuff if orphan is found
            }
        }

But this causes the program to run for many hours, which is unacceptable.

Is there any way to tune this so that I can cut my running time significantly?

I think I've read once that using ArrayList instead of Vector is faster. Would using ArrayLists instead of Vectors help for this issue?

Answer 1

Use a HashSet for the lookups.

Explanation:

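Currently your program has to test every item in collectionB to see if it is equal to the item in collectionA that it is currently handling (the contains() method of a Vector checks each item in turn). With about 900,000 elements on each side, that is up to roughly 900,000 × 900,000 comparisons in the worst case.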

You should do:

// Build the lookup set once, before the loop.
Set<String> set = new HashSet<String>(collectionB);

for (Iterator i = collectionA.iterator(); i.hasNext(); ) {
  if (!set.contains(i.next())) {
    // handle
  }
}

Using the HashSet will help, because the set will calculate a hash for each element and store the element in a bucket associated with a range of hash values. When checking whether an item is in the set, the hash value of the item will directly identify the bucket the item should be in. Now only the items in that bucket have to be checked.
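For reference, here is a self-contained sketch of the same idea that you can run as-is. The class name OrphanFinder and the method findOrphans are made up for illustration, and it follows the loop in the question (elements of collectionA that are missing from collectionB), so swap the arguments if you need the other direction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Vector;

// Hypothetical helper class, not part of the original answer.
class OrphanFinder {

    // Returns the elements of 'source' that do not occur in 'other', in source order.
    static List<String> findOrphans(Collection<String> source, Collection<String> other) {
        // Build the hash-based lookup once.
        Set<String> lookup = new HashSet<>(other);
        List<String> orphans = new ArrayList<>();
        for (String item : source) {
            // contains() hashes the item and only inspects the matching bucket.
            if (!lookup.contains(item)) {
                orphans.add(item);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        Vector<String> collectionA = new Vector<>(Arrays.asList("a", "b", "c", "d"));
        Vector<String> collectionB = new Vector<>(Arrays.asList("b", "d", "e"));
        System.out.println(findOrphans(collectionA, collectionB)); // prints [a, c]
    }
}
```

Building the set costs one pass over the second collection, after which each lookup takes constant time on average, so the total work grows roughly linearly with the input size instead of quadratically.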

Using a SortedSet like TreeSet would also be an improvement over Vector, since to find the item, only the position it would be in has to be checked (a lookup follows a single path through the sorted tree, O(log n) comparisons), instead of all positions. Which Set implementation would perform best depends on the data.

Answer 2

If ordering of the elements doesn't matter, I would go for HashSets, and do it as follows:

Set<String> a = new HashSet<>();
Set<String> b = new HashSet<>();

// ...

b.removeAll(a);

So in essence, you're removing from set b all the elements that are in set a, leaving the asymmetric set difference. Note that the removeAll method does modify set b, so if that's not what you want, you would need to make a copy first.
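If you need to keep b intact, a minimal sketch of the copy-first variant could look like this. The class SetDifference and the method difference are names I made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class SetDifference {

    // Returns the elements of b that are not in a, without modifying either set.
    static Set<String> difference(Set<String> b, Set<String> a) {
        Set<String> result = new HashSet<>(b); // copy, so b stays untouched
        result.removeAll(a);
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("x", "y"));
        Set<String> b = new HashSet<>(Arrays.asList("y", "z"));
        System.out.println(difference(b, a)); // prints [z]
        System.out.println(b.size());         // prints 2, b is unchanged
    }
}
```

The copy costs one extra pass over b and the memory for a second set, which is usually acceptable at this size.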

To find out whether HashSet or TreeSet is more efficient for this type of operation, I ran the code below with both types, and used Guava's Stopwatch to measure execution time.

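import com.google.common.base.Stopwatch;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import org.junit.Test;

@Test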
public void perf() {
    Set<String> setA = new HashSet<>();
    Set<String> setB = new HashSet<>();

    for (int i=0; i < 900000; i++) {
        String uuidA = UUID.randomUUID().toString();
        String uuidB = UUID.randomUUID().toString();

        setA.add(uuidA);
        setB.add(uuidB);
    }

    Stopwatch stopwatch = Stopwatch.createStarted();
    setB.removeAll(setA);

    System.out.println(stopwatch.elapsed(TimeUnit.MILLISECONDS));
}

On my modest development machine, using Oracle JDK 7, the TreeSet variant is about 4 times slower (~450ms) than the HashSet variant (~105ms).
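If you want to repeat the comparison without JUnit or Guava, a rough sketch using System.nanoTime could look like the following. The class RemoveAllTiming and its methods are hypothetical names, and this is a single-run measurement with no JIT warm-up, so treat the numbers as a ballpark only. A harness such as JMH would give more reliable results.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.UUID;

// Hypothetical timing helper, not the code used for the numbers above.
class RemoveAllTiming {

    // Fills 'set' with 'count' random UUID strings.
    static void fill(Set<String> set, int count) {
        for (int i = 0; i < count; i++) {
            set.add(UUID.randomUUID().toString());
        }
    }

    // Runs setB.removeAll(setA) and returns the elapsed time in milliseconds.
    static long timeRemoveAll(Set<String> setA, Set<String> setB) {
        long start = System.nanoTime();
        setB.removeAll(setA);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 900000;

        Set<String> hashA = new HashSet<>();
        Set<String> hashB = new HashSet<>();
        fill(hashA, n);
        fill(hashB, n);
        System.out.println("HashSet: " + timeRemoveAll(hashA, hashB) + " ms");

        Set<String> treeA = new TreeSet<>();
        Set<String> treeB = new TreeSet<>();
        fill(treeA, n);
        fill(treeB, n);
        System.out.println("TreeSet: " + timeRemoveAll(treeA, treeB) + " ms");
    }
}
```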
