
How much faster will the .contains() method be for a Hashtable<ArrayList<String>, Boolean> than for an ArrayList<ArrayList<String>>?

Basically, I am doing the following:

  • Dumping every row of data from a DB table, as Strings, into an ArrayList<ArrayList<String>>.
  • Doing the same thing for another DB table.

  • Finding each row (an ArrayList<String>) of the second table in the first one by iterating over b and calling a.contains(b.get(i)). If contains returns true, I call a.remove(b.get(i)).

Now, how much faster would it be if I instead used a Hashtable<ArrayList<String>, Boolean> for the ArrayList mentioned above, calling a.containsKey(i.getKey()), where i is an iterator over b, and then removing via i.remove()? Would the speedup be large enough to justify the change?

Also, would using a HashMap be more prudent? If so, why?

My bottom-up answer:

  • The difference between Hashtable and HashMap has been (thoroughly) discussed in Differences between HashMap and Hashtable?. Short summary: HashMap is unsynchronized and more efficient, and should be used instead of Hashtable.

  • Finding data in a hash data structure (the contains() and remove() operations) runs in expected constant time, O(1): with a decent hash function, the cost of a lookup is essentially independent of the number of elements in the structure. Whether it holds 4 elements or 4 million, a single lookup costs roughly the same X time (HashMap keeps this true by resizing itself as it fills up). So the data access time of hash structures barely grows at all.
    Finding data in a list is of the order O(N) - that is, directly proportional to the number of elements in the list: 1 element takes Y time, 2 elements take 2Y time, 4 elements take 4Y time, and so on. The time consumption grows linearly with the size of the list.

  • So: if you have to find a large number of elements randomly in a data structure, a hash data structure is the best choice, as long as:
    - the data has a decent hashCode() implementation (the one for ArrayList is OK)
    - the data has hashCode() and equals() implementations that match each other, i.e. if a.equals(b) then a.hashCode() == b.hashCode(). This is also true for ArrayList.

  • If, on the other hand, you're working with ordered data, there are other algorithms that can reduce the search and removal time substantially. If the data in the database is indexed, it may be worthwhile to use ORDER BY when fetching it and then apply an algorithm for ordered data, as sketched below.
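
For illustration, here is a minimal sketch of such an ordered-data approach, assuming both tables are fetched with ORDER BY over all columns so the rows arrive sorted; the comparator and the method name are made up for this example:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical lexicographic comparator over the String columns of a row
private static final Comparator<List<String>> ROW_ORDER = new Comparator<List<String>>() {
    public int compare(List<String> r1, List<String> r2) {
        int min = Math.min(r1.size(), r2.size());
        for (int k = 0; k < min; k++) {
            int c = r1.get(k).compareTo(r2.get(k));
            if (c != 0) {
                return c;
            }
        }
        return r1.size() - r2.size();
    }
};

// Removes from sortedA every row that also occurs in sortedB.
// Assumes both lists are sorted by the same ordering (e.g. via ORDER BY).
// Needs O(n + m) comparisons instead of O(n * m).
private static List<List<String>> diffSorted(List<List<String>> sortedA,
                                             List<List<String>> sortedB,
                                             Comparator<List<String>> order) {
    List<List<String>> result = new ArrayList<List<String>>();
    int i = 0, j = 0;
    while (i < sortedA.size()) {
        if (j >= sortedB.size()) {
            result.add(sortedA.get(i++));     // b is exhausted: keep the rest of a
        } else {
            int c = order.compare(sortedA.get(i), sortedB.get(j));
            if (c < 0) {
                result.add(sortedA.get(i++)); // row only in a: keep it
            } else if (c > 0) {
                j++;                          // row only in b: skip it
            } else {
                i++;                          // row also occurs in b: drop it
            }
        }
    }
    return result;
}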

To summarize: use HashMap instead of ArrayList for list a.
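
To make that concrete, here is a minimal sketch of the change, with hypothetical variables rowsA and rowsB standing for the rows already loaded from the two tables. Note that the containsKey check can be dropped entirely: Map.remove is simply a no-op when the key is absent.

// Build the map once from the first table's rows: O(n).
// Caveat: duplicate rows in rowsA collapse to a single key.
Map<List<String>, Boolean> a = new HashMap<List<String>, Boolean>();
for (List<String> row : rowsA) {
    a.put(row, Boolean.TRUE);
}

// Remove every row of the second table from a:
// expected O(1) per row, O(m) in total.
for (List<String> row : rowsB) {
    a.remove(row); // no containsKey needed; remove ignores absent keys
}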

I wrote a small program to benchmark the problem. Results first: the program ran on a Sun JVM 1.6.0_41 for Windows 7, 32-bit, on a 2.40 GHz Core i5 CPU. Printout:

For 1000 words: List: 1 ms, Map: 2 ms
For 5000 words: List: 15 ms, Map: 12 ms
For 10000 words: List: 57 ms, Map: 12 ms
For 20000 words: List: 217 ms, Map: 37 ms
For 30000 words: List: 485 ms, Map: 45 ms
For 50000 words: List: 1365 ms, Map: 61 ms

The performance characteristics reveal themselves pretty well in a simple test like this. I ran the map version with more data and got the following:

For 100000 words: List: - ms, Map: 166 ms
For 500000 words: List: - ms, Map: 1130 ms
For 1000000 words: List: - ms, Map: 3540 ms

Finally the benchmarking code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public void benchmarkListVersusMap() {
    for (int count : new int[]{1000, 5000, 10000, 20000, 30000, 50000}) {
        // Generate random sample data
        List<List<String>> words = generateData(count, 10, count);

        // Create ArrayList
        List<List<String>> list = new ArrayList<List<String>>();
        list.addAll(words);

        // Create HashMap
        Map<List<String>, Boolean> map = new HashMap<List<String>, Boolean>();
        for (List<String> row : words) {
            map.put(row, true);
        }

        // Measure:
        long timer = System.currentTimeMillis();
        for (List<String> row : words) {
            if (list.contains(row)) {
                list.remove(row);
            }
        }
        long listTime = System.currentTimeMillis() - timer;
        timer = System.currentTimeMillis();
        for (List<String> row : words) {
            if (map.containsKey(row)) {
                map.remove(row);
            }
        }
        long mapTime = System.currentTimeMillis() - timer;
        System.out.printf("For %s words: List: %s ms, Map: %s ms\n", count, listTime, mapTime);
    }
}

private List<List<String>> generateData(int rows, int cols, int noOfDifferentWords) {
    List<List<String>> list = new ArrayList<List<String>>(rows);
    List<String> dictionary = generateRandomWords(noOfDifferentWords);
    Random rnd = new Random();
    for (int row = 0; row < rows; row++) {
        List<String> l2 = new ArrayList<String>(cols);
        for (int col = 0; col < cols; col++) {
            l2.add(dictionary.get(rnd.nextInt(noOfDifferentWords)));
        }
        list.add(l2);
    }
    return list;
}

private static final String CHARS = "abcdefghijklmnopqrstuvwxyz0123456789";
private List<String> generateRandomWords(int count) {
    Random rnd = new Random();
    List<String> list = new ArrayList<String>(count);
    while (list.size() < count) {
        StringBuilder sb = new StringBuilder(20);
        for (int i = 0; i < 10; i++) {
            sb.append(CHARS.charAt(rnd.nextInt(CHARS.length())));
        }
        list.add(sb.toString());
    }
    return list;
}

A little excerpt from the Javadoc of ArrayList:

The size, isEmpty, get, set, iterator, and listIterator operations run in constant time. The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking). The constant factor is low compared to that for the LinkedList implementation.

That means the get operation on your second list runs in constant time, O(1), which is fine from a performance point of view. But the contains and remove operations (on the first list) run in linear time, O(n). Calling these operations as many times as the second list has elements can take very long, especially if both lists are large.

Using a hashing data structure for the first one would result in constant time - O(1) - for calling the operations contains and remove. I would suggest using a HashSet for the first "list". But that only works if there are no duplicate rows, because a Set keeps at most one copy of equal elements.
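
As a minimal sketch of that suggestion (again with hypothetical rowsA and rowsB standing for the loaded tables):

Set<List<String>> a = new HashSet<List<String>>(rowsA); // duplicate rows collapse here
for (List<String> row : rowsB) {
    a.remove(row); // expected O(1) per row; a no-op if the row is absent
}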

But you should always profile before trying to optimize something. First make sure you are optimizing the right place.
