
Why is std::tr1::unordered_map slower than a homegrown hash map?

I wrote a basic program that takes strings and counts the occurrences of unique ones by inserting them into a string->integer hash map.

I use std::tr1::unordered_map for the storage, templated for a custom hash function and a custom equality function. The key type is actually char* rather than the too-slow std::string.

I then changed the same code to use a very, very simple hash table (really an array of {key, value} structures indexed by hash) with a power-of-two size and linear probing for collisions. The program got 33% faster.

Given that when I was using tr1::unordered_map I presized the hash table so it never had to grow, and that I was using exactly the same hash and comparison routines, what is tr1::unordered_map doing that slows it down by 50% as compared to the most basic hash map imaginable?

Code for the hash map type I'm talking about as "simple" here:

#include <stdlib.h> /* exit */
#include <string.h> /* strcmp */

/* HASHTABLE_SIZE must be a power of two so the mask works; these values
   are hypothetical stand-ins for the presizing used in the benchmark. */
#define HASHTABLE_SIZE (1u << 20)
#define HASHTABLE_SIZE_MASK (HASHTABLE_SIZE - 1)

typedef struct dataitem {
    char* item;
    size_t count;
} dataitem_t;

dataitem_t hashtable[HASHTABLE_SIZE] = {{NULL,0}}; // Start off with empty table

void insert(char* item) {
    // Bitmasking has the effect of hash %= HASHTABLE_SIZE (power-of-two size)
    size_t hash = generate_hash(item) & HASHTABLE_SIZE_MASK;
    size_t firsthash = hash; // Remember the starting bucket
    while (true) {
        if (hashtable[hash].item == NULL) { // Free bucket
            hashtable[hash].item = item;
            hashtable[hash].count = 1;
            break;
        }
        if (strcmp(hashtable[hash].item, item) == 0) { // Not a hash collision; same item
            hashtable[hash].count += 1;
            break;
        }
        // Hash collision.  Move to next bucket, wrapping around (linear probing).
        hash = (hash + 1) & HASHTABLE_SIZE_MASK;
        if (hash == firsthash) {
            // Probed every bucket: table is full.  This does not happen
            // because the presizing is correct.
            exit(1);
        }
    }
}

I wish to extend @AProgrammer's answer.

Your hash map is simple because it is custom-tailored to your needs. On the other hand, std::tr1::unordered_map has to fulfill a number of different tasks and do well in all cases. This requires a mean-performance approach, so it will never be excellent in any particular area.

Hash containers are very special in that there are many ways to implement them; you chose open addressing, while the standard effectively forces a bucket (chaining) approach on implementors. Both have different trade-offs, and this is one reason why the standard, this time, actually pinned down the implementation strategy: so that performance does not change dramatically when switching from one library to another. Simply specifying big-O / amortized complexity would not have been enough here.

You say that you instructed the unordered_map as to the final number of elements, but did you change the load factor? Chaining is notoriously "bad" (because of the lack of memory locality) in case of collisions, and using a smaller load factor would favor spreading out your elements.

Finally, to point out one difference: what happens when you resize your hash map ? By using chaining, the unordered_map does not move the elements in memory:

  • references to them are still valid (even though the iterators may be invalidated)
  • in case of big or complex objects, there is no invocation of copy constructors

This is in contrast with your simple implementation, which would incur O(N) copies on every resize (unless you use incremental rehashing to spread out the work, but that is definitely not simple).

It seems, therefore, that the choice for unordered_map was to smooth out the spikes, at the cost of a slower average insert.

There is something you can do, though: provide a custom allocator. Write an allocator specific to your use case that grabs all its memory in one go (you know how many objects will be inserted, and the allocator can report how much memory a node takes), then hand out nodes in a stack-like fashion (a simple pointer increment). It should improve the performance somewhat.

Your "homegrown hash map" is not a hash map at all, it's an intrusive hash set. And that's the reason it's faster. Simple as that.

Well, actually intrusive hash set isn't exact either, but it's the closest match.

In general, comparing the speed of components not built to the same spec isn't fair.

Without knowing exactly what you measured -- which mix of operations, at which load factor, with which mix of present/absent data -- it is difficult to explain where the difference comes from.

g++'s TR1 implementation resolves collisions by chaining. This implies dynamic allocation. But it also gives better performance at high load factors.

Your "homegrown" hash map is faster 1 than std::tr1::unordered_map because, as you yourself said, your homegrown hash map is "simple" and it doesn't handle checking if the hash table is full . And possibly many things that you're not checking before operating on it. That may be the reason why your hash map is faster than std::tr1::unordered_map .

Also, the performance of std::tr1::unordered_map depends on the implementation, so different implementations will perform differently speed-wise. You can read its source and compare it with yours; that is the first thing you can do, and I believe it will also answer your question to some extent.

¹ I just assumed your claim to be correct, and based the above on it.
