more efficient structure as unordered_map<pair<int, int>, int>

Question

I have about 20,000,000 pair<int, int> which I need to associate to int s. I did so with an unordered_map<pair<int, int>, int> . Profiling my algorithm shows that checking whether an entry exists or not

bool exists = myMap[make_pair(a, b)] != NULL

is the performance bottleneck. I thought that retrieving this information from an unordered_map would be really fast, as it is O(1) . But constant time can be slow if the constant is big...

My hash-function is

template <>
struct tr1::hash<pair<int, int> > {
public:
        size_t operator()(pair<int, int> x) const throw() {
             size_t h = x.first * 1 + x.second * 100000;
             return h;
        }
};

Do you know any better data-structure for my problem?

Obviously I can't just store the information in a matrix, hence the amount of memory wouldn't fit into any computer in existence. All I know about the distribution is that myMap[make_pair(a, a)] doesn't exist for any a . And that all int s are in a continuous range from 0 to about 20,000,000.

Think of it as a sparse 20,000,000x20,000,000-Matrix with about 20,000,000 entries but never on the main diagonal.

Suggestion

Would a vector<pair<int, int>>* (array with N entries) expected to be faster? The lookup for a would be trivial (just the index of the array) and then I would iterate through the vector, comparing the first value of the pair to b .

BIG UPDATE

I uploaded the raw data so you can see the structure.

Answer 1

Have you tried using myMap.find(make_pair(a,b)) != myMap.end() ? operator[] creates the element if it does not exist. I would expect find to be faster.

Answer 2

First off, myMap[make_pair(a, b)] != NULL does not do what you think it does. It inserts the pair if it doesn't exist, and compares the mapped value to 0 (which is what NULL expands to). It does not check for existence at all. (Note that in modern C++, you should never use NULL . Use 0 for numbers and nullptr for pointers).

As for the main topic, your hash function doesn't seem too good. Don't forget that arithmetic on int s is done in int s. Since on most compilers int is 32-bit, its maximum value is a little over 2,000,000,000. So 20,000,000 * 10,000 is way bigger than that, leading to overflow (and undefined behaviour).

Given the number of your data, I assume you're on a 64-bit platform, which means size_t is 64 bits long. So you might get better results with a hash function like this:

size_t operator()(pair<int, int> x) const throw() {
     size_t f = x.first, s = x.second;
     return f << (CHAR_BIT * sizeof(size_t) / 2) | s;
}

This should produce significantly less collisions (and have defined behaviour) that what you have now.

If this doesn't help, you could also try a two-step approach:

std::unordered_map<int, std::unordered_map<int, int>>

Lookup by x.first first, then by x.second . I don't know if this would help; measure and see.

Answer 3

Main thing is definitely to avoid adding default-constructed elements with every search:

bool exists = myMap[make_pair(a, b)] != NULL; // OUCH

bool exists = myMap.find(make_pair(a, b)) != myMap.end();  // BETTER

iterator i = myMap.find(make_pair(a, b);
if (i != myMap.end()) ... else ...;      // MAY BE BEST - SEE BELOW

And the great hash challenge... woo hoo! This might be worth a shot, but a lot depends on how the numbers in the pairs are distributed and your implementation's std::hash (which is often pass-through!):

    size_t operator()(pair<int, int> x) const throw() {
         size_t hf = std::hash(x.first);
         return (hf << 2) ^ (hf >> 2) ^ std::hash(x.second);
    }

You may also find it faster if you replace the pair with int64_t s, so that the key comparisons are definitely simple integer comparisons rather than cascaded.

Also, what are you doing after the test for existence? If you need to access/change the value associated with the same key then you should save the iterator find returns and avoid another search.

Answer 4

As suggestet, I went with a vector<pair<int, int>>* with N entries. It's about 40% faster than the unordered_map .

Answer 5

I suggest you test with a better hash function. You can find examples if you search here on SO but this is one possible implementation.

struct pair_hash {
    template <typename T1, typename T2>
    size_t operator()(const std::pair<T1, T2> &pr) const {
        using std::hash;
        return hash<T1>()(pr.first) ^ hash<T2>()(pr.second);
    }
};

more efficient structure as unordered_map<pair<int, int>, int>

Question

Suggestion

BIG UPDATE

5 answers

solution1
5 2014-07-11 07:48:44

solution2
3 2014-07-11 08:21:17

solution3
2 2014-07-11 10:08:03

solution4
1 ACCPTED 2014-07-20 12:42:55

solution5
0 2014-07-11 08:32:26

more efficient structure as unordered_map<pair<int, int>, int>

Question

Suggestion

BIG UPDATE

5 answers

solution1 5 2014-07-11 07:48:44

solution2 3 2014-07-11 08:21:17

solution3 2 2014-07-11 10:08:03

solution4 1 ACCPTED 2014-07-20 12:42:55

solution5 0 2014-07-11 08:32:26

solution1
5 2014-07-11 07:48:44

solution2
3 2014-07-11 08:21:17

solution3
2 2014-07-11 10:08:03

solution4
1 ACCPTED 2014-07-20 12:42:55

solution5
0 2014-07-11 08:32:26