A more efficient structure than unordered_map<pair<int, int>, int>

Question

I have about 20,000,000 pair<int, int> values that I need to associate with ints. I did so with an unordered_map<pair<int, int>, int>. Profiling my algorithm shows that checking whether an entry exists or not

bool exists = myMap[make_pair(a, b)] != NULL

is the performance bottleneck. I thought that retrieving this information from an unordered_map would be really fast, as it is O(1). But constant time can be slow if the constant is big...

My hash function is

template <>
struct tr1::hash<pair<int, int> > {
public:
        size_t operator()(pair<int, int> x) const throw() {
             size_t h = x.first * 1 + x.second * 100000;
             return h;
        }
};

Do you know any better data structure for my problem?

Obviously I can't just store the information in a matrix, since that amount of memory wouldn't fit into any computer in existence. All I know about the distribution is that myMap[make_pair(a, a)] doesn't exist for any a, and that all ints are in a contiguous range from 0 to about 20,000,000.

Think of it as a sparse 20,000,000 x 20,000,000 matrix with about 20,000,000 entries, but never on the main diagonal.

Suggestion

Would a vector<pair<int, int>>* (array with N entries) be expected to be faster? The lookup for a would be trivial (just the index of the array), and then I would iterate through the vector, comparing the first value of each pair to b.

BIG UPDATE

I uploaded the raw data so you can see the structure.

Answer 1

Have you tried using myMap.find(make_pair(a,b)) != myMap.end()? operator[] creates the element if it does not exist. I would expect find to be faster.
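A minimal sketch of the difference between the two lookups. The names PairHash, PairMap, exists and example are made up for this illustration, and the hash is only a placeholder so that the map compiles; it is not the asker's hash.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>

// Hypothetical placeholder hash; any valid hash would do for this demo.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const {
        return static_cast<std::size_t>(p.first) * 31u
             + static_cast<std::size_t>(p.second);
    }
};

using PairMap = std::unordered_map<std::pair<int, int>, int, PairHash>;

// Read-only membership test: find() never modifies the map.
bool exists(const PairMap& m, int a, int b) {
    return m.find(std::make_pair(a, b)) != m.end();
}

void example() {
    PairMap m;
    m[std::make_pair(1, 2)] = 7;

    assert(exists(m, 1, 2));
    assert(!exists(m, 3, 4));
    assert(m.size() == 1);  // find() left the map untouched

    // operator[] default-constructs a value for the missing key (3, 4).
    bool viaIndex = m[std::make_pair(3, 4)] != 0;
    assert(!viaIndex);
    assert(m.size() == 2);  // the failed lookup grew the map
}
```

With millions of failed lookups, the operator[] version keeps inserting zero-valued entries, so the map grows and rehashes while it is only supposed to be read.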

Answer 2

First off, myMap[make_pair(a, b)] != NULL does not do what you think it does. It inserts the pair if it doesn't exist, and compares the mapped value to 0 (which is what NULL typically expands to). It does not check for existence at all. (Note that in modern C++, you should never use NULL. Use 0 for numbers and nullptr for pointers.)

As for the main topic, your hash function doesn't seem too good. Don't forget that arithmetic on ints is done in int. Since int is 32 bits on most compilers, its maximum value is a little over 2,000,000,000. So 20,000,000 * 100,000 is way bigger than that, leading to overflow (and undefined behaviour).

Given the amount of data you have, I assume you're on a 64-bit platform, which means size_t is 64 bits long. So you might get better results with a hash function like this:

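// Requires <climits> for CHAR_BIT.
size_t operator()(pair<int, int> x) const throw() {
     size_t f = x.first, s = x.second;
     // first goes into the upper half of the size_t, second into the lower half
     return (f << (CHAR_BIT * sizeof(size_t) / 2)) | s;
}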

This should produce significantly fewer collisions (and have defined behaviour) than what you have now.

If this doesn't help, you could also try a two-step approach:

std::unordered_map<int, std::unordered_map<int, int>>

Look up by x.first first, then by x.second. I don't know if this would help; measure and see.
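A rough sketch of what the two-step lookup could look like. NestedMap, lookup and example are names invented for this illustration; both levels use the default std::hash<int>, so no custom hash is needed.

```cpp
#include <cassert>
#include <unordered_map>

// Outer key is pair.first, inner key is pair.second.
using NestedMap = std::unordered_map<int, std::unordered_map<int, int>>;

// Returns a pointer to the mapped value, or nullptr if (a, b) is absent.
// Uses find() at both levels so that a miss never inserts anything.
const int* lookup(const NestedMap& m, int a, int b) {
    auto outer = m.find(a);
    if (outer == m.end()) return nullptr;
    auto inner = outer->second.find(b);
    if (inner == outer->second.end()) return nullptr;
    return &inner->second;
}

void example() {
    NestedMap m;
    m[1][2] = 42;  // operator[] is fine when inserting

    assert(lookup(m, 1, 2) != nullptr && *lookup(m, 1, 2) == 42);
    assert(lookup(m, 1, 3) == nullptr);  // row exists, column missing
    assert(lookup(m, 9, 2) == nullptr);  // row missing
    assert(m.size() == 1);               // failed lookups inserted nothing
}
```

Each lookup now costs two hash probes instead of one, and every inner map carries its own overhead, so whether this wins depends on the data; as the answer says, measure.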

Answer 3

The main thing is definitely to avoid adding default-constructed elements with every search:

bool exists = myMap[make_pair(a, b)] != NULL; // OUCH

bool exists = myMap.find(make_pair(a, b)) != myMap.end();  // BETTER

auto i = myMap.find(make_pair(a, b));
if (i != myMap.end()) ... else ...;      // MAY BE BEST - SEE BELOW

And the great hash challenge... woo hoo! This might be worth a shot, but a lot depends on how the numbers in the pairs are distributed and on your implementation's std::hash (which is often pass-through!):

    size_t operator()(pair<int, int> x) const throw() {
         size_t hf = std::hash<int>()(x.first);
         return (hf << 2) ^ (hf >> 2) ^ std::hash<int>()(x.second);
    }

You may also find it faster if you replace the pair with an int64_t, so that the key comparisons are definitely simple integer comparisons rather than cascaded ones.
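One way the single-integer key could be built, as a sketch. makeKey, KeyMap and example are hypothetical names; the sketch uses uint64_t rather than int64_t to keep the shift free of sign issues, and relies on the standard std::hash specialization for 64-bit integers.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Packs two 32-bit ints into one 64-bit key: a in the high half, b in the low.
// The values in the question are in 0..20,000,000, so nothing is lost.
inline std::uint64_t makeKey(int a, int b) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(a)) << 32)
         | static_cast<std::uint32_t>(b);
}

using KeyMap = std::unordered_map<std::uint64_t, int>;

void example() {
    KeyMap m;
    m[makeKey(1, 2)] = 7;

    assert(m.find(makeKey(1, 2)) != m.end());
    assert(m.find(makeKey(2, 1)) == m.end());  // order matters
    assert(m.size() == 1);
}
```

Key equality is now a single 64-bit comparison, and no custom hash functor is required.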

Also, what are you doing after the test for existence? If you need to access or change the value associated with the same key, then you should save the iterator that find returns and avoid another search.

Answer 4

As suggested, I went with a vector<pair<int, int>>* with N entries. It's about 40% faster than the unordered_map.
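The answer does not show its code, so the following is only a guess at the layout described in the question: one row per value of a, each row holding (b, value) pairs that are scanned linearly. SparseRows and its members are invented names, and a vector of vectors stands in for the raw array of vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// rows_[a] holds a (b, value) pair for every stored key (a, b).
// Assumes 0 <= a < n for every call, as in the question's index range.
class SparseRows {
public:
    explicit SparseRows(std::size_t n) : rows_(n) {}

    // Appends without checking for duplicates; callers insert each key once.
    void insert(int a, int b, int value) {
        rows_[static_cast<std::size_t>(a)].emplace_back(b, value);
    }

    // Linear scan of row a; returns nullptr when (a, b) is not stored.
    const int* find(int a, int b) const {
        for (const auto& entry : rows_[static_cast<std::size_t>(a)]) {
            if (entry.first == b) return &entry.second;
        }
        return nullptr;
    }

private:
    std::vector<std::vector<std::pair<int, int>>> rows_;
};

void example() {
    SparseRows rows(10);
    rows.insert(3, 7, 42);
    rows.insert(3, 8, 43);

    assert(rows.find(3, 7) != nullptr && *rows.find(3, 7) == 42);
    assert(rows.find(3, 9) == nullptr);  // row exists, b not present
    assert(rows.find(7, 3) == nullptr);  // empty row
}
```

With about as many entries as rows, the average row is very short, which is presumably why the linear scan holds up; a few very long rows would change that picture.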

Answer 5

I suggest you test with a better hash function. You can find examples if you search here on SO, but this is one possible implementation.

struct pair_hash {
    template <typename T1, typename T2>
    size_t operator()(const std::pair<T1, T2> &pr) const {
        using std::hash;
        return hash<T1>()(pr.first) ^ hash<T2>()(pr.second);
    }
};
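One property of plain XOR is worth checking before adopting it: it is symmetric, so (a, b) and (b, a) always hash to the same value, and any (a, a) hashes to 0. The sketch below contrasts it with an order-sensitive mixing step in the style of Boost's hash_combine. XorPairHash, CombinePairHash and example are names made up here, and whether the mixed version is actually faster on this data is an open question.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Plain XOR of the two element hashes, for comparison.
struct XorPairHash {
    template <typename T1, typename T2>
    std::size_t operator()(const std::pair<T1, T2>& pr) const {
        return std::hash<T1>()(pr.first) ^ std::hash<T2>()(pr.second);
    }
};

// Mixes the second hash into the first so that element order matters.
// The constant and shifts follow the well-known hash_combine recipe.
struct CombinePairHash {
    template <typename T1, typename T2>
    std::size_t operator()(const std::pair<T1, T2>& pr) const {
        std::size_t seed = std::hash<T1>()(pr.first);
        seed ^= std::hash<T2>()(pr.second) + 0x9e3779b9
              + (seed << 6) + (seed >> 2);
        return seed;
    }
};

void example() {
    // XOR cannot tell (1, 2) from (2, 1), and maps every (a, a) to 0.
    XorPairHash xorHash;
    assert(xorHash(std::make_pair(1, 2)) == xorHash(std::make_pair(2, 1)));
    assert(xorHash(std::make_pair(5, 5)) == 0);

    // Either functor is passed as the third template argument of the map.
    std::unordered_map<std::pair<int, int>, int, CombinePairHash> m;
    m[std::make_pair(1, 2)] = 7;
    assert(m.find(std::make_pair(1, 2)) != m.end());
    assert(m.find(std::make_pair(2, 1)) == m.end());
}
```

Colliding hashes do not break correctness, since the map still compares keys for equality; they only lengthen the bucket chains.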

5 answers

Answer 1
5 2014-07-11 07:48:44

Answer 2
3 2014-07-11 08:21:17

Answer 3
2 2014-07-11 10:08:03

Answer 4
1 accepted 2014-07-20 12:42:55

Answer 5
0 2014-07-11 08:32:26
