More efficient structure than unordered_map<pair<int, int>, int>

I have about 20,000,000 pair<int, int> values that I need to associate with ints. I did so with an unordered_map<pair<int, int>, int>. Profiling my algorithm shows that checking whether an entry exists or not

bool exists = myMap[make_pair(a, b)] != NULL

is the performance bottleneck. I thought that retrieving this information from an unordered_map would be really fast, as it is O(1). But constant time can be slow if the constant is big...

My hash function is

template <>
struct tr1::hash<pair<int, int> > {
public:
        size_t operator()(pair<int, int> x) const throw() {
             size_t h = x.first * 1 + x.second * 100000;
             return h;
        }
};

Do you know any better data structure for my problem?

Obviously I can't just store the information in a matrix, since that amount of memory wouldn't fit into any computer in existence. All I know about the distribution is that myMap[make_pair(a, a)] doesn't exist for any a, and that all ints are in a contiguous range from 0 to about 20,000,000.

Think of it as a sparse 20,000,000 x 20,000,000 matrix with about 20,000,000 entries, but never on the main diagonal.

Suggestion

Would a vector<pair<int, int>>* (array with N entries) be expected to be faster? The lookup for a would be trivial (just the index of the array) and then I would iterate through the vector, comparing the first value of the pair to b.

BIG UPDATE

I uploaded the raw data so you can see the structure.

Have you tried using myMap.find(make_pair(a,b)) != myMap.end()? operator[] creates the element if it does not exist. I would expect find to be faster.

First off, myMap[make_pair(a, b)] != NULL does not do what you think it does. It inserts the pair if it doesn't exist, and compares the mapped value to 0 (which is what NULL expands to). It does not check for existence at all. (Note that in modern C++, you should never use NULL. Use 0 for numbers and nullptr for pointers.)
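
To make the difference concrete, here is a minimal sketch. It uses std::map only so that no custom hash is needed; operator[] inserts a default value in unordered_map in the same way. The helper names and the PairMap alias are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

using PairMap = std::map<std::pair<int, int>, int>;

// Hypothetical helper: runs the operator[] "existence check" and reports the
// container size afterwards.
std::size_t size_after_bracket_check(PairMap& m, int a, int b) {
    bool exists = m[std::make_pair(a, b)] != 0;  // inserts (a, b) -> 0 when missing
    (void)exists;
    return m.size();
}

// Hypothetical helper: the same check with find(), which never modifies the map.
std::size_t size_after_find_check(const PairMap& m, int a, int b) {
    bool exists = m.find(std::make_pair(a, b)) != m.end();
    (void)exists;
    return m.size();
}

// Usage: the find() check leaves the map empty, the operator[] check grows it.
void usage_example() {
    PairMap m;
    assert(size_after_find_check(m, 1, 2) == 0);
    assert(size_after_bracket_check(m, 1, 2) == 1);
}
```

With 20,000,000 real entries, every failed operator[] lookup adds one more element, so the table keeps growing while you query it.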

As for the main topic, your hash function doesn't seem too good. Don't forget that arithmetic on ints is done in ints. Since on most compilers int is 32-bit, its maximum value is a little over 2,000,000,000. So 20,000,000 * 100,000 is way bigger than that, leading to overflow (and undefined behaviour).

Given the size of your data, I assume you're on a 64-bit platform, which means size_t is 64 bits long. So you might get better results with a hash function like this:

size_t operator()(pair<int, int> x) const throw() {
     size_t f = x.first, s = x.second;
     return f << (CHAR_BIT * sizeof(size_t) / 2) | s;
}

This should produce significantly fewer collisions (and have defined behaviour) than what you have now.
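
For reference, a self-contained sketch of how such a packing hash could be plugged into std::unordered_map. It assumes a 32-bit int and a 64-bit size_t; the names PackedPairHash, PairMap and contains are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical functor: packs both ints into a single 64-bit value.
struct PackedPairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        // Convert through uint32_t so a negative value cannot sign-extend
        // into the upper half.
        std::uint64_t hi = static_cast<std::uint32_t>(p.first);
        std::uint64_t lo = static_cast<std::uint32_t>(p.second);
        return static_cast<std::size_t>((hi << 32) | lo);
    }
};

using PairMap = std::unordered_map<std::pair<int, int>, int, PackedPairHash>;

// Existence test that does not insert anything.
bool contains(const PairMap& m, int a, int b) {
    return m.find(std::make_pair(a, b)) != m.end();
}

// Usage: (3, 7) and (7, 3) are distinct keys.
void usage_example() {
    PairMap m;
    m[std::make_pair(3, 7)] = 42;
    assert(contains(m, 3, 7));
    assert(!contains(m, 7, 3));
}
```

If size_t is only 32 bits, the final cast drops the upper half, so the hash is still valid but depends on x.second alone.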

If this doesn't help, you could also try a two-step approach:

std::unordered_map<int, std::unordered_map<int, int>>

Look up by x.first first, then by x.second. I don't know if this would help; measure and see.
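
A rough sketch of what the two-step lookup might look like. The NestedMap alias and the lookup helper are hypothetical names, and find is used at both levels so that a failed lookup inserts nothing.

```cpp
#include <cassert>
#include <unordered_map>

using NestedMap = std::unordered_map<int, std::unordered_map<int, int>>;

// Returns a pointer to the stored value, or nullptr if (a, b) is absent.
const int* lookup(const NestedMap& m, int a, int b) {
    auto outer = m.find(a);                 // step 1: find the row for a
    if (outer == m.end()) return nullptr;
    auto inner = outer->second.find(b);     // step 2: find b inside that row
    if (inner == outer->second.end()) return nullptr;
    return &inner->second;
}

// Usage: only (1, 2) was stored.
void usage_example() {
    NestedMap m;
    m[1][2] = 5;
    const int* hit = lookup(m, 1, 2);
    assert(hit != nullptr && *hit == 5);
    assert(lookup(m, 1, 3) == nullptr);
    assert(lookup(m, 2, 1) == nullptr);
}
```

Both levels can use the default std::hash<int>, so no custom hash for pair is needed, at the cost of one small hash table per distinct first value.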

The main thing is definitely to avoid adding default-constructed elements with every search:

bool exists = myMap[make_pair(a, b)] != NULL; // OUCH

bool exists = myMap.find(make_pair(a, b)) != myMap.end();  // BETTER

auto i = myMap.find(make_pair(a, b));
if (i != myMap.end()) ... else ...;      // MAY BE BEST - SEE BELOW

And the great hash challenge... woo hoo! This might be worth a shot, but a lot depends on how the numbers in the pairs are distributed and on your implementation's std::hash (which is often pass-through!):

    size_t operator()(pair<int, int> x) const throw() {
         // std::hash is a class template: instantiate it, then call the object
         size_t hf = std::hash<int>()(x.first);
         return (hf << 2) ^ (hf >> 2) ^ std::hash<int>()(x.second);
    }

You may also find it faster if you replace the pair with a single int64_t, so that the key comparisons are definitely simple integer comparisons rather than cascaded.
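
One way this could look, as a sketch only. It uses std::uint64_t rather than int64_t so the shift is well-defined, and assumes a 32-bit int; make_key and KeyMap are made-up names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper: first value in the upper 32 bits, second in the lower 32.
inline std::uint64_t make_key(int a, int b) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(a)) << 32)
         | static_cast<std::uint32_t>(b);
}

// The standard library already provides std::hash for 64-bit integers.
using KeyMap = std::unordered_map<std::uint64_t, int>;

// Usage: the key order matters, so (3, 7) and (7, 3) do not clash.
void usage_example() {
    KeyMap m;
    m[make_key(3, 7)] = 9;
    assert(m.count(make_key(3, 7)) == 1);
    assert(m.count(make_key(7, 3)) == 0);
}
```

Whether this beats the pair key depends on the implementation's std::hash for 64-bit integers, so it is worth measuring.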

Also, what are you doing after the test for existence? If you need to access or change the value associated with the same key, then you should save the iterator that find returns and avoid another search.
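
A small sketch of that pattern, written as a template so it works with map and unordered_map alike. The helper name increment_if_present is invented for illustration.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical helper: bumps the value stored under `key` if it exists.
// The key is searched for exactly once; the iterator is reused for the update.
template <typename Map, typename Key>
bool increment_if_present(Map& m, const Key& key) {
    auto it = m.find(key);
    if (it == m.end()) return false;  // absent: nothing is inserted
    ++it->second;                     // present: update without a second lookup
    return true;
}

// Usage with a std::map, which needs no custom hash.
void usage_example() {
    std::map<std::pair<int, int>, int> m;
    m[std::make_pair(1, 2)] = 10;
    assert(increment_if_present(m, std::make_pair(1, 2)));
    assert(m[std::make_pair(1, 2)] == 11);
    assert(!increment_if_present(m, std::make_pair(2, 1)));
}
```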

As suggested, I went with a vector<pair<int, int>>* with N entries. It's about 40% faster than the unordered_map.
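
The actual code was not posted, so the following is only a guess at the shape of that structure: one small vector per first value a, scanned linearly for b. It uses a vector of vectors instead of a raw array, and the class name SparseRows is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Row a holds (b, value) pairs. Callers must ensure 0 <= a < n.
class SparseRows {
public:
    explicit SparseRows(std::size_t n) : rows_(n) {}

    // Stores value under (a, b), overwriting an existing entry.
    void set(int a, int b, int value) {
        auto& row = rows_[static_cast<std::size_t>(a)];
        for (auto& entry : row) {
            if (entry.first == b) { entry.second = value; return; }
        }
        row.emplace_back(b, value);
    }

    // Returns a pointer to the value for (a, b), or nullptr if absent.
    const int* find(int a, int b) const {
        const auto& row = rows_[static_cast<std::size_t>(a)];
        for (const auto& entry : row) {
            if (entry.first == b) return &entry.second;
        }
        return nullptr;
    }

private:
    std::vector<std::vector<std::pair<int, int>>> rows_;
};

// Usage: a tiny 4-row instance.
void usage_example() {
    SparseRows rows(4);
    rows.set(1, 3, 42);
    assert(rows.find(1, 3) != nullptr && *rows.find(1, 3) == 42);
    assert(rows.find(3, 1) == nullptr);
}
```

This works well when rows are short on average (here about one entry per row); a few very long rows would make the linear scan the new bottleneck.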

I suggest you test with a better hash function. You can find examples if you search here on SO, but this is one possible implementation.

struct pair_hash {
    template <typename T1, typename T2>
    size_t operator()(const std::pair<T1, T2> &pr) const {
        using std::hash;
        return hash<T1>()(pr.first) ^ hash<T2>()(pr.second);
    }
};
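
One caveat with plain XOR: it is symmetric, so (a, b) and (b, a) always get the same hash. A possible alternative, sketched here under the assumption that an order-sensitive mix is wanted, follows the hash_combine recipe popularized by Boost. The struct name PairHashCombine is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical functor: mixes the two element hashes in an order-dependent way.
struct PairHashCombine {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        std::size_t seed = 0;
        combine(seed, std::hash<int>()(p.first));
        combine(seed, std::hash<int>()(p.second));
        return seed;
    }

private:
    // Boost-style mixing step: the previous seed is shifted both ways and
    // added in, so swapping the inputs changes the result.
    static void combine(std::size_t& seed, std::size_t h) noexcept {
        seed ^= h + 0x9e3779b9u + (seed << 6) + (seed >> 2);
    }
};

// Usage: plug it in as the third template argument.
void usage_example() {
    std::unordered_map<std::pair<int, int>, int, PairHashCombine> m;
    m[std::make_pair(1, 2)] = 5;
    assert(m.find(std::make_pair(1, 2)) != m.end());
    assert(m.find(std::make_pair(2, 1)) == m.end());
}
```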
