More efficient structure than unordered_map<pair<int, int>, int>
I have about 20,000,000 pair<int, int>s which I need to associate with ints. I did so with an unordered_map<pair<int, int>, int>. Profiling my algorithm shows that checking whether an entry exists or not
bool exists = myMap[make_pair(a, b)] != NULL
is the performance bottleneck. I thought that retrieving this information from an unordered_map would be really fast, as it is O(1). But constant time can be slow if the constant is big...
My hash function is
template <>
struct tr1::hash<pair<int, int> > {
public:
    size_t operator()(pair<int, int> x) const throw() {
        size_t h = x.first * 1 + x.second * 100000;
        return h;
    }
};
Do you know any better data structure for my problem?
Obviously I can't just store the information in a matrix, since that amount of memory wouldn't fit into any computer in existence. All I know about the distribution is that myMap[make_pair(a, a)] doesn't exist for any a, and that all ints are in a continuous range from 0 to about 20,000,000.
Think of it as a sparse 20,000,000x20,000,000 matrix with about 20,000,000 entries, but never on the main diagonal.
Would a vector<pair<int, int>>* (an array with N entries) be expected to be faster? The lookup for a would be trivial (just the index of the array), and then I would iterate through the vector, comparing the first value of each pair to b.
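The idea in the question can be sketched roughly as follows. This is only an illustration, assuming a std::vector of vectors instead of a raw array of vectors; the names AdjacencyTable and find_value are hypothetical and do not come from the original post.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical layout: one bucket per value of `a`,
// each bucket holding (b, value) pairs.
using AdjacencyTable = std::vector<std::vector<std::pair<int, int>>>;

// Returns a pointer to the value stored for (a, b), or nullptr if absent.
inline const int* find_value(const AdjacencyTable& table, int a, int b) {
    if (a < 0 || static_cast<std::size_t>(a) >= table.size()) return nullptr;
    // Linear scan of the bucket for `a`; cheap while buckets stay short.
    for (const auto& entry : table[a]) {
        if (entry.first == b) return &entry.second;
    }
    return nullptr;
}
```

With about as many entries as rows, the average bucket holds roughly one pair, so the scan is short on average; a skewed distribution would change that, so it is worth measuring.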
I uploaded the raw data so you can see the structure.
Have you tried using myMap.find(make_pair(a,b)) != myMap.end()? operator[] creates the element if it does not exist. I would expect find to be faster.
First off, myMap[make_pair(a, b)] != NULL does not do what you think it does. It inserts the pair if it doesn't exist, and compares the mapped value to 0 (which is what NULL expands to). It does not check for existence at all. (Note that in modern C++, you should never use NULL. Use 0 for numbers and nullptr for pointers.)
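A small sketch of the difference, using an int-keyed map to keep it short. The helper names size_after_subscript and size_after_find are made up for illustration.

```cpp
#include <cstddef>
#include <unordered_map>

// Looks up `key` with operator[] and returns the map size afterwards.
inline std::size_t size_after_subscript(std::unordered_map<int, int>& m, int key) {
    bool nonzero = (m[key] != 0);  // inserts key -> 0 if key was absent
    (void)nonzero;
    return m.size();
}

// Looks up `key` with find and returns the map size afterwards.
inline std::size_t size_after_find(const std::unordered_map<int, int>& m, int key) {
    bool exists = (m.find(key) != m.end());  // never modifies the map
    (void)exists;
    return m.size();
}
```

On an empty map, the find version leaves the size at 0, while the operator[] version grows it to 1. Repeated over millions of failed lookups, those silent insertions also cost memory and trigger rehashing.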
As for the main topic, your hash function doesn't seem too good. Don't forget that arithmetic on ints is done in ints. Since on most compilers int is 32-bit, its maximum value is a little over 2,000,000,000. So 20,000,000 * 100,000 is way bigger than that, leading to overflow (and undefined behaviour).
Given the amount of data you have, I assume you're on a 64-bit platform, which means size_t is 64 bits long. So you might get better results with a hash function like this:
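// CHAR_BIT comes from <climits>
size_t operator()(pair<int, int> x) const throw() {
    size_t f = x.first, s = x.second;
    // first goes into the upper half of the bits, second into the lower half
    return f << (CHAR_BIT * sizeof(size_t) / 2) | s;
}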
This should produce significantly fewer collisions (and have defined behaviour) than what you have now.
If this doesn't help, you could also try a two-step approach:
std::unordered_map<int, std::unordered_map<int, int>>
Look up by x.first first, then by x.second. I don't know if this would help; measure and see.
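A minimal sketch of what an existence check against the nested map might look like. The alias NestedMap and the function contains_nested are hypothetical names chosen for this example.

```cpp
#include <unordered_map>

// Hypothetical alias for the two-step structure.
using NestedMap = std::unordered_map<int, std::unordered_map<int, int>>;

// Checks whether (a, b) is present without inserting anything.
inline bool contains_nested(const NestedMap& m, int a, int b) {
    auto outer = m.find(a);               // step 1: look up x.first
    if (outer == m.end()) return false;
    const auto& inner = outer->second;
    return inner.find(b) != inner.end();  // step 2: look up x.second
}
```

Both steps use find, so a failed lookup leaves the structure untouched. Each inner map carries its own overhead, which may matter with millions of small inner maps.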
The main thing is definitely to avoid adding default-constructed elements with every search:
bool exists = myMap[make_pair(a, b)] != NULL; // OUCH
bool exists = myMap.find(make_pair(a, b)) != myMap.end(); // BETTER
auto i = myMap.find(make_pair(a, b));
if (i != myMap.end()) ... else ...; // MAY BE BEST - SEE BELOW
And the great hash challenge... woo hoo! This might be worth a shot, but a lot depends on how the numbers in the pairs are distributed and on your implementation's std::hash (which is often pass-through!):
size_t operator()(pair<int, int> x) const throw() {
    // std::hash is a class template: instantiate it, then call the object
    size_t hf = std::hash<int>()(x.first);
    return (hf << 2) ^ (hf >> 2) ^ std::hash<int>()(x.second);
}
You may also find it faster if you replace the pair with a single int64_t, so that key comparisons are definitely simple integer comparisons rather than cascaded ones.
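One way to do that replacement is sketched below. It assumes both ints are non-negative (as in the question) and uses the unsigned std::uint64_t rather than int64_t to keep the shifts well defined; pack_key, PackedMap and contains_packed are hypothetical names.

```cpp
#include <cstdint>
#include <unordered_map>

// Packs two non-negative ints into one 64-bit key:
// `a` in the upper 32 bits, `b` in the lower 32 bits.
inline std::uint64_t pack_key(int a, int b) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(a)) << 32)
         | static_cast<std::uint32_t>(b);
}

// The standard library already provides std::hash for 64-bit integers,
// so no custom hash specialization is needed.
using PackedMap = std::unordered_map<std::uint64_t, int>;

// Existence check that does not insert.
inline bool contains_packed(const PackedMap& m, int a, int b) {
    return m.find(pack_key(a, b)) != m.end();
}
```

Whether this beats the pair key depends on how the implementation's std::hash treats 64-bit integers, so it should be profiled against the real data.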
Also, what are you doing after the test for existence? If you need to access or change the value associated with the same key, then you should save the iterator that find returns and avoid another search.
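As a sketch of that pattern, with an int-keyed map for brevity and a hypothetical helper name add_if_present:

```cpp
#include <unordered_map>

// Adds `delta` to the value stored under `key` if the key exists.
// Returns true if the key was found, false otherwise.
inline bool add_if_present(std::unordered_map<int, int>& m, int key, int delta) {
    auto it = m.find(key);        // single hash lookup
    if (it == m.end()) return false;
    it->second += delta;          // reuse the iterator, no second search
    return true;
}
```

The same shape works with the pair key: call find once, then read or write through the returned iterator.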
As suggested, I went with a vector<pair<int, int>>* with N entries. It's about 40% faster than the unordered_map.
I suggest you test with a better hash function. You can find examples if you search here on SO, but this is one possible implementation.
struct pair_hash {
    template <typename T1, typename T2>
    size_t operator()(const std::pair<T1, T2> &pr) const {
        using std::hash;
        return hash<T1>()(pr.first) ^ hash<T2>()(pr.second);
    }
};
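One caveat with a plain XOR is that it is symmetric, so (a, b) and (b, a) hash to the same value. A possible variation, sketched here and not part of the answer above, mixes the first hash before combining, using the formula popularized by Boost's hash_combine. The name pair_hash_mixed is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Variation on pair_hash: order-sensitive combination of the two hashes.
struct pair_hash_mixed {
    template <typename T1, typename T2>
    std::size_t operator()(const std::pair<T1, T2>& pr) const {
        std::size_t seed = std::hash<T1>()(pr.first);
        // Mixing step modelled on Boost's hash_combine formula.
        seed ^= std::hash<T2>()(pr.second) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// The hasher is passed as the third template argument of unordered_map.
using PairMap = std::unordered_map<std::pair<int, int>, int, pair_hash_mixed>;
```

Whether this actually reduces collisions for the data in the question is an assumption; it would need to be measured against the other hash functions in this thread.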