
HashTable: Determining table size and which hash function to use

1. If the input data entries are around 10 to the power of 9, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?
2. If we are using numbers in the range of 10 to the power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?

Kindly explain how these two things work. It's getting quite confusing. Thanks!!

I tried to make the table size around 75% of the input data size, which you can call X. Then I did key % X to get the hash code. But I am not sure if this is correct.

If the input data entries are around 10 to the power of 9, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?

The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, and higher load factors lead to more collisions, longer lists and longer search times. Still, at load factors under 5 or 10 with an ok hash function the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
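As a concrete illustration (not part of the original answer), here is a minimal sketch of turning an expected element count and a target maximum load factor into a bucket count; the function name is just illustrative:

```cpp
#include <cstddef>
#include <iostream>

// Buckets needed so that elements / buckets stays at or below max_load_factor.
std::size_t pick_bucket_count(std::size_t expected_elements,
                              double max_load_factor = 1.0) {
    return static_cast<std::size_t>(expected_elements / max_load_factor) + 1;
}

int main() {
    // ~10^9 elements at a max load factor of 1.0 -> roughly 10^9 buckets,
    // matching the range discussed above.
    std::cout << pick_bucket_count(1'000'000'000) << '\n';
}
```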

One important decision you should make is whether to pick a number around this magnitude that is a power of two, or a prime number. Explaining the implications is tedious, and anyway - which will work best for you is best determined by trying both and measuring the performance (if you really have to care about smallish differences in performance; if not - a prime number is the safer bet).

If we are using numbers in the range of 10 to the power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?

Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multimap can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which is usually expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
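A small sketch of the two folding approaches just described (the function names are illustrative, not from any library):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>

// General-case folding: works for any bucket count (prime or otherwise).
std::size_t fold_mod(std::uint64_t hash_value, std::size_t bucket_count) {
    return hash_value % bucket_count;
}

// Power-of-two folding: for 256 buckets, (hash & 255) == (hash % 256).
std::size_t fold_mask(std::uint64_t hash_value, std::size_t bucket_count) {
    return hash_value & (bucket_count - 1);  // bucket_count must be a power of two
}

int main() {
    std::uint64_t h = 123456789;
    std::cout << fold_mod(h, 1000003) << '\n';  // e.g. a prime bucket count
    std::cout << fold_mask(h, 256) << '\n';     // power-of-two bucket count
}
```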

I tried to make the table size around 75% of the input data size, which you can call X.

So that's a load factor of around 1/0.75 ≈ 1.33, which is ok.

Then I did key % X to get the hash code. But I am not sure if this is correct.

It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by all major C++ compiler Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two, they'd tend to map only to every 16th bucket), then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
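To make the difference concrete, here is a sketch contrasting the identity hash with a stronger integer mixer; this particular mixing pattern follows the well-known splitmix64 finalizer, but it is only one possible choice, not something prescribed by the original answer:

```cpp
#include <cstdint>
#include <iostream>

// Identity hash: the integer key is its own hash value.
std::uint64_t identity_hash(std::uint64_t key) {
    return key;
}

// A stronger integer hash: the splitmix64 finalizer (any good mixer would do).
std::uint64_t mixed_hash(std::uint64_t key) {
    key ^= key >> 30;
    key *= 0xbf58476d1ce4e5b9ULL;
    key ^= key >> 27;
    key *= 0x94d049bb133111ebULL;
    key ^= key >> 31;
    return key;
}

int main() {
    // Keys that are all multiples of 16 only reach every 16th bucket under the
    // identity hash with a power-of-two bucket count (here 256), but spread out
    // under the mixer.
    for (std::uint64_t i = 0; i < 5; ++i) {
        std::uint64_t key = i * 16;
        std::cout << (identity_hash(key) & 255) << ' '
                  << (mixed_hash(key) & 255) << '\n';
    }
}
```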

Rehashing

If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Implementing support for that can easily be done by first writing a hash table class that doesn't support rehashing - simply taking the number of buckets to use as a constructor argument. Then write an outer rehashing-capable hash table class with a data member of the above type, and when an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), you can construct an additional inner hash table object, telling the constructor a new, larger bucket count to use, then iterate over the smaller hash table inserting (or - better - moving, see below) the elements into the new hash table, then swap the two hash tables so the data member ends up with the new, larger content and the smaller one is destructed. By "moving" above I mean simply relinking the linked-list elements from the smaller hash table into the lists in the larger one, instead of deep copying the elements, which will be dramatically faster and momentarily use less memory while rehashing.
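A rough sketch of that outer/inner structure, under simplifying assumptions (a bare-bones separate-chaining inner table, growth by doubling, and plain re-insertion instead of relinking list nodes); all class and member names here are illustrative, not from any real library:

```cpp
#include <cstddef>
#include <forward_list>
#include <functional>
#include <utility>
#include <vector>

// Inner table: fixed bucket count, no rehashing support.
template <typename Key>
class FixedHashTable {
public:
    explicit FixedHashTable(std::size_t bucket_count) : buckets_(bucket_count) {}

    void insert(Key key) {
        const std::size_t b = bucket_of(key);
        buckets_[b].push_front(std::move(key));
        ++size_;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }

    // Exposed so the outer table can walk the chains during a rehash.
    std::vector<std::forward_list<Key>>& buckets() { return buckets_; }

private:
    std::size_t bucket_of(const Key& key) const {
        return std::hash<Key>{}(key) % buckets_.size();
    }

    std::vector<std::forward_list<Key>> buckets_;
    std::size_t size_ = 0;
};

// Outer table: grows the inner table when the load factor gets too high.
template <typename Key>
class HashTable {
public:
    explicit HashTable(std::size_t initial_buckets = 16) : inner_(initial_buckets) {}

    void insert(Key key) {
        if (load_factor() >= max_load_factor_)
            rehash(inner_.bucket_count() * 2);
        inner_.insert(std::move(key));
    }

    double load_factor() const {
        return static_cast<double>(inner_.size()) / inner_.bucket_count();
    }

private:
    void rehash(std::size_t new_bucket_count) {
        FixedHashTable<Key> bigger(new_bucket_count);
        // Re-insert (moving keys) into the larger table; relinking the
        // forward_list nodes directly would avoid even this work.
        for (auto& bucket : inner_.buckets())
            for (auto& key : bucket)
                bigger.insert(std::move(key));
        std::swap(inner_, bigger);  // the old, smaller table is destructed here
    }

    FixedHashTable<Key> inner_;
    double max_load_factor_ = 1.0;
};
```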
