
How to test a hash function?

Originally posted 2008-12-24 22:50:59. Tags: algorithm, unit-testing, language-agnostic, testing, hash

Is there a way to test the quality of a hash function? I want a good spread when it is used in a hash table, and it would be great if this were verifiable in a unit test.

EDIT: For clarification, my problem was that I used long values in Java in such a way that the first 32 bits encoded one ID and the second 32 bits encoded another ID. Unfortunately, Java's hash of a long value just XORs the first 32 bits with the second 32 bits, which in my case led to very poor performance when the values were used as keys in a HashMap. So I need a different hash, and I would like a unit test so that this problem cannot creep back in.
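To make the problem concrete, here is a small sketch of how the built-in hash behaves on such packed keys. The `pack` helper and the class name are hypothetical, introduced only for illustration; `Long.hashCode(long)` is the standard library method (Java 8 and later), documented as `(int) (value ^ (value >>> 32))`.

```java
final class LongXorDemo {
    // Hypothetical helper: first ID in the high 32 bits, second ID in the low 32 bits.
    static long pack(int firstId, int secondId) {
        return ((long) firstId << 32) | (secondId & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        // The two halves are XORed, so the order of the IDs is lost.
        System.out.println(Long.hashCode(pack(1, 2)));
        System.out.println(Long.hashCode(pack(2, 1)));
        // Two equal IDs cancel each other out completely.
        System.out.println(Long.hashCode(pack(7, 7)));
    }
}
```

Swapped ID pairs hash to the same value, and any key whose two IDs are equal hashes to 0, which is one way many keys can end up in the same bucket.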

4 answers

First, I think you have to define for yourself what you mean by a good spread. Do you mean a good spread for all possible input, or just a good spread for likely input?

For example, if you're hashing strings that represent proper full (first + last) names, you probably won't care how strings containing ASCII digits hash.

As for testing, your best bet is probably to get a large or random set of the input data you expect, push it through the hash function, and see how the spread ends up. There is unlikely to be a magic program that can say "Yes, this is a good hash function for your use case." However, if you can programmatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verifies that the spread is within your definition of good.
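One possible shape for such a check is sketched below. Everything in it is an assumption made for illustration: the sample generator (all pairs of small IDs), the bucket count, and the definition of "good" (no bucket holds more than a chosen multiple of the average load). Substitute your own data and your own threshold.

```java
import java.util.function.LongToIntFunction;

final class SpreadCheck {
    // Size of the fullest bucket when each key goes to bucket floorMod(hash, buckets).
    static int maxBucketLoad(long[] keys, LongToIntFunction hash, int buckets) {
        int[] load = new int[buckets];
        int max = 0;
        for (long key : keys) {
            int index = Math.floorMod(hash.applyAsInt(key), buckets);
            load[index]++;
            max = Math.max(max, load[index]);
        }
        return max;
    }

    // Hypothetical sample data: every pair of IDs in [0, n), first ID in the high 32 bits.
    static long[] samplePairs(int n) {
        long[] keys = new long[n * n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                keys[a * n + b] = ((long) a << 32) | b;
            }
        }
        return keys;
    }

    // Assumed definition of "good": no bucket exceeds `tolerance` times the average load.
    static boolean spreadIsGood(long[] keys, LongToIntFunction hash, int buckets, int tolerance) {
        int average = Math.max(1, keys.length / buckets);
        return maxBucketLoad(keys, hash, buckets) <= tolerance * average;
    }

    public static void main(String[] args) {
        long[] keys = samplePairs(64);
        System.out.println(spreadIsGood(keys, k -> Long.hashCode(k), 1024, 4));
    }
}
```

In a unit test you would assert that `spreadIsGood` returns true for the hash you intend to ship. For this particular sample the built-in `Long.hashCode` does not pass, because XORing two IDs below 64 can only produce 64 distinct values.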

Edit: In your case, with a 64-bit long, is there really a reason to use a hash map at all? Why not use a balanced tree and use the long directly as the key rather than rehashing it? You pay a small penalty in overall node size (2x the size for the key value), but you may win it back in performance.
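In Java, one way to try the balanced-tree idea is `java.util.TreeMap`, which is a red-black tree ordered by key, so no hash function is involved. This is only a sketch; the `pack` helper and the stored values are made up for the example.

```java
import java.util.TreeMap;

final class TreeMapDemo {
    // Hypothetical helper: first ID in the high 32 bits, second ID in the low 32 bits.
    static long pack(int firstId, int secondId) {
        return ((long) firstId << 32) | (secondId & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        // Keys are compared as longs; lookups cost O(log n) regardless of bit patterns.
        TreeMap<Long, String> byKey = new TreeMap<>();
        byKey.put(pack(1, 2), "first");
        byKey.put(pack(2, 1), "second");
        System.out.println(byKey.get(pack(2, 1)));
    }
}
```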

You have to test your hash function using data drawn from the same (or a similar) distribution as the data you expect it to work on. For 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.

However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.

For the test itself, take your sample input values, hash each one, and put the results into a set. Compare the size of the resulting set with the size of the input set (assuming the inputs are distinct); the difference tells you the number of collisions your hash function is generating.
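The set-based count can be sketched in a few lines. The class and method names are hypothetical, and the code assumes the input array contains no duplicate keys.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongToIntFunction;

final class CollisionCount {
    // Number of keys whose hash value was already produced by an earlier key.
    // Assumes `keys` contains no duplicates.
    static int countCollisions(long[] keys, LongToIntFunction hash) {
        Set<Integer> seen = new HashSet<>();
        for (long key : keys) {
            seen.add(hash.applyAsInt(key));
        }
        return keys.length - seen.size();
    }

    public static void main(String[] args) {
        // Three distinct keys: (1, 2), (2, 1) and (3, 0) packed as high/low halves.
        long[] keys = { (1L << 32) | 2L, (2L << 32) | 1L, (3L << 32) };
        System.out.println(countCollisions(keys, k -> Long.hashCode(k)));
    }
}
```

A unit test can then assert that the count stays below whatever limit you consider acceptable for your sample.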

For your particular application, instead of simply XORing the two 32-bit values together, try combining them the way a typical good hash function would combine two independent ints, i.e., multiply one by a prime and add the other.
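A minimal sketch of the multiply-and-add combination follows. The prime 31 is an assumption (it is the multiplier `String.hashCode` uses); other odd primes would work as well, and the class name is made up.

```java
final class PairHash {
    // Splits the long into its two 32-bit halves and combines them as 31 * first + second.
    static int hash(long key) {
        int first = (int) (key >>> 32);
        int second = (int) key;
        return 31 * first + second;
    }

    public static void main(String[] args) {
        long a = (1L << 32) | 2L;  // IDs (1, 2)
        long b = (2L << 32) | 1L;  // IDs (2, 1)
        // Unlike XOR, swapping the two IDs changes the result.
        System.out.println(hash(a) + " vs " + hash(b));
    }
}
```

Note that `HashMap<Long, V>` always calls `Long.hashCode`, so a function like this only takes effect if the key is a type whose `hashCode` you control.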

If you're using a chaining hash table, what you really care about is the number of collisions. This is trivial to implement as a simple counter on your hash table: every time an item is inserted and the table has to chain, increment the counter. A better hashing algorithm will result in a lower number of collisions. A good general-purpose table hashing function to check out is djb2.
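For reference, djb2 is a string hash that starts from 5381 and multiplies by 33 for each character. A Java rendering might look like the sketch below; it is an adaptation, since the original is written in C with an unsigned long accumulator, while this version wraps around in a 32-bit int and iterates over UTF-16 chars.

```java
final class Djb2 {
    // djb2: start at 5381, then h = h * 33 + c for every character.
    static int hash(String s) {
        int h = 5381;
        for (int i = 0; i < s.length(); i++) {
            h = h * 33 + s.charAt(i);  // often written as ((h << 5) + h) + c
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("hello"));
    }
}
```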

Based on your clarification:

I have used long values in Java in such a way that the first 32 bit encoded an ID and the second 32 bit encoded another ID. Unfortunately Java's hash of long values just XORs the first 32 bit with the second 32 bits, which in my case led to very poor performance when used in a HashMap.

it appears you have some unhappy "resonances" between the way you assign the two ID values and the sizes of your HashMap instances.

Are you explicitly sizing your maps, or using the defaults? A quick-and-dirty check seems to indicate that a HashMap<Long,String> starts with a 16-bucket structure and doubles on overflow. That would mean that only the low-order bits of the ID values actually participate in hash bucket selection. You could try using one of the constructors that takes an initial-capacity parameter and create your maps with a prime initial size, although the OpenJDK implementation of java.util.HashMap rounds the requested capacity up to a power of two, so this may not change the bucket count in practice.
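The low-order-bit effect can be illustrated with a simplified model of a power-of-two table. This is not HashMap's actual code: the real `java.util.HashMap` also mixes the hash bits before indexing, so treat the sketch as an illustration of the concern rather than a reproduction of it. The class and method names are hypothetical.

```java
final class BucketIndexDemo {
    // Simplified model: with a power-of-two bucket count, the index is the low bits of the hash.
    static int bucketIndex(int hash, int bucketCount) {
        return hash & (bucketCount - 1);
    }

    public static void main(String[] args) {
        // Hash values that differ only above the low four bits share a bucket in a 16-slot table.
        for (int hash = 0; hash < 64; hash += 16) {
            System.out.println(hash + " -> bucket " + bucketIndex(hash, 16));
        }
    }
}
```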

Alternatively, Dave L's suggestion of defining your own hashing of long keys would allow you to avoid the low-bit-dependency problem.

Another way to look at this is that you're using a primitive type (long) as a way to avoid defining a real class. I'd suggest looking at the benefits you could gain by defining proper business classes and implementing hash coding, equality, and other methods as appropriate on those classes to manage this issue.
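As a sketch of what such a class could look like, the hypothetical `IdPair` below holds the two IDs as separate fields and defines its own `equals` and `hashCode`. The field names and the multiplier 31 are assumptions chosen for the example.

```java
final class IdPair {
    private final int firstId;
    private final int secondId;

    IdPair(int firstId, int secondId) {
        this.firstId = firstId;
        this.secondId = secondId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IdPair)) return false;
        IdPair other = (IdPair) o;
        return firstId == other.firstId && secondId == other.secondId;
    }

    @Override
    public int hashCode() {
        // Multiply-and-add instead of XOR, so (1, 2) and (2, 1) hash differently.
        return 31 * firstId + secondId;
    }
}
```

An `IdPair` can then be used directly as the key type, for example `Map<IdPair, String>`, and the hash function becomes something you can unit-test on its own.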