Java hash collision probability

Question

I'm storing a large number of objects (~2.8 million, each with a unique combination of values held in a byte array) in a HashMap. When checking for hash code collisions (32-bit hash), I'm very surprised to find none, although statistically I have a nearly 100% chance of at least one collision (cf. http://preshing.com/20110504/hash-collision-probabilities/ ).

I am thus wondering whether my approach to detecting collisions is bugged or whether I'm extremely lucky...

Here is how I try to detect collisions among the 2.8 million keys stored in the map:

HashMap<ShowdownFreqKeysVO, Double> values;
// ... fill with 2.8 million unique values ...
HashSet<Integer> hashes = new HashSet<>();
for (ShowdownFreqKeysVO key:values.keySet()){
    if (hashes.contains(key.hashCode())) throw new RuntimeException("Duplicate hash for:"+key);
    hashes.add(key.hashCode());
}

And here is how the object computes its hash value:

import java.util.Arrays;

public class ShowdownFreqKeysVO {
    //Values for the different parameters
    public byte[] values = new byte[12];

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + Arrays.hashCode(values);
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        ShowdownFreqKeysVO other = (ShowdownFreqKeysVO) obj;
        if (!Arrays.equals(values, other.values))
            return false;
        return true;
    }
}

Any idea or hint about what I'm doing wrong would be greatly appreciated!

Thanks, Thomas

Answer 1

I don't believe in luck.

This is the implementation of Arrays.hashCode that you use (shown here for int[]; the byte[] overload follows the same logic):

public static int hashCode(int a[]) {
    if (a == null)
        return 0;

    int result = 1;
    for (int element : a)
        result = 31 * result + element;

    return result;
}

If your values happen to be smaller than 31, they are treated like the digits of a number in base 31, so each distinct array results in a different number (if we ignore overflows for now). Let's call those pure hashes.
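To illustrate the "pure hash" claim, here is a minimal sketch that enumerates every byte array of a short length with elements below 31 and counts the distinct Arrays.hashCode values. The class name PureHashDemo and the choice of length 4 are assumptions made for this illustration; they are not from the question. With length 4 the largest hash stays far below 2^31, so no overflow occurs and every array should get its own hash.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class, not part of the original question.
class PureHashDemo {
    // Enumerates every byte array of the given length whose elements lie in
    // [0, radix) and returns how many distinct Arrays.hashCode values occur.
    // Assumes 1 <= radix <= 128 so that each element fits in a byte.
    static int countDistinctHashes(int length, int radix) {
        Set<Integer> seen = new HashSet<>();
        byte[] a = new byte[length];
        while (true) {
            seen.add(Arrays.hashCode(a));
            // Advance the array like an odometer in base `radix`.
            int pos = length - 1;
            while (pos >= 0 && a[pos] == radix - 1) {
                a[pos] = 0;
                pos--;
            }
            if (pos < 0) {
                return seen.size(); // wrapped around: all arrays visited
            }
            a[pos]++;
        }
    }

    // Number of arrays enumerated above: radix^length.
    static long totalArrays(int length, int radix) {
        long total = 1;
        for (int i = 0; i < length; i++) {
            total *= radix;
        }
        return total;
    }

    public static void main(String[] args) {
        int length = 4;
        int radix = 31;
        System.out.println("arrays:          " + totalArrays(length, radix));
        System.out.println("distinct hashes: " + countDistinctHashes(length, radix));
    }
}
```

If both printed numbers agree, the hash is collision-free on that input set, which is what the base-31 argument predicts as long as nothing overflows.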

Now of course 31^11 is way larger than the number of integers in Java, so we will get tons of overflows. But since the powers of 31 and the maximum integer are "very different" (31 is odd, so it shares no factor with 2^32), you don't get an almost random distribution, but a very regular uniform distribution.

Let's consider a smaller example. I assume you have only 2 elements in your array, each in the range from 0 to 4. I try to create a "hashCode" between 0 and 37 by taking the "pure hash" modulo 38. The result is that I get streaks of 5 integers with small gaps in between, and not a single collision.

val hashes = for {
  i <- 0 to 4
  j <- 0 to 4
} yield (i * 31 + j) % 38

println(hashes.size) // prints 25
println(hashes.toSet.size) // prints 25

To verify whether this is what happens to your numbers, you might create a graph as follows: for each hash, take the first 16 bits for x and the second 16 bits for y, and color that dot black. I bet you will see an extremely regular pattern.
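A rough sketch of how the coordinates for such a graph could be computed is given below. It assumes "first 16 bits" means the upper half of the 32-bit hash and "second 16 bits" the lower half; the class HashPlotSketch, its method names, and the scaled-down boolean grid are all hypothetical choices for illustration. Rendering the grid as an image is left out.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original answer.
class HashPlotSketch {
    // Upper 16 bits of the hash as an unsigned value in [0, 65535].
    static int x(int hash) {
        return hash >>> 16;
    }

    // Lower 16 bits of the hash as an unsigned value in [0, 65535].
    static int y(int hash) {
        return hash & 0xFFFF;
    }

    // Marks each hash on a size-by-size grid, scaling the 16-bit
    // coordinates down so the grid stays small enough to hold in memory.
    // Assumes 1 <= size <= 32767 to avoid int overflow in the scaling.
    static boolean[][] plot(int[] hashes, int size) {
        boolean[][] grid = new boolean[size][size];
        for (int h : hashes) {
            grid[x(h) * size / 65536][y(h) * size / 65536] = true;
        }
        return grid;
    }

    public static void main(String[] args) {
        // A few sample hashes; in practice these would be the map's key hashes.
        int[] hashes = {
            Arrays.hashCode(new byte[] {1, 2, 3}),
            Arrays.hashCode(new byte[] {1, 2, 4}),
            Arrays.hashCode(new byte[] {30, 30, 30})
        };
        boolean[][] grid = plot(hashes, 256);
        int occupied = 0;
        for (boolean[] row : grid) {
            for (boolean cell : row) {
                if (cell) {
                    occupied++;
                }
            }
        }
        System.out.println("occupied cells: " + occupied);
    }
}
```

Scaling down merges neighbouring hashes into one cell, so a fine pattern may need a larger grid or a zoomed-in window to show up.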

Answer 2

I see nothing wrong with your code, but the analysis you link to assumes that hashCodes are uniformly distributed, and that the hashCodes of different objects are independent random variables.

The latter may not be true: you know that the objects are unique (and therefore not independent). Perhaps that particular brand of uniqueness is preserved by the hashCode function.
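For reference, the probability the linked analysis predicts can be sketched as follows, under exactly those two assumptions (uniform and independent hashes). The class and method names are made up for this example. It computes 1 - prod(1 - i/N) for i = 1..n-1, summing logarithms to avoid underflow in the product.

```java
// Hypothetical helper illustrating the birthday bound; not from the question.
class BirthdayBound {
    // Probability that n independent, uniformly distributed hashes over
    // `buckets` possible values contain at least one collision.
    static double collisionProbability(long n, double buckets) {
        if (n > buckets) {
            return 1.0; // pigeonhole: more hashes than possible values
        }
        double logNoCollision = 0.0;
        for (long i = 1; i < n; i++) {
            // The i-th new hash must avoid the i values already taken.
            logNoCollision += Math.log1p(-i / buckets);
        }
        return -Math.expm1(logNoCollision); // 1 - P(no collision)
    }

    public static void main(String[] args) {
        double buckets = 4294967296.0; // 2^32 possible int hash codes
        System.out.println(collisionProbability(2_800_000L, buckets));
    }
}
```

For 2.8 million hashes the exponent n(n-1)/(2N) is about 913, so under these assumptions a collision-free result is practically impossible. Seeing none therefore points to the assumptions failing for this data, not to luck.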

Solution 1: 5, accepted, 2013-12-21 15:44:08

Solution 2: 0, 2013-12-21 15:03:02
