
What is the purpose of this code in IdentityHashMap.hash()?

/**
 * Returns index for Object x.
 */
private static int hash(Object x, int length) {
    int h = System.identityHashCode(x);
    // Multiply by -127, and left-shift to use least bit as part of hash
    return ((h << 1) - (h << 8)) & (length - 1);
}

From: jdk/IdentityHashMap.java at jdk8-b120 · openjdk/jdk · GitHub

In theory, the hash values returned by System.identityHashCode() are already uniformly distributed, so why is there an additional shift-and-subtract step instead of a direct AND with length - 1?

The implementation seems to guarantee that the lowest bit is 0 so that the result is always an even number, because it requires all keys to sit at even indices and all values at odd indices.
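A minimal sketch of that observation, assuming a power-of-two table length: it applies the same expression to a few hand-picked int values instead of real identity hash codes and checks that it matches the "multiply by -127, then left-shift" reading of the source comment. The class name EvenIndexCheck and the sample values are made up for illustration.

```java
class EvenIndexCheck {
    // Same expression as IdentityHashMap.hash(), but taking the hash code
    // directly so the arithmetic can be tried on chosen values.
    static int index(int h, int length) {
        return ((h << 1) - (h << 8)) & (length - 1);
    }

    public static void main(String[] args) {
        int length = 64; // assumed power-of-two table length
        int[] samples = {0, 1, 2, 3, 12345, -1, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            int viaShifts = (h << 1) - (h << 8);   // 2h - 256h = -254h (mod 2^32)
            int viaMultiply = (-127 * h) << 1;     // (-127h) * 2  = -254h (mod 2^32)
            int idx = index(h, length);
            System.out.println("h=" + h
                    + " index=" + idx
                    + " even=" + ((idx & 1) == 0)
                    + " sameAsMultiply=" + (viaShifts == viaMultiply));
        }
    }
}
```

Both terms of the subtraction have a zero lowest bit, so the difference does too, and masking with length - 1 cannot set it.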

h << 8 seems to mix the low and high bits to handle the case where System.identityHashCode() is implemented as a memory address or an incrementing value. It is not clear why the shift is only 8 bits here, whereas HashMap.hash() shifts by 16.

The comments in the code say:

"Implementation note: This is a simple linear-probe hash table, as described for example in texts by Sedgewick and Knuth. The array alternates holding keys and values."

In fact, the hash method returns a value that is used as a direct index into the array. For example:

public V get(Object key) {
    Object k = maskNull(key);
    Object[] tab = table;
    int len = tab.length;
    int i = hash(k, len);
    while (true) {
        Object item = tab[i];
        if (item == k)
            return (V) tab[i + 1];
        if (item == null)
            return null;
        i = nextKeyIndex(i, len);
    }
}

That means that hash needs to return an even value. The calculation in hash ensures that the index is even without throwing away the bottom bit of the System.identityHashCode(x) value.
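To make the layout concrete, here is a rough sketch of a table that alternates keys and values in a single array. It is not the JDK class: PairTableSketch is a hypothetical name, there is no resizing, removal, or null-key handling, and the body of nextKeyIndex is an assumption about what a wrap-around step over key slots looks like.

```java
class PairTableSketch {
    // Keys live at even indices; each value sits at the following odd index.
    private final Object[] table;

    PairTableSketch(int capacity) {
        // capacity is assumed to be a power of two; two array slots per entry
        table = new Object[2 * capacity];
    }

    private static int hash(Object x, int length) {
        int h = System.identityHashCode(x);
        return ((h << 1) - (h << 8)) & (length - 1);
    }

    // Advance to the next key slot, wrapping around at the end of the array.
    private static int nextKeyIndex(int i, int len) {
        return (i + 2 < len) ? i + 2 : 0;
    }

    // Assumes non-null keys and fewer entries than capacity (no resizing here).
    void put(Object key, Object value) {
        int len = table.length;
        int i = hash(key, len);
        while (table[i] != null && table[i] != key) {
            i = nextKeyIndex(i, len);
        }
        table[i] = key;        // even index
        table[i + 1] = value;  // odd index right after it
    }

    Object get(Object key) {
        int len = table.length;
        int i = hash(key, len);
        while (true) {
            Object item = table[i];
            if (item == key) return table[i + 1]; // reference equality, not equals()
            if (item == null) return null;
            i = nextKeyIndex(i, len);
        }
    }

    public static void main(String[] args) {
        PairTableSketch t = new PairTableSketch(8);
        String a = new String("k");
        String b = new String("k"); // equal to a, but a different object
        t.put(a, 1);
        t.put(b, 2);
        System.out.println(t.get(a) + " " + t.get(b));
    }
}
```

If hash ever returned an odd index, a key would land in a value slot and tab[i + 1] could run off the end of the array, which is why the evenness matters.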

Why not just throw away the bottom bit?

Well, the answer lies in the way System.identityHashCode is implemented. In reality, there are multiple algorithms for generating the hash, and the algorithm used (at runtime) depends on an obscure JVM command line option.

  • Some algorithms are (notionally) evenly distributed across the range of int. For those, discarding the bottom bit would be fine.

  • Other algorithms are not like this. One algorithm uses a simple global counter. Another uses the object's memory address with the bottom 3 bits removed. If these algorithms are selected, discarding the LSB would increase the probability of hash collisions in IdentityHashMap.
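The counter case can be illustrated with a small experiment. This sketch feeds consecutive integers (standing in for counter-style identity hashes) into two index functions: dropLsb, a hypothetical alternative that simply clears the lowest bit, and jdkStyle, the expression from IdentityHashMap.hash(). The class and method names are invented for the example, and the table length is assumed to be a power of two.

```java
import java.util.HashSet;
import java.util.Set;

class LsbDropDemo {
    // Hypothetical alternative: force an even index by clearing the lowest bit.
    static int dropLsb(int h, int length) {
        return h & (length - 1) & ~1;
    }

    // The expression used in IdentityHashMap.hash().
    static int jdkStyle(int h, int length) {
        return ((h << 1) - (h << 8)) & (length - 1);
    }

    // Number of distinct key slots hit by the hash codes start .. start+n-1.
    static int distinctSlots(boolean useJdkStyle, int start, int n, int length) {
        Set<Integer> slots = new HashSet<>();
        for (int h = start; h < start + n; h++) {
            slots.add(useJdkStyle ? jdkStyle(h, length) : dropLsb(h, length));
        }
        return slots.size();
    }

    public static void main(String[] args) {
        int length = 64; // 32 key slots
        System.out.println("dropLsb:  " + distinctSlots(false, 1, 32, length)); // 17
        System.out.println("jdkStyle: " + distinctSlots(true, 1, 32, length));  // 32
    }
}
```

With 32 consecutive hash codes and 32 key slots, clearing the bottom bit maps neighbouring codes (2 and 3, 4 and 5, and so on) to the same slot, while the shift-based expression gives each code its own slot.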

See https://shipilev.net/jvm/anatomy-quarks/26-identity-hash-code/ for more information on the identity hash code algorithms and how they are selected. Note that this aspect of JVM behavior is unspecified and liable to be version specific.

My hunch about what's going on here is that it's designed to address two issues.

First, the slot index this function produces must be an even number. (The implementation stores keys at even table slots and values at odd table slots.) This means that whatever index is returned must have its last bit equal to zero.

Second, the identity hash codes used are (potentially) based on memory addresses, and the low bits of memory addresses are "more random" than the high bits. For example, if we allocate a list of objects and the allocator places them all consecutively in memory, their addresses will all have the same high bits but different low bits. (Or perhaps there's just a global counter of objects that's incremented when an object is created. In that case, the low bits of object hashes will similarly have a wider dispersion than the high bits.)

To make sure things are spread out in the table, we'd therefore like to "mix" the low bits of the hash code with the "high" bits of the hash code. The effect of subtracting out h << 8 is to shift the low bits of the identity hash code higher up, flip them, and add them back to the hash code, causing a bunch of "ripples" as the addition plays out. I think (?) this is an effective way to inject the higher-entropy low bits into the high bits, giving a more uniform hash over the array of slots as the table grows larger and larger.
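A quick way to see the spreading effect is to compare the indices for a few small consecutive hash codes in a larger table. This is only an illustration under assumed inputs: doubled is a hypothetical baseline that just shifts left by one, and the sketch does not measure whether the wider spread improves probing performance in practice.

```java
class SpreadDemo {
    // Hypothetical baseline: make the index even with a plain left shift.
    static int doubled(int h, int length) {
        return (h << 1) & (length - 1);
    }

    // The expression used in IdentityHashMap.hash().
    static int jdkStyle(int h, int length) {
        return ((h << 1) - (h << 8)) & (length - 1);
    }

    public static void main(String[] args) {
        int length = 1024; // assumed power-of-two table length
        for (int h = 1; h <= 4; h++) {
            System.out.println("h=" + h
                    + " doubled=" + doubled(h, length)
                    + " jdkStyle=" + jdkStyle(h, length));
        }
        // doubled:  2, 4, 6, 8        (adjacent key slots)
        // jdkStyle: 770, 516, 262, 8  (254 apart, wrapping modulo 1024)
    }
}
```

Consecutive hash codes land 254 slots apart rather than in neighbouring key slots, because the low bits of h now influence bits 8 and above of the index.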
