What hashing function does Java use to implement the Hashtable class?

The book CLRS ("Introduction to Algorithms") describes several hashing functions, such as the division (mod) method and the multiplication method.
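
For reference, the two CLRS methods mentioned above can be sketched roughly as follows. This is textbook material rather than JDK code: the class and method names are hypothetical, and the constant 2654435769 is an assumed choice, the commonly cited approximation of (sqrt(5) - 1) / 2 scaled by 2^32.

```java
final class ClrsHashing {
    // Division method: h(k) = k mod m. floorMod keeps the result in [0, m)
    // even when k is negative.
    static int divisionHash(int k, int m) {
        return Math.floorMod(k, m);
    }

    // Multiplication method for a table of size m = 2^p (1 <= p <= 31) with
    // 32-bit words: multiply k by s = floor(A * 2^32), keep the low 32 bits
    // of the product, then take the top p bits of that word.
    static int multiplicationHash(int k, int p) {
        final long s = 2654435769L;      // assumed constant, A is about 0.618
        int lowWord = (int) (k * s);     // low 32 bits of k * s
        return lowWord >>> (32 - p);     // top p bits, a value in [0, 2^p)
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(123456, 701));      // prints 80
        System.out.println(multiplicationHash(123456, 14)); // prints 67
    }
}
```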

What hashing function does Java use to map the keys to slots?

I have seen a related question here, "Hashing function used in Java Language", but it doesn't answer this question, and I think the marked answer for that question is wrong. It says that hashCode() lets you supply your own hashing function for Hashtable, but I think that is wrong.

The integer returned by hashCode() is the real key for Hashtable; Hashtable then uses a hashing function to hash that integer. What that answer implies is that Java gives you a chance to give Hashtable a hashing function, but it does not: hashCode() gives the real key, not the hashing function.

So what exactly is the hashing function that Java uses?

When a key is added to or requested from a HashMap in OpenJDK, the flow of execution is the following:

  1. The key is transformed into a 32-bit value using the developer-defined hashCode() method.
  2. The 32-bit value is then transformed by a second hash function (of which Andrew's answer contains the source code) into an offset inside the hash table. This second hash function is provided by the implementation of HashMap and cannot be overridden by the developer.
  3. The corresponding entry of the hash table contains a reference to a linked list, or null if no key has been stored at that offset yet. If there are collisions (several keys with the same offset), the keys together with their values are simply collected in a singly linked list.

If the hash table size is chosen appropriately high, the number of collisions will be limited. Thus, a single lookup takes only constant time on average. This is called expected constant time. However, if an attacker controls the keys inserted into a hash table and knows the hash algorithm in use, they can provoke a lot of hash collisions and thereby force linear lookup time. This is why some hash table implementations have been changed recently to include a random element that makes it harder for an attacker to predict which keys will cause collisions.
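
To make the collision attack concrete, here is a small sketch that builds many distinct String keys with identical hash codes. It relies on the fact that "Aa" and "BB" have the same String.hashCode(), so any concatenation of those two blocks collides as well. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class CollidingKeys {
    // Builds 2^blocks distinct strings that all share one hash code.
    // String.hashCode() is a polynomial in 31, so appending equal-hash
    // blocks of equal length to equal-hash prefixes keeps the hashes equal.
    static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : keys) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            keys = next;
        }
        return keys;
    }

    public static void main(String[] args) {
        for (String key : collidingKeys(3)) {
            System.out.println(key + " -> " + key.hashCode());
        }
    }
}
```

All eight keys printed by main land in the same bucket of any table that derives the slot from hashCode() alone, which is exactly the situation the random element is meant to prevent.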

Some ASCII art:

key.hashCode()
     |
     | 32-bit value
     |                              hash table
     V                            +------------+    +----------------------+
HashMap.hash() --+                | reference  | -> | key1 | value1 | null |
                 |                |------------|    +----------------------+
                 | modulo size    | null       |
                 | = offset       |------------|    +---------------------+
                 +--------------> | reference  | -> | key2 | value2 | ref |
                                  |------------|    +---------------------+
                                  |    ....    |                       |
                                                      +----------------+
                                                      V
                                                    +----------------------+
                                                    | key3 | value3 | null |
                                                    +----------------------+
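
The flow in the diagram can be sketched as a toy chained hash table. This is only an illustration under simplifying assumptions, not the JDK implementation: ToyHashMap, spread and indexOf are hypothetical names, the table has a fixed size of 16 and never resizes, and the bit-mixing step is a stand-in for whatever HashMap.hash() does in a given Java version.

```java
import java.util.Objects;

final class ToyHashMap<K, V> {
    // One node of the singly linked list hanging off a table slot.
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    // Stand-in for the second hash function (an assumption of this sketch).
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // hashCode() -> second hash -> offset; null keys go to slot 0.
    private int indexOf(Object key) {
        int h = (key == null) ? 0 : spread(key.hashCode());
        return h & (table.length - 1);
    }

    V get(Object key) {
        for (Node<K, V> n = table[indexOf(key)]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {
                return n.value;
            }
        }
        return null;
    }

    // Returns the previous value, or null if the key was absent.
    V put(K key, V value) {
        int i = indexOf(key);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        table[i] = new Node<>(key, value, table[i]); // new node becomes the list head
        return null;
    }
}
```

For example, put("Aa", 1) followed by put("BB", 2) stores both entries in one list, because the two strings have the same hashCode(); get("Aa") then walks that list and compares keys with equals().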

According to HashMap's source (Java version < 8), every hashCode is hashed using the following method:

 /**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions.  This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

Every hashCode is hashed again to further reduce collisions (see the comments above).

HashMap also uses a method to determine the index for a hash code (Java version < 8). Since length is always a power of 2, it can use & instead of %:

/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}

The put method looks something like:

int hash = hash(key.hashCode());
int i = indexFor(hash, table.length);
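
As a quick check of the "& instead of %" remark, the sketch below compares indexFor with the modulo operators for a power-of-two length. The class name and sample values are made up; indexFor is copied from the snippet above.

```java
final class IndexForDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16; // must be a power of two for the mask trick to work
        int[] samples = {0, 1, 17, 2112, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int h : samples) {
            System.out.println(h
                    + ": mask=" + indexFor(h, length)
                    + " floorMod=" + Math.floorMod(h, length)
                    + " remainder=" + (h % length));
        }
    }
}
```

The mask always agrees with Math.floorMod, so it never yields a negative index. The plain % operator would return a negative remainder for negative hash codes (for example -1 % 16 is -1), which is one more reason to prefer the mask.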

The purpose of a hash code is to provide an integer representation for a given object. Equal objects must return the same hash code, but the value is not guaranteed to be unique: two unequal objects may share one. It makes sense, then, that Integer's hashCode method simply returns the value, because distinct Integer values then always get distinct hash codes.

Additional references:
HashMap for Java 8
HashMap for Java 11
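
A small check of Integer's behavior, plus a reminder that hash codes of other types can coincide: a Long has 64 bits but its hash code only 32, so some distinct values must share one. The class and helper names are hypothetical.

```java
final class HashCodeUniqueness {
    static boolean sameHashCode(Object a, Object b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value itself.
        System.out.println(Integer.valueOf(42).hashCode()); // prints 42

        // Long folds its two 32-bit halves together with XOR,
        // so 0L and -1L both hash to 0 although they are not equal.
        System.out.println(sameHashCode(0L, -1L));          // prints true
    }
}
```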

Hashing in general is divided into two steps: (a) computing the hash code and (b) compressing.

In step a., an integer corresponding to your key is generated. You can customize this in Java by overriding hashCode().

In step b., Java applies a compression technique to map the integer returned by step a. to a slot in the HashMap or Hashtable. This compression technique cannot be changed.
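
Step a. is the only part under your control. A minimal sketch of a key class that supplies its own hashCode() (and a matching equals()) might look like this; GridPoint is a made-up example class, and Objects.hash is just one convenient way to combine fields.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class GridPoint {
    final int x;
    final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GridPoint)) return false;
        GridPoint other = (GridPoint) o;
        return x == other.x && y == other.y;
    }

    // Step a.: the integer that the map will later compress into a slot.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        Map<GridPoint, String> labels = new HashMap<>();
        labels.put(new GridPoint(1, 2), "start");
        // A different but equal instance finds the same entry.
        System.out.println(labels.get(new GridPoint(1, 2))); // prints start
    }
}
```

How the map turns that integer into a slot (step b.) happens inside HashMap or Hashtable, and there is no hook to replace it.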

/**
 * Computes key.hashCode() and spreads (XORs) higher bits of hash
 * to lower.  Because the table uses power-of-two masking, sets of
 * hashes that vary only in bits above the current mask will
 * always collide. (Among known examples are sets of Float keys
 * holding consecutive whole numbers in small tables.)  So we
 * apply a transform that spreads the impact of higher bits
 * downward. There is a tradeoff between speed, utility, and
 * quality of bit-spreading. Because many common sets of hashes
 * are already reasonably distributed (so don't benefit from
 * spreading), and because we use trees to handle large sets of
 * collisions in bins, we just XOR some shifted bits in the
 * cheapest possible way to reduce systematic lossage, as well as
 * to incorporate impact of the highest bits that would otherwise
 * never be used in index calculations because of table bounds.
 */
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This is the hash function used by the HashMap class since Java 8.
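
To see what the XOR with the upper 16 bits buys, the sketch below feeds 16 synthetic hash codes that differ only in bits 16 to 19 into a 16-slot table, once with and once without the spreading step. The class name, the helper and the choice of input values are assumptions for illustration; spread repeats the expression from the method above.

```java
import java.util.HashSet;
import java.util.Set;

final class SpreadDemo {
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Counts how many different slots the 16 synthetic hash codes occupy.
    static int distinctSlots(int tableLength, boolean useSpread) {
        Set<Integer> slots = new HashSet<>();
        for (int i = 0; i < 16; i++) {
            int h = i << 16;                     // differs only above the mask
            int mixed = useSpread ? spread(h) : h;
            slots.add(mixed & (tableLength - 1));
        }
        return slots.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctSlots(16, false)); // prints 1
        System.out.println(distinctSlots(16, true));  // prints 16
    }
}
```

Without spreading, every key falls into slot 0 because the mask only looks at the low four bits. With spreading, the high bits are folded down and each key gets its own slot.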

I think there is some confusion about the concept here. A hash function maps a variable-size input to a fixed-size output (the hash value). In the case of Java objects the output is a 32-bit signed integer.

Java's Hashtable uses the hash value as an index into an array where the actual object is stored, taking modulo arithmetic and collisions into account. However, this is not hashing.
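
For Hashtable specifically, the modulo step in the OpenJDK sources I have looked at clears the sign bit of hashCode() and takes the remainder by the table length, with no extra bit mixing. The sketch below mirrors that expression; treat it as an assumption to verify against your JDK version. The class and method names are hypothetical, and 11 is the documented default initial capacity of Hashtable.

```java
final class HashtableIndexDemo {
    // Clear the sign bit so the remainder is never negative,
    // then reduce modulo the table length.
    static int hashtableIndex(Object key, int tableLength) {
        int hash = key.hashCode();
        return (hash & 0x7FFFFFFF) % tableLength;
    }

    public static void main(String[] args) {
        int defaultCapacity = 11;
        System.out.println(hashtableIndex("Aa", defaultCapacity)); // prints 0
        System.out.println(hashtableIndex(-1, defaultCapacity));   // prints 1
    }
}
```

In CLRS terms this is the division method applied to the value returned by hashCode().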

The java.util.HashMap implementation performs some additional bit mixing (shifts and XORs) on the hash value before indexing, to protect against excessive collisions in some cases. It is called "additional hash", but I don't think that is a correct term.

To put it very simply, the second hashing is nothing but finding the index in the bucket array where the new key-value pair will be stored. This mapping derives the index from the larger int value of the key object's hash code. If two unequal key objects have the same hash code, a collision happens because they map to the same array index. In this case the second key, along with its value, is added to the linked list. Here the array slot will point to the last node added.
