
Consequences of hashcode overflow on Java String

I've been reading a bit about the Java String class's hashCode here recently, and I haven't been able to find this information: what happens when a string's length is greater than 32 (I know an overflow then happens, but as a hash key, what happens)? For example, I need to hash strings that are between 20 and 120 characters long to use them as hash keys. Do I need to implement my own algorithm using BigInteger?

Also, since I might have between 30k and 80k strings, maybe more, is the usual String hashCode collision-free enough?

(I know an overflow then happens, but as a hash key, what happens)?

In Java, arithmetic overflows and underflows of primitive integer types do not raise runtime errors or exceptions. The overflowed portion of the result is simply lost.

While this can result in logic errors or other difficulties if the programmer is not aware of this property, it is the specified behavior of the JVM.

You do not need to worry about overflow or underflow of int types when calculating hash codes. The overflowed bits are simply lost.

This does not affect the correctness of the computed hash value or how well it distributes keys across hash buckets.
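As a minimal sketch of the point above, the following recomputes the polynomial documented in the String.hashCode() Javadoc (s[0]*31^(n-1) + ... + s[n-1]) using plain int arithmetic and compares it with the built-in result. The class name HashOverflowDemo and the method manualHash are hypothetical names chosen for illustration; this is not the JDK's actual source.

```java
// Illustrative sketch; the class and method names are hypothetical.
class HashOverflowDemo {

    // Recomputes the polynomial documented for String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], in wrapping int arithmetic.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // wraps modulo 2^32 on overflow, no exception
        }
        return h;
    }

    public static void main(String[] args) {
        // int overflow wraps around instead of throwing.
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // true

        // A key longer than 32 characters: the loop overflows many times,
        // yet the result still matches the built-in hashCode().
        String longKey = "The quick brown fox jumps over the lazy dog";
        System.out.println(manualHash(longKey) == longKey.hashCode()); // true
    }
}
```

The wrapped value is still a deterministic function of the characters, which is all a hash key needs, so BigInteger is not required.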

Also, since I might have between 30k and 80k strings, maybe more, is the usual String hashCode collision-free enough?

A couple of things that can be handy to keep in mind:

  • Java Strings are immutable. For this reason, the hash value of a String instance is calculated only once. After that, the result is cached in the instance so that subsequent invocations of hashCode() do not result in repeated computations. This works because Strings are immutable, so recomputing the value would give the same result every time.

  • The hash code really should be computed from all the meaningful information in an instance. This means that if your String contains 20k of information, the hash code should be computed from all 20k of it (but see above). Of course, there are performance implications, so you should design your program accordingly.

  • Collision 'free'-ness has much, much more to do with the quality of your hashCode() implementation and less to do with the size of your Strings. Algorithms used to generate hash codes should be capable of producing good distributions. Exactly what makes a "good hash function" is not precisely known and is a subject for mathematical theorists. Fortunately, it is not hard to define a hash function that is "good enough" even if it may not be "state of the art" (see Effective Java, 2nd ed.; J. Bloch).
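To put rough numbers on the collision question for 30k to 80k keys, here is a back-of-the-envelope sketch. It assumes an idealized hash that spreads keys uniformly over all 2^32 int values, which String.hashCode() only approximates, so real data may collide somewhat more often. The class name CollisionEstimate and its method are hypothetical.

```java
import java.util.Locale;

// Illustrative sketch; the class and method names are hypothetical.
class CollisionEstimate {

    // Expected number of colliding pairs among n keys, ASSUMING an ideal
    // hash that is uniform over 2^32 values: n(n-1)/2 pairs, each colliding
    // with probability 1/2^32.
    static double expectedCollidingPairs(long n) {
        double pairs = n * (n - 1) / 2.0;
        return pairs / (double) (1L << 32);
    }

    public static void main(String[] args) {
        for (long n : new long[] {30_000L, 80_000L}) {
            // Roughly 0.10 for 30,000 keys and 0.75 for 80,000 keys.
            System.out.printf(Locale.ROOT,
                    "%d keys: about %.2f expected colliding pairs%n",
                    n, expectedCollidingPairs(n));
        }
    }
}
```

Under that assumption, fewer than one colliding pair is expected across the whole key set, and a hash table resolves any such collision by comparing keys with equals().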

You are misunderstanding what hashCode() does. It calculates a 32-bit number that should be different for different values, but is not guaranteed to be so. How could it be, when there may be more than 2^32 different values to hash?

For a String, the hashCode has nothing to do with the string length. Any hashCode is a valid hashCode for any string, as long as you always get the same hashCode for the same String, i.e. calling hashCode() multiple times for the same sequence of characters must return the same value.
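A small sketch of why equal hash codes for different strings are harmless: "Aa" and "BB" are a well-known pair of strings with the same hashCode(), yet a HashMap keeps them apart because it also compares keys with equals(). The class name CollidingKeysDemo and the method buildMap are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; the class and method names are hypothetical.
class CollidingKeysDemo {

    // Builds a map whose two keys share the same hash code.
    static Map<String, Integer> buildMap() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);
        map.put("BB", 2);
        return map;
    }

    public static void main(String[] args) {
        // Both are 2112: 65*31 + 97 for "Aa" and 66*31 + 66 for "BB".
        System.out.println("Aa".hashCode() + " " + "BB".hashCode()); // 2112 2112

        // The colliding keys remain distinct entries.
        Map<String, Integer> map = buildMap();
        System.out.println(map.get("Aa")); // 1
        System.out.println(map.get("BB")); // 2
        System.out.println(map.size());    // 2
    }
}
```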

As an example, here are some hash codes for strings.

0x00000000 = "".hashCode()
0x00000061 = "a".hashCode()
0x00000041 = "A".hashCode()
0x042628b2 = "Hello".hashCode()
0x6f8f80f1 = "Goodbye".hashCode()
0xdbacdd53 = "The quick brown fox jumps over the lazy dog".hashCode()
0x99eecd2e = "The quick brown fox jumps over the lazy dog!".hashCode()

Notice that the last two are long strings (more than 32 characters).
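A listing in this format can be reproduced with a few lines of Java. This is only a sketch of one way to do it, not necessarily how the values above were generated; the class name HashListing and the method hex are hypothetical.

```java
// Illustrative sketch; the class and method names are hypothetical.
class HashListing {

    // Formats a string's hash code as eight hex digits with a 0x prefix.
    // %08x prints the int's 32-bit two's-complement pattern, so negative
    // hash codes appear as values starting with 8 through f.
    static String hex(String s) {
        return String.format("0x%08x", s.hashCode());
    }

    public static void main(String[] args) {
        String[] samples = {
            "", "a", "A", "Hello",
            "The quick brown fox jumps over the lazy dog"
        };
        for (String s : samples) {
            System.out.println(hex(s) + " = \"" + s + "\".hashCode()");
        }
    }
}
```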

There is no overflow on Strings. Strings can be as long as your process's memory can hold. The hashCode of any String is a 32-bit integer. The collision frequency should not have a correlation with the String's length. You don't need to reimplement it.
