
Can two strings of different length have the same hashcode?

Original post: 2016-10-06 04:53:15 | Tags: java, string, hashcode

Although I am aware that two different Strings can return the same hashcode, I have been unable to find anything about two strings of differing lengths doing so. Is this possible? If so, examples would be appreciated. This is using the Java hashCode function, in case that changes anything.

2 Answers

Hash codes are distributed over the space of an int. There are only 2^32 (about 4 billion) possible values for an int. There are far more possible strings than that, so by the pigeonhole principle there must exist multiple strings with the same hash code.

However, this alone does not prove that strings of different lengths can have the same hash code, as pointed out below. Java uses the formula s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] for hashing strings, where n is the length of the string and the arithmetic is done in int (so overflow wraps around). Knowing this, it is easy to construct strings of different lengths that have the same hash code:

Let String s1 = "\001!"; and String s2 = "@";. Then s1.length() != s2.length(), but s1.hashCode() == '\001' * 31 + '!' == 1 * 31 + 33 == 64 and s2.hashCode() == '@' == 64.
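The arithmetic above can be checked directly. The following is a minimal sketch: the class name StringHashDemo and the helper polyHash are made up for illustration, and polyHash is only a re-implementation of the documented formula (evaluated with Horner's rule), not the JDK's actual source.

```java
class StringHashDemo {
    // Hypothetical helper: evaluates s[0]*31^(n-1) + ... + s[n-1] with Horner's rule.
    // int overflow wraps around silently, which matches the documented behaviour.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s1 = "\001!"; // char with code 1, followed by '!' (code 33)
        String s2 = "@";     // single char with code 64

        System.out.println(s1.length() + " vs " + s2.length());     // prints "2 vs 1"
        System.out.println(s1.hashCode() + " vs " + s2.hashCode()); // prints "64 vs 64"
        System.out.println(polyHash(s1) == s1.hashCode());          // prints "true"
    }
}
```

Note that "\001" is an octal escape in a Java string literal, so s1 really contains two characters.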

However, let me say again that there are over 4 billion possible values of an int, so the probability that any two given strings collide is low. It is not as low as you might think, though, because of the Birthday Paradox: you have about a 50% chance of at least one collision after about 77K hashes. This assumes the hashes are randomly distributed, which really depends on your data; if you mostly deal with very short strings, you will see collisions more often. In any case, every data structure that uses hashing must either deal with collisions (e.g., a common way is to keep a linked list at each hash position) or accept some loss of information (e.g., in a Bloom filter).
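The 77K figure can be reproduced with the usual birthday approximation, p ≈ 1 - exp(-k(k-1)/(2N)), for k uniformly random values drawn from N = 2^32 possibilities. This is a rough sketch under that uniformity assumption; the class and method names are hypothetical.

```java
class BirthdayBound {
    // Approximate probability of at least one collision among k values
    // drawn uniformly at random from 2^32 possible hash codes.
    static double collisionProbability(long k) {
        double n = 4294967296.0;                 // 2^32
        double pairs = (double) k * (k - 1) / 2; // number of distinct pairs
        return 1.0 - Math.exp(-pairs / n);
    }

    public static void main(String[] args) {
        for (long k : new long[] {1_000, 10_000, 77_000, 200_000}) {
            System.out.printf("k = %d: p = %.4f%n", k, collisionProbability(k));
        }
    }
}
```

Real String hash codes are not uniformly distributed, so this is only a ballpark estimate, not a guarantee for any particular data set.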

Yes, this can happen.

Some rather trivial examples:

Initial zero-valued characters don't affect the hash code, so (for example) "foo", "\0foo", "\0\0foo", etc., all have the same hash code.
Each character just gets multiplied by 31 before the next character is added, so (for example) the two-character string new String(new char[] { 12, 13 }) has the same hash code as the single-character string new String(new char[] { 12 * 31 + 13 }). I selected 12 and 13 arbitrarily; the same works for any other values, as long as the analogue of 12 * 31 + 13 stays within the range of a two-byte unsigned integer, i.e. a char.
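Both constructions can be tried out with a short program. This is a sketch only: the class EasyCollisions and the helper merge are invented for illustration, and merge assumes a * 31 + b fits in a char (at most 65535).

```java
class EasyCollisions {
    // Hypothetical helper: folds two chars into the single char a*31 + b,
    // so the one-character result hashes like the two-character string.
    static String merge(char a, char b) {
        int code = a * 31 + b;
        if (code > Character.MAX_VALUE) {
            throw new IllegalArgumentException("a*31 + b does not fit in a char");
        }
        return String.valueOf((char) code);
    }

    public static void main(String[] args) {
        // Leading zero-valued characters contribute nothing to the sum.
        System.out.println("foo".hashCode() == "\0foo".hashCode());    // prints "true"
        System.out.println("foo".hashCode() == "\0\0foo".hashCode());  // prints "true"

        // Two characters collapse into one with the same hash code.
        String two = new String(new char[] { 12, 13 });
        String one = merge((char) 12, (char) 13);
        System.out.println(two.length() + " vs " + one.length());      // prints "2 vs 1"
        System.out.println(two.hashCode() == one.hashCode());          // prints "true"
    }
}
```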

But those are just some easy-to-construct examples. There are also plenty of pairs of strings that just happen to work out to have the same hash code, despite there being no obvious relationship between them.