Can I use GetHashCode() for all string compares?
I want to cache some search results based on the object to search and some search settings.
However, this creates quite a long cache key, so I thought I'd create a shortcut for it using GetHashCode().
So I was wondering: does GetHashCode() always generate a different number for different strings, even when the strings are very long or differ only by something like 'ä' instead of 'a'?
I tried some strings and the answer seemed to be yes, but since I don't understand how GetHashCode() behaves, I can't be confident that I am right.
And because this is one of those things that will pop up when you're not prepared (the client is looking at cached results for the wrong search), I want to be sure...
EDIT: if MD5 would work, I can of course change my code not to use GetHashCode; the goal is to get a shorter string than the original (> 1000 chars).
GetHashCode() being unique.
There is an excellent article that investigates the likelihood of collisions at http://kenneththorman.blogspot.com/2010/09/c-net-equals-and-gethashcode.html . The findings were that "The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations."
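To get a feel for numbers like these, here is a minimal sketch of a collision search. It is not the linked article's code: the class and method names are made up for illustration, and using GUID strings as random input is my own assumption. The number of attempts varies from run to run, and on runtimes that randomize string hash codes per process the colliding pair will differ as well.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not taken from the linked article.
public static class CollisionSearch
{
    // Generates random strings until two different strings share a hash code.
    // Returns both strings and the number of strings generated so far.
    public static (string First, string Second, int Attempts) FindCollision()
    {
        var seen = new Dictionary<int, string>(); // hash code -> first string that produced it
        for (int attempts = 1; ; attempts++)
        {
            string candidate = Guid.NewGuid().ToString();
            int hash = candidate.GetHashCode();
            if (seen.TryGetValue(hash, out string earlier) && earlier != candidate)
            {
                return (earlier, candidate, attempts);
            }
            seen[hash] = candidate;
        }
    }

    public static void Main()
    {
        var (first, second, attempts) = FindCollision();
        Console.WriteLine($"Collision after {attempts} strings:");
        Console.WriteLine($"  {first} -> {first.GetHashCode()}");
        Console.WriteLine($"  {second} -> {second.GetHashCode()}");
    }
}
```

With only 2^32 possible hash codes, a collision among random strings typically shows up after tens of thousands of attempts, which is why a hash code alone is not a safe cache key.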
So that you can understand the contract for GetHashCode implementations, the following is an excerpt from the MSDN documentation for Object.GetHashCode():
A hash function must have the following properties:
If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two objects do not have to return different values.
The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.
For the best performance, a hash function must generate a random distribution for all input.
Eric Lippert of the C# compiler team explains the rationale for the GetHashCode implementation rules on his blog at http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/ .
Logically, GetHashCode cannot be unique, since there are only 2^32 ints and an infinite number of strings (see the pigeonhole principle).
As @Henk pointed out in the comments, even though there are an infinite number of strings, there are only a finite number of System.String values. However, the pigeonhole principle still applies, as the latter number is much larger than int.MaxValue.
If one were to store the hash code of each string along with the string itself, one could compare the hash codes of strings as a "first step" to comparing them for equality. If two strings have different hash codes, they're not equal, and one needn't bother doing anything else. If one expects to be comparing many pairs of strings which are of the same length, and which are "almost" but not quite equal, checking the hash codes before checking the content may be a useful performance optimization. Note that this "optimization" would not be worthwhile if one did not have cached hash codes, since computing the hash codes of two strings would almost certainly be slower than comparing them. If, however, one has had to compute and cache the hash codes for some other purpose, checking hash codes as a first step to comparing strings may be useful.
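A minimal sketch of that "first step" idea, assuming a hypothetical HashedString wrapper (the type and its members are my own names, not part of the framework). The hash code is computed once in the constructor, and the full comparison runs only when the cached hash codes match.

```csharp
using System;

// Hypothetical wrapper: keeps a string together with its hash code, computed once.
public sealed class HashedString
{
    public string Value { get; }
    public int Hash { get; }

    public HashedString(string value)
    {
        Value = value ?? throw new ArgumentNullException(nameof(value));
        Hash = value.GetHashCode(); // cached, so later comparisons do not recompute it
    }

    public bool ContentEquals(HashedString other)
    {
        // Different hash codes prove the strings differ: skip the full comparison.
        if (Hash != other.Hash) return false;
        // Equal hash codes prove nothing: the content still has to be compared.
        return string.Equals(Value, other.Value, StringComparison.Ordinal);
    }
}

public static class HashedStringDemo
{
    public static void Main()
    {
        var a = new HashedString("search: a");
        var b = new HashedString("search: a");
        var c = new HashedString("search: ä");
        Console.WriteLine(a.ContentEquals(b)); // True
        Console.WriteLine(a.ContentEquals(c)); // False
    }
}
```

The ordinal comparison is a deliberate choice here, because the default string hash code is consistent with ordinal equality. A culture-sensitive comparison would need a matching hash function.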
You always risk collisions when using GetHashCode() because you are operating within a limited number space, Int32, and this is exacerbated by the fact that hashing algorithms do not distribute values perfectly within this space.
If you look at the implementation of Hashtable or Dictionary, you will see that GetHashCode is used to assign the keys to buckets in order to cut down the number of comparisons required. The equality comparisons are still necessary, however, if there are multiple items in the same bucket.
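One way to see this for yourself is a sketch with a deliberately bad comparer that gives every string the same hash code (ConstantHashComparer and BucketDemo are hypothetical names for illustration). Every key lands in the same bucket, yet the dictionary still returns the right value, because it falls back to Equals to tell the keys apart.

```csharp
using System;
using System.Collections.Generic;

// Deliberately terrible comparer: every string collides with every other string.
public sealed class ConstantHashComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) => string.Equals(x, y, StringComparison.Ordinal);

    public int GetHashCode(string obj) => 42;
}

public static class BucketDemo
{
    public static Dictionary<string, string> BuildCache()
    {
        return new Dictionary<string, string>(new ConstantHashComparer())
        {
            ["search: a"] = "results for a",
            ["search: ä"] = "results for ä",
        };
    }

    public static void Main()
    {
        var cache = BuildCache();
        Console.WriteLine(cache["search: a"]); // results for a
        Console.WriteLine(cache["search: ä"]); // results for ä
        Console.WriteLine(cache.Count);        // 2
    }
}
```

The lookups stay correct but degrade to a linear scan of the bucket, so the hash code affects only the speed of the dictionary, never its correctness.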
No. GetHashCode just provides a hash code. There will be collisions. Having different hashes means the strings are different, but having the same hash does not mean the strings are the same.
Read these guidelines by Eric Lippert on the correct use of GetHashCode; they are quite instructive.
If you want to compare strings, just do so! stringA == stringB works fine. If you want to ensure a string is unique in a large set, using the power of hash codes to do so, use a HashSet<string>.
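As for the shorter cache key mentioned in the question's EDIT, here is one possible sketch that derives the key from a cryptographic digest instead of GetHashCode(). I have used SHA-256 rather than the MD5 the question mentions; that swap is my own choice, and the CacheKeys class and ShortKey method are hypothetical names. Collisions are still possible in principle, but with a 256-bit digest they are vastly less likely than with a 32-bit hash code, and the result does not change between application runs.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration.
public static class CacheKeys
{
    // Turns an arbitrarily long key into a 44-character Base64 string
    // (32 digest bytes encode to 44 Base64 characters, including padding).
    public static string ShortKey(string longKey)
    {
        using (var sha256 = SHA256.Create())
        {
            byte[] digest = sha256.ComputeHash(Encoding.UTF8.GetBytes(longKey));
            return Convert.ToBase64String(digest);
        }
    }

    public static void Main()
    {
        string longKey = new string('x', 1000) + "a";
        Console.WriteLine(ShortKey(longKey));
        Console.WriteLine(ShortKey(longKey).Length); // 44
    }
}
```

If a wrong cache hit must be impossible rather than merely improbable, store the original long key alongside the cached result and compare it on lookup.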