Can object.GetHashCode() produce different results for the same objects (strings) on different machines?

Is it possible for one and the same object, particularly a string or any primitive or very simple type (like a struct), to produce different values from the .GetHashCode() method when invoked on different machines?

For instance, is it possible for the expression "Hello World".GetHashCode() to produce a different value on a different machine? I am primarily asking about C#/.NET, but I suppose this might apply to Java or even other languages?

Edit:

As pointed out in the answers and comments below, I know that .GetHashCode() can be overridden, and that there is no guarantee about the result it produces across different versions of the framework. It is therefore important to clarify that I have simple types in mind (ones that cannot be inherited from, so GetHashCode() cannot be overridden) and that I am using the same version of the framework on all machines.

Short answer: Yes.

But short answers are no fun, are they?

When you are implementing GetHashCode() you have to make the following guarantee:

When GetHashCode() is called on another object that should be considered equal to this one, in this App Domain, the same value will be returned.

That's it. There are some things you really need to try to do (spread the bits around among non-equal objects as much as possible, but don't take so long about it that it outweighs all the benefits of hashing in the first place), and your code will suck if you don't do so, but it won't actually break. It will break if you don't go as far as that guarantee, because then, e.g.:

dict[myObj] = 3;
int x = dict[myObj];//KeyNotFoundException
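
A minimal sketch of that failure, assuming a hypothetical key type BrokenKey (not part of the original answer) whose GetHashCode() deliberately returns a different value on every call. The dictionary stores the entry under one hash code and looks it up under another, so the lookup misses even though the very same object is used as the key:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type that breaks the guarantee on purpose:
// every call to GetHashCode() returns a new value.
public sealed class BrokenKey
{
    private int calls;

    public override bool Equals(object obj)
    {
        return ReferenceEquals(this, obj);
    }

    public override int GetHashCode()
    {
        return ++calls; // differs between insertion and lookup
    }
}

public static class Program
{
    public static void Main()
    {
        var dict = new Dictionary<BrokenKey, int>();
        var myObj = new BrokenKey();
        dict[myObj] = 3; // stored under the hash code of the first call

        int x;
        // Looked up under a later, different hash code, so the entry is not found.
        bool found = dict.TryGetValue(myObj, out x);
        Console.WriteLine(found);
    }
}
```

TryGetValue is used here instead of the indexer only so that the sketch reports the miss rather than throwing KeyNotFoundException.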

Okay. If I'm implementing GetHashCode(), why might I go further than that, and why might I not?

First, why might I not?

Maybe it's a slightly different version of the assembly and I improved the algorithm (or at least attempted to) in between builds.

Maybe one is 32-bit and one is 64-bit and I was going nuts for efficiency and chose a different algorithm for each to make use of the different word sizes (this is not unheard of, especially when hashing objects like collections or strings).

Maybe some element I'm choosing to consider when deciding what constitutes "equal" objects itself varies from system to system in this sort of way.

Maybe I actually deliberately introduce a different seed with different builds to catch any case where a colleague is mistakenly depending upon my hash code! (I've heard MS do this with their implementation of string.GetHashCode(), but can't remember whether I heard that from a credible or credulous source.)

Mainly though, it'll be one of the first two reasons.

Now, why might I give such a guarantee?

Most likely if I do, it'll be by chance. If an element can be compared for equality on the basis of a single integer id alone, then that's what I'm going to use as my hash code. Anything else would be more work for a worse hash. I'm not likely to change this, so I might as well give the guarantee.

The other reason why I might is that I want that guarantee myself. There's nothing to say I can't provide it, just that I don't have to.
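
As a small sketch of the "by chance" case, assume a hypothetical Customer type (the name and its members are illustrative only) whose equality is defined by an integer id alone. Returning the id as the hash code is the least work, and as a side effect the value does not depend on the machine or the runtime:

```csharp
using System;

// Hypothetical type whose equality is defined by an integer id alone.
public sealed class Customer : IEquatable<Customer>
{
    public Customer(int id, string name) { Id = id; Name = name; }

    public int Id { get; }
    public string Name { get; }

    public bool Equals(Customer other)
    {
        return other != null && other.Id == Id;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Customer);
    }

    // The id is the whole identity, so it doubles as the hash code.
    public override int GetHashCode()
    {
        return Id;
    }
}

public static class Program
{
    public static void Main()
    {
        var a = new Customer(42, "Ann");
        var b = new Customer(42, "Ann (renamed)");
        Console.WriteLine(a.Equals(b));                        // equal: same id
        Console.WriteLine(a.GetHashCode() == b.GetHashCode()); // so same hash code
    }
}
```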


Okay, let's get to something practical. There are cases where you may want a machine-independent guarantee. There are cases where you may want the opposite, which I'll come to in a bit.

First, check your logic. Can you handle collisions? Good, then we'll begin.

If it's your own class, then implement it so as to provide such a guarantee, document it, and you're done.

If it's not your class, then implement IEqualityComparer<T> in such a way as to provide it. For example:

using System.Collections.Generic;

public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    if(obj == null)
      return 0;
    // Overflow is expected here, so don't throw even in a checked build.
    unchecked
    {
      int hash = obj.Length;
      for(int i = 0; i != obj.Length; ++i)
        hash = (hash << 5) - hash + obj[i]; // same as hash * 31 + obj[i]
      return hash;
    }
  }
}

Then use this instead of the built-in hash code.
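
A sketch of what "use this instead" can look like in practice: the hash-based collections in System.Collections.Generic accept an IEqualityComparer<T> in their constructors. The comparer is repeated so the snippet compiles on its own; the collection names and keys are illustrative only.

```csharp
using System;
using System.Collections.Generic;

// Repeated from above so that this snippet compiles on its own.
public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) { return x == y; }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            return 0;
        unchecked
        {
            int hash = obj.Length;
            for (int i = 0; i != obj.Length; ++i)
                hash = (hash << 5) - hash + obj[i];
            return hash;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var comparer = new ConsistentGuaranteedComparer();

        // Both collections now hash through the comparer, not string.GetHashCode().
        var counts = new Dictionary<string, int>(comparer);
        var seen = new HashSet<string>(comparer);
        counts["Hello World"] = 1;
        seen.Add("Hello World");

        // For "a": start at Length = 1, then (1 << 5) - 1 + 'a' = 31 + 97.
        Console.WriteLine(comparer.GetHashCode("a")); // 128
    }
}
```

Because the value depends only on the characters of the string, it is the same on any machine that runs this code, which is the property the built-in hash code does not promise.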

There's an interesting case where we may want the opposite. If I can control the set of strings you are hashing, then I can pick a bunch of strings with the same hash code. Your hash-based collection's performance will hit the worst case and be pretty atrocious. Chances are I can keep doing this faster than you can deal with it, so it can be a denial-of-service attack. There aren't many cases where this happens, but an important one is if you're handling XML documents I send and you can't just rule out some elements (a lot of formats allow for freedom of elements within them). Then the NameTable inside your parser will be hurt. In this case we create a new hash mechanism each time:

using System;
using System.Collections.Generic;

public class RandomComparer : IEqualityComparer<string>
{
  private int hashSeed = Environment.TickCount;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    if(obj == null)
      return 0;
    // Work in uint so that >> is a logical shift and overflow just wraps.
    unchecked
    {
      uint hash = (uint)(hashSeed + obj.Length);
      for(int i = 0; i != obj.Length; ++i)
        hash = (hash << 5) - hash + obj[i]; // parentheses needed: << binds looser than -
      hash += (hash << 15) ^ 0xffffcd7d;
      hash ^= hash >> 10;
      hash += hash << 3;
      hash ^= hash >> 6;
      hash += (hash << 2) + (hash << 14);
      return (int)(hash ^ (hash >> 16));
    }
  }
}

This will be consistent within a given use, but not consistent from use to use, so an attacker can't construct input that forces it to be DoSed. Incidentally, NameTable doesn't use an IEqualityComparer<T>, because it wants to deal with char arrays with indices and lengths without constructing a string unless necessary, but it does do something similar.
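
To make "consistent within a use, but not from use to use" concrete, here is a sketch built around a hypothetical SeededComparer. It applies the same mixing steps as RandomComparer but takes the seed as a constructor parameter, purely so the effect of the seed can be observed; the arithmetic is written with uint so that the right shifts are logical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variant of RandomComparer with an explicit seed (illustration only).
public class SeededComparer : IEqualityComparer<string>
{
    private readonly int hashSeed;

    public SeededComparer(int seed) { hashSeed = seed; }

    public bool Equals(string x, string y) { return x == y; }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            return 0;
        unchecked
        {
            uint hash = (uint)(hashSeed + obj.Length);
            for (int i = 0; i != obj.Length; ++i)
                hash = (hash << 5) - hash + obj[i];
            hash += (hash << 15) ^ 0xffffcd7d;
            hash ^= hash >> 10;
            hash += hash << 3;
            hash ^= hash >> 6;
            hash += (hash << 2) + (hash << 14);
            return (int)(hash ^ (hash >> 16));
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var a = new SeededComparer(1);
        var b = new SeededComparer(2);

        // Same instance, same string: the value is stable for the life of the comparer.
        Console.WriteLine(a.GetHashCode("Hello World") == a.GetHashCode("Hello World"));
        // Different seed: the value for the same string changes.
        Console.WriteLine(a.GetHashCode("Hello World") == b.GetHashCode("Hello World"));
    }
}
```

In real use the seed would not be passed in by the caller; the point is only that an attacker who does not know it cannot precompute a set of colliding strings.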

Incidentally, in Java the hash code for string is specified and won't change, but this may not be the case for other classes.

Edit: Having done some research into the overall quality of the approach taken in ConsistentGuaranteedComparer above, I'm no longer happy with having such algorithms in my answers; while it serves to describe the concept, it doesn't have as good a distribution as one might like. Of course, if one has already implemented such a thing, then one can't change it without breaking the guarantee, but I'd now recommend using this library of mine, written after said research, as follows:

public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash32();
  }
}

The one for RandomComparer above isn't as bad, but can also be improved:

public class RandomComparer : IEqualityComparer<string>
{
  private int hashSeed = Environment.TickCount;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash32(hashSeed);
  }
}

Or, to make it even harder to predict:

public class RandomComparer : IEqualityComparer<string>
{
  private long seed0 = Environment.TickCount;
  private long seed1 = DateTime.Now.Ticks;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash128(seed0, seed1).GetHashCode();
  }
}

It will produce a different result even on the same machine on different runs.

So it can basically be used (and actually is used) to check something during the current run of the program, but there is no sense in storing it to check something against it later, because the number you get is generated by the runtime.

EDIT

For the specific case of a string, it will produce the same result even on different machines, except when the machines have different architectures.
