
Why does C# not implement GetHashCode for Collections?

I am porting something from Java to C#. In Java, the hashcode of an ArrayList depends on the items in it. In C#, I always get the same hashcode from a List...

Why is this?

For some of my objects the hashcode needs to be different, because the objects in their list property make the objects non-equal. I would expect that a hashcode is always unique for the object's state and only equals another hashcode when the objects are equal. Am I wrong?

In order to work correctly, hashcodes must be immutable: an object's hash code must never change.

If an object's hashcode does change, any dictionaries containing the object will stop working.

Since collections are not immutable, they cannot implement GetHashCode.
Instead, they inherit the default GetHashCode, which returns a (hopefully) unique value for each instance of an object. (Typically based on a memory address.)
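To see the inherited behaviour in practice, here is a minimal sketch. The class and method names (ListIdentityDemo and so on) are made up for illustration; the only framework facts relied on are that List<T> overrides neither Equals nor GetHashCode.

```csharp
using System.Collections.Generic;

public static class ListIdentityDemo
{
    // Returns true when the list's hash code is unaffected by adding items.
    public static bool HashSurvivesMutation()
    {
        var list = new List<int>();
        int before = list.GetHashCode();   // inherited from object, so identity-based
        list.Add(1);
        list.Add(2);
        return before == list.GetHashCode();
    }

    // Returns true when two lists with identical contents are still not Equal.
    public static bool SameContentsAreNotEqual()
    {
        var a = new List<int> { 1, 2, 3 };
        var b = new List<int> { 1, 2, 3 };
        return !a.Equals(b);               // List<T> does not override Equals
    }
}
```

Both methods return true: the hash follows the instance, not the contents.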

Yes, you are wrong. In both Java and C#, being equal implies having the same hash-code, but the converse is not (necessarily) true.

See GetHashCode for more information.

Hashcodes must depend upon the definition of equality being used, so that if A == B then A.GetHashCode() == B.GetHashCode() (but not necessarily the converse; A.GetHashCode() == B.GetHashCode() does not entail A == B).
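A small sketch of that one-way implication, using a hypothetical Point class with a deliberately weak hash. Equal points always share a hash, but (1, 2) and (2, 1) share one too without being equal.

```csharp
using System;

// Hypothetical type for illustration only.
public sealed class Point : IEquatable<Point>
{
    public int X { get; }
    public int Y { get; }

    public Point(int x, int y) { X = x; Y = y; }

    public bool Equals(Point other) => other != null && X == other.X && Y == other.Y;

    public override bool Equals(object obj) => Equals(obj as Point);

    // Deliberately weak: XOR is symmetric, so (1, 2) and (2, 1) collide.
    // The contract is still honoured, because equal points get equal hashes.
    public override int GetHashCode() => X ^ Y;
}
```

A hash-based collection copes with such collisions by falling back to Equals; they cost speed, not correctness.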

By default, the equality definition of a value type is based on its value, and that of a reference type is based on its identity (that is, by default an instance of a reference type is only equal to itself). Hence the default hashcode for a value type depends on the values of the fields it contains*, and for reference types it depends on the identity. Indeed, since we ideally want the hashcodes for non-equal objects to be different, particularly in the low-order bits (most likely to affect the value of a re-hashing), we generally want two equivalent but non-equal objects to have different hashes.

Since an object will remain equal to itself, it should also be clear that this default implementation of GetHashCode() will continue to have the same value, even when the object is mutated (identity does not mutate even for a mutable object).

Now, in some cases reference types (or value types) re-define equality. An example of this is string, where for example "ABC" == "AB" + "C". Though there are two different instances of string compared, they are considered equal. In this case GetHashCode() must be overridden so that the value relates to the state upon which equality is defined (in this case, the sequence of characters contained).
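A sketch of the string case. One caveat on the expression as literally written: the C# compiler folds the concatenation of two constants at compile time, so to get two genuinely separate instances the second string is built at run time here. StringEqualityDemo is a hypothetical name.

```csharp
public static class StringEqualityDemo
{
    // Returns true when two separate string instances compare equal and hash equal.
    public static bool DistinctInstancesAreEqual()
    {
        string literal = "ABC";
        // Built at run time so the compiler cannot fold it into the same literal.
        string built = new string(new[] { 'A', 'B', 'C' });

        bool differentObjects = !object.ReferenceEquals(literal, built);
        bool equalValues = literal == built;                        // string overloads ==
        bool equalHashes = literal.GetHashCode() == built.GetHashCode();
        return differentObjects && equalValues && equalHashes;
    }
}
```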

While it is more common to do this with types that are also immutable, for a variety of reasons, GetHashCode() does not depend upon immutability. Rather, GetHashCode() must remain consistent in the face of mutability: change a value that we use in determining the hash, and the hash must change accordingly. Note though, that this is a problem if we are using this mutable object as a key into a structure using the hash, as mutating the object changes the position in which it should be stored, without moving it to that position (it's also true of any other case where the position of an object within a collection depends on its value, e.g. if we sort a list and then mutate one of the items in the list, the list is no longer sorted). However, this doesn't mean that we must only use immutable objects in dictionaries and hashsets. Rather it means that we must not mutate an object that is in such a structure, and making it immutable is a clear way to guarantee this.
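To make the failure mode concrete, here is a minimal sketch with a hypothetical MutableKey class whose hash depends on a settable property. After the mutation the set still holds the object, but neither the mutated key nor a fresh key with the old value can find it.

```csharp
using System.Collections.Generic;

// Hypothetical key type: equality and hash both depend on mutable state.
public sealed class MutableKey
{
    public int Id { get; set; }

    public override bool Equals(object obj) => obj is MutableKey other && other.Id == Id;

    public override int GetHashCode() => Id;
}

public static class MutatedKeyDemo
{
    public static (bool before, bool after, bool byOldValue, int count) Run()
    {
        var key = new MutableKey { Id = 1 };
        var set = new HashSet<MutableKey> { key };

        bool before = set.Contains(key);        // found: stored under hash 1

        key.Id = 2;                             // mutate while it is in the set

        bool after = set.Contains(key);         // lookup uses hash 2, entry is filed under 1
        bool byOldValue = set.Contains(new MutableKey { Id = 1 }); // hash matches, Equals no longer does

        return (before, after, byOldValue, set.Count);
    }
}
```

Run() gives (true, false, false, 1): the entry is still counted but has become unreachable until the key is restored to its old state.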

Indeed, there are quite a few cases where storing mutable objects in such structures is desirable, and as long as we don't mutate them during this time, this is fine. Since we don't have the guarantee immutability brings, we then want to provide it another way (spending a short time in the collection and being accessible from only one thread, for example).

Hence mutability of key values is one of those cases where something is possible, but generally a bad idea. To the person defining the hashcode algorithm though, it's not for them to assume any such case will always be a bad idea (they don't even know whether the mutation happened while the object was stored in such a structure); it's for them to implement a hashcode defined on the current state of the object, whether calling it at a given point is good or not. Hence, for example, a hashcode should not be memoised on a mutable object unless the memoisation is cleared on every mutation. (It's generally a waste to memoise hashes anyway, as structures that hit the same object's hashcode repeatedly will have their own memoisation of it.)
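If a hash is memoised on a mutable object anyway, the point about clearing it could look something like this. Label is a hypothetical class, and the nullable field is just one way to mark the cache as stale.

```csharp
// Hypothetical mutable type that caches its hash and drops the cache on every change.
public sealed class Label
{
    private string _text = "";
    private int? _cachedHash;          // null means "not computed yet"

    public string Text
    {
        get => _text;
        set
        {
            _text = value ?? "";
            _cachedHash = null;        // every mutation must invalidate the cached hash
        }
    }

    public override bool Equals(object obj) => obj is Label other && other._text == _text;

    public override int GetHashCode()
    {
        if (_cachedHash == null)
            _cachedHash = _text.GetHashCode();
        return _cachedHash.Value;
    }
}
```

Without the reset in the setter, two labels with the same text could report different hashes, which breaks the rule that equal objects hash equally.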

Now, in the case in hand, ArrayList operates on the default case of equality being based on identity, e.g.:

ArrayList a = new ArrayList();
ArrayList b = new ArrayList();
for(int i = 0; i != 10; ++i)
{
  a.Add(i);
  b.Add(i);
}
return a == b;//returns false

Now, this is actually a good thing. Why? Well, how do you know in the above that we want to consider a as equal to b? We might, but there are plenty of good reasons for not doing so in other cases too.

What's more, it's much easier to redefine equality from identity-based to value-based than from value-based to identity-based. Finally, there is more than one value-based definition of equality for many objects (the classic case being the different views on what makes a string equal), so there isn't even a one-and-only definition that works. For example:

ArrayList c = new ArrayList();
for(short i = 0; i != 10; ++i)
{
  c.Add(i);
}

If we considered a == b above, should we consider a == c also? The answer depends on just what we care about in the definition of equality we are using, so the framework couldn't know what the right answer is for all cases, since the cases don't all agree.
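The two possible answers can be sketched as two comparison helpers. ElementEqualityDemo and its method names are made up for illustration, and the "loose" version assumes every element is an integral value that Convert.ToInt64 accepts.

```csharp
using System;
using System.Collections;

public static class ElementEqualityDemo
{
    // Strict: elements must have the same runtime type and the same value,
    // so a boxed int 1 and a boxed short 1 do not match.
    public static bool StrictEquals(ArrayList x, ArrayList y)
    {
        if (x.Count != y.Count) return false;
        for (int i = 0; i < x.Count; i++)
            if (!object.Equals(x[i], y[i])) return false;
        return true;
    }

    // Loose: elements are compared after conversion to long.
    // Assumption: all elements are integral types accepted by Convert.ToInt64.
    public static bool NumericEquals(ArrayList x, ArrayList y)
    {
        if (x.Count != y.Count) return false;
        for (int i = 0; i < x.Count; i++)
            if (Convert.ToInt64(x[i]) != Convert.ToInt64(y[i])) return false;
        return true;
    }
}
```

With a holding ints and c holding shorts as above, StrictEquals(a, c) is false and NumericEquals(a, c) is true. Neither is wrong; they answer different questions.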

Now, if we do care about value-based equality in a given case we have two very easy options. The first is to subclass and override equality:

using System;
using System.Collections;

public class ValueEqualList : ArrayList, IEquatable<ValueEqualList>
{
  /*.. most methods left out ..*/
  public bool Equals(ValueEqualList other)//optional but a good idea almost always when we redefine equality
  {
    if(other == null)
      return false;
    if(ReferenceEquals(this, other))//identity still entails equality, so this is a good shortcut
      return true;
    if(Count != other.Count)
      return false;
    for(int i = 0; i != Count; ++i)
      if(!object.Equals(this[i], other[i]))//the indexer returns object, so != would compare references and separately boxed values would never match
        return false;
    return true;
  }
  public override bool Equals(object other)
  {
    return Equals(other as ValueEqualList);
  }
  public override int GetHashCode()
  {
    unchecked//let the arithmetic wrap around rather than throw in a checked build
    {
      int res = 0x2D2816FE;
      foreach(var item in this)
      {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
      }
      return res;
    }
  }
}

This assumes that we will always want to treat such lists this way. We can also implement an IEqualityComparer for a given case:

using System.Collections;
using System.Collections.Generic;

public class ArrayListEqComp : IEqualityComparer<ArrayList>
{//we might also implement the non-generic IEqualityComparer, omitted for brevity
  public bool Equals(ArrayList x, ArrayList y)
  {
    if(ReferenceEquals(x, y))
      return true;
    if(x == null || y == null || x.Count != y.Count)
      return false;
    for(int i = 0; i != x.Count; ++i)
      if(!object.Equals(x[i], y[i]))//compare values; != on object would compare references
        return false;
    return true;
  }
  public int GetHashCode(ArrayList obj)
  {
    unchecked//let the arithmetic wrap around rather than throw in a checked build
    {
      int res = 0x2D2816FE;
      foreach(var item in obj)
      {
        res = res * 31 + (item == null ? 0 : item.GetHashCode());
      }
      return res;
    }
  }
}
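The same comparer idea carries over to the generic List<T> from the question. The following is a sketch under that assumption, with hypothetical names (ListContentComparer, ListKeyDemo); it leans on Enumerable.SequenceEqual for the element-wise comparison and shows the comparer being handed to a Dictionary.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two lists are equal when they hold equal items in the same order.
public sealed class ListContentComparer<T> : IEqualityComparer<List<T>>
{
    public bool Equals(List<T> x, List<T> y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.SequenceEqual(y);   // element-wise and order-sensitive
    }

    public int GetHashCode(List<T> obj)
    {
        if (obj == null) return 0;
        unchecked
        {
            int hash = 17;
            foreach (T item in obj)
                hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
            return hash;
        }
    }
}

public static class ListKeyDemo
{
    public static string Lookup()
    {
        var map = new Dictionary<List<int>, string>(new ListContentComparer<int>());
        map[new List<int> { 1, 2, 3 }] = "found";

        // A different List instance with the same contents finds the entry.
        return map.TryGetValue(new List<int> { 1, 2, 3 }, out var value) ? value : "missing";
    }
}
```

Lookup() returns "found". The usual caveat applies: a list used as a key this way must not be changed while it is in the dictionary.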

In summary:

  1. The default equality definition of a reference type is dependent upon identity alone.
  2. Most of the time, we want that.
  3. When the person defining the class decides that this isn't what is wanted, they can override this behaviour.
  4. When the person using the class wants a different definition of equality again, they can use IEqualityComparer<T> and IEqualityComparer so that their dictionaries, hashmaps, hashsets, etc. use their concept of equality.
  5. It's disastrous to mutate an object while it is the key to a hash-based structure. Immutability can be used to ensure this doesn't happen, but is not compulsory, nor always desirable.

All in all, the framework gives us nice defaults and detailed override possibilities.

*There is a bug in the case of a decimal within a struct. A short-cut is used for structs in some cases where it is safe and not in others; a struct containing a decimal is one case where the short-cut is not safe, but it is incorrectly identified as a case where it is safe.

It is not possible for a hashcode to be unique across all variations of most non-trivial classes. In C# the concept of List equality is not the same as in Java (see here), so the hash code implementation is also not the same: it mirrors the C# List equality.

The core reasons are performance and human nature: people tend to think of hashes as something fast, but computing one normally requires traversing all elements of an object at least once.

Example: If you use a string as a key in a hash table, every query has complexity O(|s|). Use strings twice as long and it will cost you at least twice as much. Imagine that it was a full-blown tree (just a list of lists) - oops :-)

If full, deep hash calculation was a standard operation on a collection, an enormous percentage of programmers would just use it unwittingly and then blame the framework and the virtual machine for being slow. For something as expensive as a full traversal it is crucial that the programmer be aware of the complexity. The only way to achieve that is to make sure that you have to write your own. It's a good deterrent as well :-)

Another reason is updating tactics. Calculating and updating a hash on the fly vs. doing the full calculation every time requires a judgement call depending on the concrete case in hand.
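As one illustration of the "update on the fly" tactic, here is a sketch of a hypothetical wrapper (HashTrackingBag is not a framework type) that keeps an order-independent content hash current on every Add and Remove, next to the full recomputation it avoids. It assumes the elements themselves are not mutated while stored.

```csharp
using System.Collections.Generic;

// Hypothetical wrapper: maintains a running, order-independent content hash.
public sealed class HashTrackingBag<T>
{
    private readonly List<T> _items = new List<T>();
    private int _hash;   // wrapping sum of the element hashes

    public int Count => _items.Count;

    // O(1): read the hash that was maintained incrementally.
    public int ContentHash => _hash;

    public void Add(T item)
    {
        _items.Add(item);
        unchecked { _hash += HashOf(item); }
    }

    public bool Remove(T item)
    {
        if (!_items.Remove(item)) return false;
        unchecked { _hash -= HashOf(item); }   // equal items have equal hashes
        return true;
    }

    // O(n): the full traversal, for comparison.
    public int RecomputeContentHash()
    {
        int hash = 0;
        unchecked
        {
            foreach (T item in _items) hash += HashOf(item);
        }
        return hash;
    }

    private static int HashOf(T item) => item == null ? 0 : item.GetHashCode();
}
```

The trade-off is a little work on every change in exchange for a constant-time hash, and the sum only works because it ignores order. An order-sensitive hash cannot be patched up this cheaply after a removal.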

Immutability is just an academic cop-out: people use hashes as a way of detecting a change faster (file hashes, for example) and also use hashes for complex structures which change all the time. Hashes have many more uses beyond the 101 basics. The key is again that what to use for a hash of a complex object has to be a judgement call on a case-by-case basis.

Using an object's address (actually a handle, so it doesn't change after GC) as a hash is actually the case where the hash value remains the same for an arbitrary mutable object :-) The reason C# does it is that it's cheap, and again it nudges people to calculate their own.

You're only partly wrong. You're definitely wrong when you think that equal hashcodes mean equal objects, but equal objects must have equal hashcodes, which means that if the hashcodes differ, so do the objects.

Why is too philosophical. Create a helper method (maybe an extension method) and calculate the hashcode as you like. Maybe XOR the elements' hashcodes.
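A rough sketch of such a helper, with made-up names (ContentHashExtensions, XorHash, SequenceHash). Note what plain XOR implies: order is ignored and a pair of equal items cancels out, so an order-sensitive variant is shown next to it.

```csharp
using System.Collections.Generic;

public static class ContentHashExtensions
{
    // Order-insensitive: any permutation of the same items gives the same result.
    // Caveat: two equal items cancel each other out (h ^ h == 0).
    public static int XorHash<T>(this IEnumerable<T> items)
    {
        int hash = 0;
        foreach (T item in items)
            hash ^= item == null ? 0 : item.GetHashCode();
        return hash;
    }

    // Order-sensitive alternative: multiply-and-add over the sequence.
    public static int SequenceHash<T>(this IEnumerable<T> items)
    {
        unchecked
        {
            int hash = 17;
            foreach (T item in items)
                hash = hash * 31 + (item == null ? 0 : item.GetHashCode());
            return hash;
        }
    }
}
```

Which of the two fits depends on whether the equality you pair it with treats [1, 2] and [2, 1] as the same list.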
