Optimal performance of Dictionary with custom Equals() and GetHashCode()

So I need to create a dictionary with keys that are objects with a custom Equals() function. I discovered I need to override GetHashCode() too. I heard that for optimal performance you should have hash codes that don't collide, but that seems counterintuitive. I might be misunderstanding it, but it seems the entire point of using hash codes is to group items into buckets, and if the hash codes never collide, each bucket will only have one item, which seems to defeat the purpose.

So should I intentionally make my hash codes collide occasionally? Performance is important. This will be a dictionary that will probably grow to several million items, and I'll be doing lookups very often.

The goal of a hash code is to give you an index into an array of buckets, each of which may contain zero, one, or more items. The performance of a lookup then depends on the number of elements in the bucket. The fewer the better, since once you're in the bucket, it's an O(n) search (where n is the number of elements in the bucket). Therefore, it's ideal if the hash code prevents collisions as much as possible, so that lookups stay at the optimal O(1) time as often as possible.

Dictionaries store data in buckets, but there isn't one bucket for each hash code. The number of buckets is based on the capacity. Values are put into buckets based on the hash code modulo the number of buckets.

Let's say you have a GetHashCode() method that produces these hash codes for five objects:

925
10641
14316
17213
28624

Hash codes should be spread out. So these look spread out, right? If we have 7 buckets, then we end up calculating each hash code modulo 7, which gives us:

1
1
1
0
1

So we end up with these buckets:

0 - 1 item
1 - 4 items
2 - 0 items
3 - 0 items
4 - 0 items
5 - 0 items
6 - 0 items

Oops, not so well spread out now.

This is not made-up data. These are actual hash codes.
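The arithmetic above can be reproduced with a few lines of C#. This is only a simplified model of the "hash code modulo bucket count" idea, not the actual internals of `Dictionary`; the names `BucketDemo`, `BucketIndex`, and `BucketSizes` are made up for this sketch.

```csharp
using System;
using System.Collections.Generic;

public static class BucketDemo
{
    // Maps a hash code to a bucket index: hash code modulo bucket count.
    // Clearing the sign bit keeps the index non-negative for negative hash codes.
    public static int BucketIndex(int hashCode, int bucketCount)
        => (hashCode & 0x7FFFFFFF) % bucketCount;

    // Counts how many of the given hash codes land in each bucket.
    public static int[] BucketSizes(IEnumerable<int> hashCodes, int bucketCount)
    {
        var sizes = new int[bucketCount];
        foreach (int hashCode in hashCodes)
            sizes[BucketIndex(hashCode, bucketCount)]++;
        return sizes;
    }
}
```

Calling `BucketDemo.BucketSizes(new[] { 925, 10641, 14316, 17213, 28624 }, 7)` gives one item in bucket 0, four in bucket 1, and none in the rest, matching the table above.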

Here's a sample of how to generate a hash code from contained data (not the formula used for the hash codes above, but a better one).

https://stackoverflow.com/a/263416/118703
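A common way to do this is to fold each field into a running value with a multiply-and-add step. The following is a minimal sketch of that pattern, assuming a hypothetical key type `GridKey` with three fields; the type, its fields, and the constants 17 and 23 are illustrative choices, not requirements.

```csharp
using System;

// Hypothetical key type for illustration; the fields are assumptions.
public sealed class GridKey : IEquatable<GridKey>
{
    public int Row { get; }
    public int Column { get; }
    public string Layer { get; }

    public GridKey(int row, int column, string layer)
    {
        Row = row;
        Column = column;
        Layer = layer ?? throw new ArgumentNullException(nameof(layer));
    }

    // Two keys are equal when all three fields match.
    public bool Equals(GridKey other) =>
        other != null
        && Row == other.Row
        && Column == other.Column
        && string.Equals(Layer, other.Layer, StringComparison.Ordinal);

    public override bool Equals(object obj) => Equals(obj as GridKey);

    // Uses exactly the fields that Equals compares, so equal keys
    // always produce equal hash codes.
    public override int GetHashCode()
    {
        unchecked // let arithmetic overflow wrap around instead of throwing
        {
            int hash = 17;
            hash = hash * 23 + Row;
            hash = hash * 23 + Column;
            hash = hash * 23 + Layer.GetHashCode();
            return hash;
        }
    }
}
```

The important design point is that GetHashCode() and Equals() look at the same fields. On runtimes that provide `System.HashCode`, `HashCode.Combine(Row, Column, Layer)` may be a simpler alternative to the hand-written arithmetic.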

You must ensure that the following holds:

(GetHashCode(a) != GetHashCode(b)) => !Equals(a, b)

The contrapositive is logically equivalent:

Equals(a, b) => (GetHashCode(a) == GetHashCode(b))

Apart from that, generate as few collisions as possible. A collision is defined as:

(GetHashCode(a) == GetHashCode(b)) && !Equals(a, b)

A collision affects performance, not correctness. For example, a GetHashCode that always returns zero would be correct, but slow.
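To see the "correct but slow" case concretely, here is a rough sketch that plugs a constant hash code into a dictionary through a custom comparer. `ConstantHashComparer` and `CollisionDemo` are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer for illustration: ordinary int equality,
// but every key gets the same hash code.
public sealed class ConstantHashComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y) => x == y;
    public int GetHashCode(int obj) => 0;
}

public static class CollisionDemo
{
    // Counts pairs that share a hash code but are not equal,
    // i.e. collisions as defined above.
    public static int CountCollisions(IReadOnlyList<int> keys, IEqualityComparer<int> comparer)
    {
        int collisions = 0;
        for (int i = 0; i < keys.Count; i++)
            for (int j = i + 1; j < keys.Count; j++)
                if (comparer.GetHashCode(keys[i]) == comparer.GetHashCode(keys[j])
                    && !comparer.Equals(keys[i], keys[j]))
                    collisions++;
        return collisions;
    }

    // Fills a dictionary with key -> key * key using the given comparer.
    public static Dictionary<int, int> Build(int count, IEqualityComparer<int> comparer)
    {
        var map = new Dictionary<int, int>(comparer);
        for (int i = 0; i < count; i++)
            map[i] = i * i;
        return map;
    }
}
```

A dictionary built with `new ConstantHashComparer()` still returns the right value for every key, because Equals() settles the final match. Every key ends up in the same bucket, though, so each lookup has to walk through the colliding entries, and with millions of items that cost is what makes it slow.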
