
GetHashCode and Buckets

I am trying to get a better understanding of how the internals of hashed sets such as HashSet<T> work and why they perform well. I found the following article, which implements a simple example with a bucket list: http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/

As far as I understand the article (and this is also what I assumed before), the bucket list groups a certain number of elements into each bucket. A bucket is identified by the hash code, that is, by the value GetHashCode returns for the element. I thought the performance gain came from there being fewer buckets than elements.

I have written the following naive test code:

    public class CustomHashCode
    {
        public int Id { get; set; }

        public override int GetHashCode()
        {
            //return Id.GetHashCode(); // Way better performance
            return Id % 40; // Bad performance! But why?
        }

        public override bool Equals(object obj)
        {
            // "as" avoids an exception when obj is null or of another type.
            var other = obj as CustomHashCode;
            return other != null && other.Id == Id;
        }
    }

And here is the timing code:

    public static void TestNoCustomHashCode(int iterations)
    {

        var hashSet = new HashSet<NoCustomHashCode>();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Add(new NoCustomHashCode() { Id = j });
        }

        var chc = hashSet.First();
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        for (int j = 0; j < iterations; j++)
        {
            hashSet.Contains(chc);
        }
        stopwatch.Stop();

        Console.WriteLine(string.Format("Elapsed time (ms): {0}", stopwatch.ElapsedMilliseconds));
    }

My naive thought was: let's reduce the number of buckets (with a simple modulo), which should improve performance. But the result is terrible (on my system it takes about 4 seconds with 50000 iterations). I also thought that simply returning the Id as the hash code would perform poorly, since I would end up with 50000 buckets. The opposite is the case; I guess I simply produced tons of so-called collisions instead of improving anything. But then again, how do the bucket lists work?

A Contains check basically does the following:

  1. Gets the hash code of the item.
  2. Finds the corresponding bucket. This is a direct array lookup based on the hash code of the item.
  3. If the bucket exists, tries to find the item in the bucket. This iterates over all the items in the bucket.

By restricting the number of buckets, you've increased the number of items in each bucket, and thus the number of items that the hash set must iterate through, checking each for equality, to see whether an item exists. Thus it takes longer to see if a given item exists.
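To put a number on this, here is a minimal sketch that groups the ids from the question by their hash code and reports the largest group. The names BucketOccupancy and LargestGroup are made up for illustration. It counts items per hash code rather than per internal bucket of HashSet<T>, but items that share a hash code always share a bucket, so the result is a lower bound on how many entries a lookup may have to walk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the question's code.
public static class BucketOccupancy
{
    // Groups the ids 0..count-1 by the hash code that hashFunction produces
    // and returns the size of the largest group, i.e. the largest number of
    // items that are guaranteed to end up in one bucket together.
    public static int LargestGroup(int count, Func<int, int> hashFunction)
    {
        return Enumerable.Range(0, count)
                         .GroupBy(hashFunction)
                         .Max(group => group.Count());
    }
}
```

Calling BucketOccupancy.LargestGroup(50000, id => id % 40) yields 1250, while BucketOccupancy.LargestGroup(50000, id => id) yields 1: with the modulo, every lookup may have to compare against more than a thousand candidates.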

You've probably decreased the memory footprint of the hash set; you may even have decreased the insertion time, although I doubt it. You haven't decreased the existence-check time.

Reducing the number of buckets will not improve performance. In fact, the GetHashCode method of Int32 returns the integer value itself, which is ideal for performance because it produces as many distinct hash codes, and therefore as many usable buckets, as possible.

What gives a hash table its performance is the conversion from the key to the hash code, which lets it quickly eliminate most of the items in the collection. The only items it has to consider are the ones in the same bucket. If you have few buckets, it can eliminate far fewer items.

The worst possible implementation of GetHashCode causes all items to go into the same bucket:

public override int GetHashCode() {
  return 0;
}

This is still a valid implementation, but it means that the hash table performs like a regular list, i.e. it has to loop through all items in the collection to find a match.
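One way to observe this degradation without a stopwatch is to count how often Equals is called. The sketch below assumes a custom IEqualityComparer<int> passed to the real HashSet<T>; CountingComparer and LookupCost are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: delegates hashing to a supplied function and
// counts how many times Equals is invoked.
public sealed class CountingComparer : IEqualityComparer<int>
{
    private readonly Func<int, int> _hash;

    public int EqualsCalls { get; private set; }

    public CountingComparer(Func<int, int> hash) { _hash = hash; }

    public bool Equals(int x, int y) { EqualsCalls++; return x == y; }

    public int GetHashCode(int value) { return _hash(value); }

    public void Reset() { EqualsCalls = 0; }
}

public static class LookupCost
{
    // Fills a set with 0..count-1, then returns the number of Equals calls
    // made by a single Contains call for a value that is not in the set.
    public static int EqualsCallsForMiss(int count, Func<int, int> hash)
    {
        var comparer = new CountingComparer(hash);
        var set = new HashSet<int>(comparer);
        for (int i = 0; i < count; i++)
        {
            set.Add(i);
        }

        comparer.Reset();   // ignore the comparisons made while filling the set
        set.Contains(-1);   // -1 was never added, so the whole chain is searched
        return comparer.EqualsCalls;
    }
}
```

With a constant hash, LookupCost.EqualsCallsForMiss(1000, v => 0) has to compare against all 1000 stored items, just like a list search. With v => v, far fewer comparisons are needed, because the lookup only considers entries whose hash code matches.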

A simple HashSet<T> could be implemented like this (just a sketch, doesn't compile):

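class HashSet<T>
{
    struct Element
    {
        public int Hash; // cached hash code of Item
        public int Next; // index of the next element in the same bucket, negative at the end of the chain
        public T Item;
    }

    // buckets[b] holds the index of the first element of bucket b (negative if the bucket is empty)
    int[] buckets = new int[Capacity];
    Element[] data = new Element[Capacity];

    bool Contains(T item)
    {
        int hash = item.GetHashCode();
        // Bucket lookup is a simple array lookup => cheap
        int index = buckets[(uint)hash % Capacity];
        // Search for the actual item is linear in the number of items in the bucket
        while (index >= 0)
        {
            // Comparing the cached hash first skips most Equals calls
            if ((data[index].Hash == hash) && Equals(data[index].Item, item))
                return true;
            index = data[index].Next;
        }
        return false;
    }
}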

Looking at this, the cost of searching in Contains is proportional to the number of items in the bucket. So having more buckets makes the search cheaper, but once the number of buckets exceeds the number of items, the gain from additional buckets quickly diminishes.
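For readers who want something that compiles, here is one possible minimal version of the same idea. It is an illustration under simplifying assumptions, not how the framework's HashSet<T> is actually written: the class name SimpleHashSet<T> is made up, the capacity is fixed at construction, there is no resizing or removal, and items are assumed to be non-null.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: fixed capacity, no resizing, no removal, non-null items.
public class SimpleHashSet<T>
{
    private struct Element
    {
        public int Hash; // cached hash code of Item
        public int Next; // index of the next element in the same bucket, or -1
        public T Item;
    }

    private readonly int[] _buckets;   // index of the first element per bucket, or -1
    private readonly Element[] _data;  // elements in insertion order
    private int _count;

    public SimpleHashSet(int capacity)
    {
        _buckets = new int[capacity];
        for (int i = 0; i < capacity; i++) _buckets[i] = -1; // mark all buckets empty
        _data = new Element[capacity];
    }

    public int Count { get { return _count; } }

    // Reduces an arbitrary hash code to a valid bucket index.
    private int BucketOf(int hash)
    {
        return (int)((uint)hash % (uint)_buckets.Length);
    }

    public bool Contains(T item)
    {
        int hash = item.GetHashCode();
        // Walk the chain of the bucket; cost is linear in the chain length.
        for (int i = _buckets[BucketOf(hash)]; i >= 0; i = _data[i].Next)
        {
            if (_data[i].Hash == hash && EqualityComparer<T>.Default.Equals(_data[i].Item, item))
                return true;
        }
        return false;
    }

    public bool Add(T item)
    {
        if (Contains(item)) return false;
        if (_count == _data.Length) throw new InvalidOperationException("Set is full.");

        int hash = item.GetHashCode();
        int bucket = BucketOf(hash);
        // The new element becomes the head of its bucket's chain.
        _data[_count] = new Element { Hash = hash, Next = _buckets[bucket], Item = item };
        _buckets[bucket] = _count++;
        return true;
    }
}
```

With a capacity of 8, the values 1 and 9 land in the same bucket, so a lookup for 9 walks a chain of two elements; this is the per-bucket linear search described above.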

Having diverse hash codes also serves as an early out when comparing objects within a bucket, avoiding potentially costly Equals calls.

In short, GetHashCode should be as diverse as possible. It's the job of HashSet<T> to reduce that large space to an appropriate number of buckets, which is approximately the number of items in the collection (typically within a factor of two).
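The reduction step can be sketched as follows. This assumes a plain modulo over an arbitrary bucket count; the real HashSet<T> chooses its own bucket count and reduction, and BucketMapping, ToBucket, and UsedBuckets are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper showing how a collection might map hash codes to buckets.
public static class BucketMapping
{
    // Maps an arbitrary 32-bit hash code onto one of bucketCount buckets.
    // The cast to uint keeps negative hash codes from producing a negative index.
    public static int ToBucket(int hashCode, int bucketCount)
    {
        return (int)((uint)hashCode % (uint)bucketCount);
    }

    // Counts how many distinct buckets are hit by the ids 0..itemCount-1
    // when their hash codes come from hashFunction.
    public static int UsedBuckets(int itemCount, int bucketCount, Func<int, int> hashFunction)
    {
        return Enumerable.Range(0, itemCount)
                         .Select(id => ToBucket(hashFunction(id), bucketCount))
                         .Distinct()
                         .Count();
    }
}
```

With 50000 items and an assumed 65536 buckets, the identity hash uses 50000 different buckets, whereas id % 40 uses only 40 no matter how many buckets the collection provides. Restricting the hash code yourself therefore only takes away buckets the collection would otherwise have used.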
