
How do I generate a hashcode from a byte array in C#?

Say I have an object that stores a byte array and I want to be able to efficiently generate a hashcode for it. I've used cryptographic hash functions for this in the past because they are easy to implement, but they do a lot more work than necessary to be cryptographically one-way, and I don't care about that (I'm just using the hashcode as a key into a hashtable).

Here's what I have today:

using System;
using System.Security.Cryptography;

struct SomeData : IEquatable<SomeData>
{
    private readonly byte[] data;

    public SomeData(byte[] data)
    {
        if (null == data || data.Length == 0)
        {
            throw new ArgumentException("data must be a non-empty byte array", "data");
        }
        // Defensive copy so callers can't mutate the stored bytes.
        this.data = new byte[data.Length];
        Array.Copy(data, this.data, data.Length);
    }

    public override bool Equals(object obj)
    {
        return obj is SomeData && Equals((SomeData)obj);
    }

    public bool Equals(SomeData other)
    {
        if (other.data.Length != data.Length)
        {
            return false;
        }
        for (int i = 0; i < data.Length; ++i)
        {
            if (data[i] != other.data[i])
            {
                return false;
            }
        }
        return true;
    }

    public override int GetHashCode()
    {
        // MD5 gives stable, content-based hashes, but it is overkill for a hashtable key.
        return BitConverter.ToInt32(new MD5CryptoServiceProvider().ComputeHash(data), 0);
    }
}
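
For context, here is a minimal usage sketch (mine, not part of the question) showing why value-based Equals/GetHashCode matter for a dictionary key; it assumes a using for System.Collections.Generic:

var lookup = new Dictionary<SomeData, string>();
lookup[new SomeData(new byte[] { 1, 2, 3 })] = "first";

// This lookup only succeeds if GetHashCode returns the same value for
// equal contents and Equals then confirms the match.
string value = lookup[new SomeData(new byte[] { 1, 2, 3 })];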

Any thoughts?


dp: You are right that I missed a check in Equals; I have updated it. Using the existing hashcode from the byte array will result in reference equality (or at least that same concept translated to hashcodes). For example:

byte[] b1 = new byte[] { 1 };
byte[] b2 = new byte[] { 1 };
int h1 = b1.GetHashCode();   // identity-based hash of the b1 instance
int h2 = b2.GetHashCode();   // identity-based hash of the b2 instance; almost certainly != h1

With that code, despite the two byte arrays having the same values within them, they refer to different parts of memory and will (probably) produce different hash codes. I need the hash codes for two byte arrays with the same contents to be equal.

The hash code of an object does not need to be unique.

The checking rule is:

  • Are the hash codes equal? Then call the full (slow) Equals method.
  • Are the hash codes not equal? Then the two items are definitely not equal.

All you want is a GetHashCode algorithm that splits up your collection into roughly even groups - it shouldn't form the key, as the HashTable or Dictionary<> will need to use the hash to optimise retrieval.
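
As an illustration of that contract (my sketch, not from the original answer, assuming System.Collections.Generic), you could even skip a wrapper type and hand Dictionary a custom comparer whose GetHashCode merely buckets keys while Equals does the exact check:

sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    // Equals does the exact (slow) comparison.
    public bool Equals(byte[] x, byte[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (x[i] != y[i]) return false;
        return true;
    }

    // GetHashCode only needs to spread keys into roughly even buckets.
    public int GetHashCode(byte[] obj)
    {
        unchecked
        {
            int hash = 17;
            foreach (byte b in obj)
                hash = hash * 31 + b;
            return hash;
        }
    }
}

// Usage: new Dictionary<byte[], string>(new ByteArrayComparer())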

How long do you expect the data to be? How random? If lengths vary greatly (say, for files) then just return the length. If lengths are likely to be similar, look at a subset of the bytes that varies.
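
A sketch of that idea (my illustration; it assumes the interesting variation sits at the ends and middle of the array):

    static int CheapHash(byte[] data)
    {
        unchecked
        {
            // Mix the length with a few sampled bytes instead of touching every byte.
            int hash = data.Length;
            if (data.Length > 0)
            {
                hash = hash * 31 + data[0];                 // first byte
                hash = hash * 31 + data[data.Length / 2];   // middle byte
                hash = hash * 31 + data[data.Length - 1];   // last byte
            }
            return hash;
        }
    }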

GetHashCode should be a lot quicker than Equals, but doesn't need to be unique.

Two identical things must never have different hash codes. Two different objects should not have the same hash code, but some collisions are to be expected (after all, there are more permutations than possible 32-bit integers).

Don't use cryptographic hashes for a hashtable, that's ridiculous/overkill.

Here ya go... Modified FNV Hash in C#

http://bretm.home.comcast.net/hash/6.html

    public static int ComputeHash(params byte[] data)
    {
        unchecked
        {
            const int p = 16777619;          // FNV prime
            int hash = (int)2166136261;      // FNV offset basis

            for (int i = 0; i < data.Length; i++)
                hash = (hash ^ data[i]) * p;

            // Extra avalanche mixing on top of plain FNV-1a.
            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }
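
Wiring it into the struct from the question would then be a one-liner (a sketch; `data` is the field from the question's SomeData struct):

    public override int GetHashCode()
    {
        // Delegate to the FNV-based helper above instead of MD5.
        return ComputeHash(data);
    }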

Borrowing from the code generated by JetBrains software, I have settled on this function:

    public override int GetHashCode()
    {
        unchecked
        {
            var result = 0;
            foreach (byte b in _key)
                result = (result * 31) ^ b;   // multiply-by-prime then XOR spreads bits into the upper bytes
            return result;
        }
    }

The problem with just XORing the bytes is that 3/4 of the returned value (3 bytes) has only 2 possible values (all on or all off). This spreads the bits around a little more.
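
To make that concrete, here's what the XOR-only version looks like (a sketch for comparison, not a recommendation); every byte lands in the low 8 bits, so the result never leaves the range 0..255:

    static int XorOnlyHash(byte[] data)
    {
        int result = 0;
        foreach (byte b in data)
            result ^= b;   // never sets any bit above bit 7
        return result;
    }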

Setting a breakpoint in Equals was a good suggestion. Adding about 200,000 entries of my data to a Dictionary sees about 10 Equals calls (or 1 in 20,000).

Have you compared with the SHA1CryptoServiceProvider.ComputeHash method? It takes a byte array and returns a SHA1 hash, and I believe it's pretty well optimized. I used it in an Identicon Handler that performed pretty well under load.
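
For reference, a GetHashCode built on SHA1 would look roughly like this (my sketch; SHA1CryptoServiceProvider lives in System.Security.Cryptography, and `data` again stands for the wrapped byte array):

    public override int GetHashCode()
    {
        // Hash the bytes with SHA1 and fold the first 4 bytes into an int.
        using (var sha1 = new SHA1CryptoServiceProvider())
        {
            return BitConverter.ToInt32(sha1.ComputeHash(data), 0);
        }
    }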

I found interesting results:

I have the class:

public class MyHash : IEquatable<MyHash>
{
    public byte[] Val { get; private set; }

    public MyHash(byte[] val)
    {
        Val = val;
    }

    /// <summary>
    /// Test if this instance is equal to another instance
    /// </summary>
    /// <param name="other"></param>
    /// <returns></returns>
    public bool Equals(MyHash other)
    {
        if (other == null || other.Val.Length != this.Val.Length)
        {
            return false;
        }

        for (var i = 0; i < this.Val.Length; i++)
        {
            if (other.Val[i] != this.Val[i])
            {
                return false;
            }
        }

        return true;
    }

    public override int GetHashCode()
    {
        // Base64-encode the bytes and reuse string.GetHashCode.
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();
    }
}

Then I created a dictionary with keys of type MyHash in order to test how fast I can insert, and also to see how many collisions there are. I did the following:

        // dictionary we use to check for collisions
        Dictionary<MyHash, bool> checkForDuplicatesDic = new Dictionary<MyHash, bool>();

        // used to generate random arrays
        Random rand = new Random();

        var now = DateTime.Now;

        for (var j = 0; j < 100; j++)
        {
            for (var i = 0; i < 5000; i++)
            {
                // create new array and populate it with random bytes
                byte[] randBytes = new byte[byte.MaxValue];
                rand.NextBytes(randBytes);

                MyHash h = new MyHash(randBytes);

                if (checkForDuplicatesDic.ContainsKey(h))
                {
                    Console.WriteLine("Duplicate");
                }
                else
                {
                    checkForDuplicatesDic[h] = true;
                }
            }
            Console.WriteLine(j);
            checkForDuplicatesDic.Clear(); // clear dictionary every 5000 iterations
        }

        var elapsed = DateTime.Now - now;
        Console.WriteLine(elapsed);   // report total time for all insertions

        Console.Read();
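
One side note on the measurement (my suggestion, not part of the original test): DateTime.Now has fairly coarse resolution, so System.Diagnostics.Stopwatch would give a more trustworthy timing for a loop like this:

        // Sketch: Stopwatch-based timing instead of DateTime subtraction.
        var sw = System.Diagnostics.Stopwatch.StartNew();
        // ... run the insertion loop here ...
        sw.Stop();
        Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);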

Every time I insert a new item into the dictionary, the dictionary calculates the hash of that object. So you can tell which method is most efficient by placing the various answers found here into the public override int GetHashCode() method. The method that was by far the fastest and had the fewest collisions was:

    public override int GetHashCode()
    {            
        var str = Convert.ToBase64String(Val);
        return str.GetHashCode();          
    }

That took 2 seconds to execute. The method

    public override int GetHashCode()
    {
        // 7.1 seconds
        unchecked
        {
            const int p = 16777619;
            int hash = (int)2166136261;

            for (int i = 0; i < Val.Length; i++)
                hash = (hash ^ Val[i]) * p;

            hash += hash << 13;
            hash ^= hash >> 7;
            hash += hash << 3;
            hash ^= hash >> 17;
            hash += hash << 5;
            return hash;
        }
    }

also had no collisions, but it took 7 seconds to execute!

If you are looking for performance, I tested a few hash keys, and I recommend Bob Jenkins's hash function. It is crazy fast to compute and will give as few collisions as the cryptographic hash you have used until now.

I don't know C# at all, and I don't know if it can link with C, but here is its implementation in C.

Is using the existing hashcode from the byte array field not good enough? Also note that in the Equals method you should check that the arrays are the same size before doing the compare.

Generating a good hash is easier said than done. Remember, you're basically representing n bytes of data with m bits of information. The larger your data set and the smaller m is, the more likely you'll get a collision ... two pieces of data resolving to the same hash.

The simplest hash I ever learned was simply XORing all the bytes together. It's easy, faster than most complicated hash algorithms, and a halfway decent general-purpose hash algorithm for small data sets. It's the Bubble Sort of hash algorithms, really. Since the simple implementation would leave you with 8 bits, that's only 256 hashes ... not so hot. You could XOR chunks instead of individual bytes, as in the sketch below, but then the algorithm gets more complicated.
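
A sketch of that chunked variant (my illustration): XOR the array four bytes at a time so all 32 bits of the result get used, with a small tail loop for leftover bytes:

    static int XorChunksHash(byte[] data)
    {
        int result = 0;
        int i = 0;
        for (; i + 4 <= data.Length; i += 4)
            result ^= BitConverter.ToInt32(data, i);    // whole 32-bit chunk
        for (; i < data.Length; i++)
            result ^= data[i] << (8 * (i % 4));         // leftover bytes
        return result;
    }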

So certainly, the cryptographic algorithms may be doing some stuff you don't need ... but they're also a huge step up in general-purpose hash quality. The MD5 hash you're using has 128 bits, with billions and billions of possible hashes. The only way you're likely to get something better is to take some representative samples of the data you expect to go through your application and try various algorithms on it to see how many collisions you get.

So until I see some reason to not use a canned hash algorithm (performance, perhaps?), I'm going to have to recommend you stick with what you've got.

Whether you want a perfect hash function (a different value for every pair of objects that are not equal) or just a pretty good one is always a performance tradeoff; it normally takes time to compute a good hash function, and if your dataset is smallish you're better off with a fast function. The most important thing (as your second post points out) is correctness, and to achieve that all you need is to return the Length of the array. Depending on your dataset, that might even be OK. If it isn't (say, all your arrays are equally long), you can go with something cheap like looking at the first and last values and XORing them, and then add more complexity as you see fit for your data.

A quick way to see how your hash function performs on your data is to add all the data to a hashtable and count the number of times the Equals function gets called; if it is called too often, you have more work to do on the function (a sketch of this instrumentation follows). If you do this, just keep in mind that the hashtable's size needs to be set bigger than your dataset when you start, otherwise you are going to rehash the data, which will trigger reinserts and more Equals evaluations (though possibly more realistic?).
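
A sketch of that instrumentation (mine), reusing the Equals from the question with a simple counter; while inserting distinct keys, every call indicates a hash collision that had to be resolved:

    private static int equalsCalls;   // instrumentation only, not production code

    public bool Equals(SomeData other)
    {
        equalsCalls++;   // a high count relative to the insert count means a weak hash
        if (other.data.Length != data.Length) return false;
        for (int i = 0; i < data.Length; ++i)
            if (data[i] != other.data[i]) return false;
        return true;
    }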

For some objects (not this one) a quick HashCode can be generated by ToString().GetHashCode(). Certainly not optimal, but useful, as people tend to return something close to the identity of the object from ToString(), and that is exactly what GetHashCode is looking for.

Trivia: the worst performance I have ever seen was when someone by mistake returned a constant from GetHashCode. Easy to spot with a debugger, though, especially if you do lots of lookups in your hashtable.

RuntimeHelpers.GetHashCode might help:

From MSDN:

Serves as a hash function for a particular type, suitable for use in hashing algorithms and data structures such as a hash table.
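
A minimal usage sketch (my addition); note that RuntimeHelpers.GetHashCode hashes by reference identity and ignores any GetHashCode override, so two distinct arrays with equal contents will almost certainly get different values:

using System.Runtime.CompilerServices;

int identityHash = RuntimeHelpers.GetHashCode(new byte[] { 1, 2, 3 });   // identity-based, not content-based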

private int? hashCode;

public override int GetHashCode()
{
    // Lazily compute the hash once and cache it (the wrapped byte[] is assumed immutable).
    if (!hashCode.HasValue)
    {
        var hash = 0;
        for (var i = 0; i < bytes.Length; i++)
        {
            hash = (hash << 4) + bytes[i];   // shift-and-add over the wrapped byte array
        }
        hashCode = hash;
    }
    return hashCode.Value;
}
