Quickly creating 32 bit hash code uniquely identifying a struct composed of (mostly) primitive values

EDIT: 64 or 128 bit would also work. My brain just jumped to 32 bit for some reason, thinking it would be sufficient.

I have a struct that is composed mostly of numeric values (int, decimal) and 3 strings that are never more than 12 alphabetic characters each. I'm trying to create an integer value that will work as a hash code, and trying to create it quickly. Some of the numeric values are also nullable.

It seems like BitVector32 or BitArray would be useful entities for this endeavor, but I'm just not sure how to bend them to my will in this task. My struct contains 3 strings, 12 decimals (7 of which are nullable), and 4 ints.

To simplify my use case, let's say you have the following struct:

public struct Foo
{
    public decimal MyDecimal;
    public int? MyInt;
    public string Text;
}

I know I can get numeric identifiers for each value. MyDecimal and MyInt are of course unique, from a numerical standpoint. And the string has a GetHashCode() function which will return a usually-unique value.

So, with a numeric identifier for each, is it possible to generate a hash code that uniquely identifies this structure? E.g. I can compare 2 different Foos containing the same values and get the same hash code, every time (regardless of app domain, restarting the app, time of day, alignment of Jupiter's moons, etc).

The hash would be sparse, so I don't anticipate collisions from my use cases.

Any ideas? On my first run at it I converted everything to a string representation, concatenated it, and used the built-in GetHashCode(), but that seems terribly ... inefficient.

EDIT: A bit more background information. The structure data is being delivered to a web client, and the client does a lot of computation of included values, string construction, etc to re-render the page. The aforementioned 19 field structure represents a single unit of information; each page could have many units. I'd like to do some client-side caching of the rendered result, so I can quickly re-render a unit without recomputing on the client side if I see the same hash identifier from the server. JavaScript numeric values are all 64 bit, so I suppose my 32 bit constraint is artificial and limiting. 64 bit would work, or I suppose even 128 bit if I can break it into two 64 bit values on the server.

Well, even in a sparse table one had better be prepared for collisions, depending on what "sparse" means.

[Figure: hash collision probability (uniform distribution)]

You would need to be able to make very specific assumptions about the data you will be hashing at the same time to beat this graph with 32 bits.
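For a rough feel of the numbers behind such a graph, the usual birthday approximation can be sketched in a few lines. This is only an illustration that assumes perfectly uniformly distributed hashes; `BirthdayBound` and `CollisionProbability` are made-up names, not framework types.

```csharp
using System;

public static class BirthdayBound
{
    // Approximate probability that at least two of itemCount uniformly
    // distributed hashes of width hashBits collide:
    //   p ~ 1 - exp(-n(n-1) / (2 * 2^bits))
    public static double CollisionProbability(double itemCount, int hashBits)
    {
        double buckets = Math.Pow(2.0, hashBits);
        double exponent = itemCount * (itemCount - 1.0) / (2.0 * buckets);

        // For tiny exponents 1 - exp(-x) loses precision in double arithmetic,
        // so fall back to the first-order value x.
        return exponent < 1e-9 ? exponent : 1.0 - Math.Exp(-exponent);
    }
}
```

Under that assumption, roughly 77,000 distinct values are enough for about even odds of at least one collision in a 32 bit hash, while a million values in a 64 bit hash stay on the order of 10^-8.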

Go with SHA256. Your hashes will not depend on the CLR version and you will have no collisions. Well, you will still have some, but less frequently than meteorite impacts, so you can afford not to anticipate any.
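A minimal sketch of what the SHA256 route could look like for the simplified Foo from the question. The field order, the presence flags for nullable members, and the names `FooFingerprint`, `Compute` and `ToHex` are my own assumptions for illustration, not an established convention.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public struct Foo
{
    public decimal MyDecimal;
    public int? MyInt;
    public string Text;
}

public static class FooFingerprint
{
    // Serializes the fields in a fixed order and hashes the bytes.
    // The result depends only on the field values, not on the process.
    public static byte[] Compute(Foo foo)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(foo.MyDecimal);

            // A presence flag keeps null distinct from any real value.
            writer.Write(foo.MyInt.HasValue);
            if (foo.MyInt.HasValue) writer.Write(foo.MyInt.Value);

            writer.Write(foo.Text != null);
            if (foo.Text != null) writer.Write(foo.Text); // length-prefixed

            writer.Flush();
            using (var sha = SHA256.Create())
            {
                return sha.ComputeHash(stream.ToArray()); // 32 bytes
            }
        }
    }

    // Hex string form, convenient as a cache key on the client.
    public static string ToHex(byte[] hash)
    {
        return BitConverter.ToString(hash).Replace("-", "");
    }
}
```

Two caveats with this sketch: BinaryWriter writes the raw decimal representation, so 1.0m and 1.00m compare equal but produce different fingerprints unless you normalize them first; and a string key sidesteps the fact that a JavaScript number holds only 53 bits of integer precision.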

Two things I suggest you take a look at: here and here. I don't think you'll be able to GUARANTEE no collisions with just 32 bits.

Hash codes, by definition of a hash function, are not meant to be unique. They are only meant to be as evenly distributed across all result values as possible. Getting a hash code for an object is meant to be a quick way to check whether two objects are different. If the hash codes for two objects are different, then those objects are different. But if the hash codes are the same, you have to deeply compare the objects to be sure. The main use of hash codes is in hash-based collections, where they make nearly O(1) retrieval possible.

So in this light, your GetHashCode does not have to be complex, and in fact it shouldn't be. It must strike a balance between being very quick and producing evenly distributed values. If it takes too long to get a hash code, the hash becomes pointless because the advantage over a deep compare is gone. At the other extreme, if the hash code were always 1, for example (lightning fast), it would lead to a deep compare in every case, which makes the hash code pointless too.

So get the balance right and don't try to come up with a perfect hash code. Call GetHashCode on all (or most) of your members and combine the results using the Xor operator, maybe with a bitwise shift operator (<< or >>). Framework types have quite optimized GetHashCode implementations, although they are not guaranteed to return the same values in each application run. There is no guarantee, but they also do not have to change, and a lot of them don't. Use a reflector to make sure, or create your own versions based on the reflected code.

In your particular case, deciding whether you have already processed a structure just by looking at its hash code is a bit risky. The better the hash, the smaller the risk, but it is still there. The ultimate and only unique hash code is... the data itself. When working with hash codes you must also override Object.Equals for your code to be truly reliable.
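As a sketch of that advice applied to the simplified Foo from the question: combine the member hash codes with a rotate and Xor, and keep Equals consistent with it. The rotation amount and the helper name `RotateLeft` are arbitrary choices of mine, and the resulting value is only suitable for in-process use such as dictionary keys.

```csharp
using System;

public struct Foo : IEquatable<Foo>
{
    public decimal MyDecimal;
    public int? MyInt;
    public string Text;

    public bool Equals(Foo other)
    {
        return MyDecimal == other.MyDecimal
            && MyInt == other.MyInt
            && string.Equals(Text, other.Text, StringComparison.Ordinal);
    }

    public override bool Equals(object obj)
    {
        return obj is Foo && Equals((Foo)obj);
    }

    public override int GetHashCode()
    {
        // Rotate before each Xor so that equal member hashes in
        // different positions do not simply cancel each other out.
        int hash = MyDecimal.GetHashCode();
        hash = RotateLeft(hash, 5) ^ MyInt.GetHashCode(); // 0 when null
        hash = RotateLeft(hash, 5) ^ (Text == null ? 0 : Text.GetHashCode());
        return hash;
    }

    private static int RotateLeft(int value, int bits)
    {
        unchecked
        {
            uint u = (uint)value;
            return (int)((u << bits) | (u >> (32 - bits)));
        }
    }
}
```

Equal hash codes here only mean "possibly equal"; the Equals call is what settles it.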

I believe the usual method in .NET is to call GetHashCode on each member of the structure and xor the results.

However, I don't think GetHashCode claims to produce the same hash for the same value in different app domains.
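If the string part has to be stable across processes, one option is to hash the bytes yourself instead of relying on string.GetHashCode. The sketch below swaps in 32 bit FNV-1a, a simple non-cryptographic hash, over the UTF-8 bytes; `StableHash` and `Fnv1a32` are hypothetical names, and treating null like the empty string is my own assumption.

```csharp
using System;
using System.Text;

public static class StableHash
{
    // 32 bit FNV-1a over the UTF-8 encoding of the text.
    // The result depends only on the characters, not on the runtime.
    public static uint Fnv1a32(string text)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        unchecked
        {
            // Assumption: null is hashed the same way as the empty string.
            foreach (byte b in Encoding.UTF8.GetBytes(text ?? string.Empty))
            {
                hash ^= b;      // mix in the next byte
                hash *= prime;  // multiply, wrapping modulo 2^32
            }
        }
        return hash;
    }
}
```

This gives repeatable values, but with only 32 bits the collision odds discussed in the other answers still apply.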

Could you give a bit more information in your question about why you want this hash value and why it needs to be stable over time, across different app domains, etc.?

What goal are you after? If it is performance, then you should use a class, since a struct will be copied by value whenever you pass it as a function parameter.

3 strings, 12 decimals (7 of which are nullable), and 4 ints.

On a 64 bit machine a pointer is 8 bytes in size, a decimal takes 16 bytes and an int 4 bytes. Ignoring padding, your struct will use 232 bytes per instance (3 × 8 + 12 × 16 + 4 × 4). This is much bigger than the recommended maximum of 16 bytes, which makes sense perf-wise (classes take up at least 16 bytes due to their object header, ...).

If you need a fingerprint of the value you can use a cryptographic-grade hash algorithm like SHA256, which will produce a 32 byte fingerprint. This is still not unique, but at least unique enough. But it will cost quite some performance as well.

Edit1: After you made clear that you need the hash code to identify the object in a JavaScript web client cache, I am confused. Why does the server send the same data again? Would it not be simpler to make the server smarter so that it sends only data the client has not yet received?

A SHA hash algorithm could be OK in your case to create some object instance tag.


Why do you need a hash code at all? If your goal is to store the values in a memory-efficient manner, you can create a FooList which uses dictionaries to store identical values only once and uses an int as the lookup key.

using System;
using System.Collections.Generic;

namespace MemoryEfficientFoo
{
    class Foo // This is our data structure 
    {
        public int A;
        public string B;
        public Decimal C;
    }

    /// <summary>
    /// List that stores Foos with much less memory when many values are equal. You can cut memory consumption by a factor of 3, but if all values
    /// are different you consume 5 times as much memory as you would by storing them in a plain list! So beware that this trick
    /// might not help in your case. It saves memory only if many values are repeated.
    /// </summary>
    class FooList : IEnumerable<Foo> 
    {
        Dictionary<int, string> Index2B = new Dictionary<int, string>();
        Dictionary<string, int> B2Index = new Dictionary<string, int>();

        Dictionary<int, Decimal> Index2C = new Dictionary<int, decimal>();
        Dictionary<Decimal,int> C2Index = new Dictionary<decimal,int>();

        struct FooIndex
        {
            public int A;
            public int BIndex;
            public int CIndex;
        }

        // List of Foos that holds only the index values used to look up the data in the dictionaries later.
        List<FooIndex> FooValues = new List<FooIndex>();

        public void Add(Foo foo)
        {
            int bIndex;
            if(!B2Index.TryGetValue(foo.B, out bIndex))
            {
                bIndex = B2Index.Count;
                B2Index[foo.B] = bIndex;
                Index2B[bIndex] = foo.B;
            }

            int cIndex;
            if (!C2Index.TryGetValue(foo.C, out cIndex))
            {
                cIndex = C2Index.Count;
                C2Index[foo.C] = cIndex;
                Index2C[cIndex] = foo.C; // store the decimal value itself, not its index
            }

            FooIndex idx = new FooIndex
            {
                A = foo.A,
                BIndex = bIndex,
                CIndex = cIndex
            };

            FooValues.Add(idx);
        }

        public Foo GetAt(int pos)
        {
            var idx = FooValues[pos];
            return new Foo
            {
                A = idx.A,
                B = Index2B[idx.BIndex],
                C = Index2C[idx.CIndex]
            };
        }

        public IEnumerator<Foo> GetEnumerator()
        {
            for (int i = 0; i < FooValues.Count; i++)
            {
                yield return GetAt(i);
            }
        }
        System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }
    }


    class Program
    {
        static void Main(string[] args)
        {
            FooList list = new FooList();
            List<Foo> fooList = new List<Foo>();
            long before = GC.GetTotalMemory(true);
            for (int i = 0; i < 1000 * 1000; i++)
            {
                list
                //fooList
                    .Add(new Foo
                    {
                        A = i,
                        B = "Hi",
                        C = i
                    });

            }
            long after = GC.GetTotalMemory(true);
            Console.WriteLine("Did consume {0:N0} bytes", after - before);
        }
    }
}

A similar memory-conserving list can be found here
