
How to generate a unique hash code for an object, based on its contents?

I need to generate a unique hash code for an object, based on its contents, e.g. two DateTime(2011,06,04) instances should produce the same hash code.

  • I cannot use .GetHashCode() because it might generate the same hash code for objects with different contents.
  • I cannot use ObjectIDGenerator.GetId because it generates a different ID for objects with the same contents.
  • If the object contains other sub-objects, it needs to recursively check these.
  • It needs to work on collections.

The reason I need to write this? I'm writing a caching layer using PostSharp.

Update

I think I may have been asking the wrong question. As Jon Skeet pointed out, to be on the safe side, I need as many unique combinations in the cache key as there are combinations of potential data in the object. Therefore, the best solution might be to build up a long string that encodes the public properties for the object, using reflection. The objects are not too large so this is very quick and efficient:

  • It's efficient to construct the cache key (just convert the public properties of the object into a big string).
  • It's efficient to check for a cache hit (compare two strings).

From a comment:

I'd like something like a GUID based on the object's contents. I don't mind if there's the occasional duplicate every 10 trillion trillion trillion years or so

That seems like an unusual requirement but since that's your requirement, let's do the math.

Let's suppose you make a billion unique objects a year -- thirty per second -- for 10 trillion trillion trillion years. That's 10^49 unique objects you're creating. Working out the math is quite easy; the probability of at least one hash collision in that time is above one in 10^18 when the bit size of the hash is less than 384.

Therefore you'll need at least a 384-bit hash code to have the level of uniqueness that you require. That's a convenient size, being 12 int32s. If you're going to be making more than 30 objects a second, or want the probability to be less than one in 10^18, then more bits will be necessary.
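These figures can be sanity-checked with the standard birthday approximation P ≈ n²/2^(b+1) for n random b-bit hashes. A quick sketch (the 10^49 count and the one-in-10^18 target are taken from the answer; the arithmetic below is just a check):

```csharp
using System;

// Birthday bound: with n random b-bit hashes, P(collision) ≈ n^2 / 2^(b+1).
// Find the smallest b that keeps the probability below 1e-18 for n = 10^49.
double log2N = 49 * Math.Log2(10);      // log2(10^49)
double log2P = -18 * Math.Log2(10);     // log2(1e-18)

// n^2 / 2^(b+1) < 1e-18  =>  b > 2*log2(n) - 1 - log2(1e-18)
double bitsNeeded = 2 * log2N - 1 - log2P;

Console.WriteLine(Math.Round(bitsNeeded)); // 384
```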

Why do you have such stringent requirements?

Here's what I would do if I had your stated requirements. The first problem is to convert every possible datum into a self-describing sequence of bits. If you have a serialization format already, use that. If not, invent one that can serialize all possible objects that you are interested in hashing.

Then, to hash the object, serialize it into a byte array and then run the byte array through the SHA-384 or SHA-512 hashing algorithm. That will produce a professional-crypto-grade 384 or 512 bit hash that is believed to be unique even in the face of attackers trying to force collisions. That many bits should be more than enough to ensure low probability of collision in your ten trillion trillion trillion year timeframe.
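A minimal sketch of that recipe, using a hand-written JSON string as a stand-in for whatever serialization format you choose:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Serialize the object however you like, then run the bytes through
// SHA-384 for a 384-bit content hash.
static string ContentHash(string serialized)
{
    using var sha = SHA384.Create();
    byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(serialized));
    return Convert.ToHexString(digest);
}

string h1 = ContentHash("{\"Year\":2011,\"Month\":6,\"Day\":4}");
string h2 = ContentHash("{\"Year\":2011,\"Month\":6,\"Day\":4}");
Console.WriteLine(h1 == h2);       // True: same contents, same hash
Console.WriteLine(h1.Length * 4);  // 384 bits
```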

If you need to create a unique hash code, then you're basically talking about a number which can represent as many states as your type can have. For DateTime that means taking the Ticks value and the DateTimeKind, I believe.

You may be able to get away with assuming that the top two bits of the Ticks property are going to be zero, and using those to store the kind. That means you're okay up until the year 7307 as far as I can tell:

private static ulong Hash(DateTime when)
{
    // Kind (0-2) fits in the top two bits; Ticks occupies the lower 62.
    ulong kind = (ulong) (int) when.Kind;
    return (kind << 62) | (ulong) when.Ticks;
}
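Because the mapping packs Kind and Ticks into disjoint bit ranges, it is injective below the cutoff year, so content-equal DateTimes hash identically. A quick check (the method is repeated so the snippet stands alone):

```csharp
using System;

static ulong Hash(DateTime when)
{
    // Kind (0-2) goes in the top two bits; Ticks fills the lower 62.
    ulong kind = (ulong)(int)when.Kind;
    return (kind << 62) | (ulong)when.Ticks;
}

var a = new DateTime(2011, 6, 4);
var b = new DateTime(2011, 6, 4);
Console.WriteLine(Hash(a) == Hash(b));             // True: same contents
Console.WriteLine(Hash(a) == Hash(a.AddTicks(1))); // False: different contents
```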

What you're talking about here is not really a hash code; you need a number representation of your state, and for that to be unique it might have to be incredibly large, depending on your object structure.

The reason I need to write this? I'm writing a caching layer using PostSharp.

Why don't you use a regular hash code instead, and handle collisions by actually comparing the objects? That seems to be the most reasonable approach.

An addition to BrokenGlass's answer, which I have voted up and consider to be correct:

Using the GetHashCode / Equals method means that if two objects hash to the same value, you'll be relying on their Equals implementation to tell you whether they are equivalent.

Unless these objects override Equals (which would practically mean that they implement IEquatable<T> where T is their type), the default implementation of Equals is going to do a reference comparison. This in turn means that your cache would mistakenly yield a miss for objects which are "equal" in the business sense but have been constructed independently.

Consider the usage model for your cache carefully, because if you end up using it for classes that are not IEquatable, and in a manner where you expect to be checking non-reference-equal objects for equality, the cache will turn out to be completely useless.
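To illustrate the pitfall, here is a sketch with an invented DateKey type: with an Equals/GetHashCode override, independently constructed but content-equal keys hit the cache; with the default reference comparison, the same lookup would miss.

```csharp
using System;
using System.Collections.Generic;

var cache = new Dictionary<DateKey, string>();
cache[new DateKey(new DateTime(2011, 6, 4))] = "cached value";

// A separately constructed but content-equal key still finds the entry.
Console.WriteLine(cache.ContainsKey(new DateKey(new DateTime(2011, 6, 4)))); // True

sealed class DateKey : IEquatable<DateKey>
{
    public DateTime When { get; }
    public DateKey(DateTime when) => When = when;

    // Content-based equality: a Dictionary treats equal contents as the same key.
    public bool Equals(DateKey other) => other != null && When == other.When;
    public override bool Equals(object obj) => Equals(obj as DateKey);
    public override int GetHashCode() => When.GetHashCode();
}
```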

We had exactly the same requirement, and here is the function I came up with. It works well for the types of objects we need to cache:

public static string CreateCacheKey(this object obj, string propName = null)
{
    var sb = new StringBuilder();
    if (obj.GetType().IsValueType || obj is string)
    {
        sb.AppendFormat("{0}_{1}|", propName, obj);
    }
    else
    {
        foreach (var prop in obj.GetType().GetProperties())
        {
            if (typeof(IEnumerable<object>).IsAssignableFrom(prop.PropertyType))
            {
                // Collection property: recurse into each element.
                var get = prop.GetGetMethod();
                if (!get.IsStatic && get.GetParameters().Length == 0)
                {
                    var collection = (IEnumerable<object>)get.Invoke(obj, null);
                    if (collection != null)
                        foreach (var o in collection)
                            sb.Append(o.CreateCacheKey(prop.Name));
                }
            }
            else
            {
                // Scalar property: append its name and value.
                sb.AppendFormat("{0}{1}_{2}|", propName, prop.Name, prop.GetValue(obj, null));
            }
        }
    }
    return sb.ToString();
}

So, for example, if we have something like this:

var bar = new Bar()
{
    PropString = "test string",
    PropInt = 9,
    PropBool = true,
    PropListString = new List<string>() {"list string 1", "list string 2"},
    PropListFoo =
        new List<Foo>()
            {new Foo() {PropString = "foo 1 string"}, new Foo() {PropString = "foo 2 string"}},
    PropListTuple =
        new List<Tuple<string, int>>()
            {
                new Tuple<string, int>("tuple 1 string", 1), new Tuple<string, int>("tuple 2 string", 2)
            }
};

var cacheKey = bar.CreateCacheKey();

The cache key generated by the method above will be:

PropString_test string|PropInt_9|PropBool_True|PropListString_list string 1|PropListString_list string 2|PropListFooPropString_foo 1 string|PropListFooPropString_foo 2 string|PropListTupleItem1_tuple 1 string|PropListTupleItem2_1|PropListTupleItem1_tuple 2 string|PropListTupleItem2_2|

You can calculate e.g. an MD5 sum (or something like that) from the object serialized to JSON. If you want only some properties to matter, you can create an anonymous object on the way:

public static string GetChecksum(this YourClass obj)
{
    var copy = new
    {
        obj.Prop1,
        obj.Prop2
    };
    var json = JsonConvert.SerializeObject(copy);

    return json.CalculateMD5Hash();
}

I use that for checking whether someone has messed with the license data stored in my database. You can also append some seed to the JSON string to complicate things.
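The CalculateMD5Hash helper isn't shown in the answer; one plausible implementation (an assumption, not the author's code) might look like:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

var digest = "abc".CalculateMD5Hash();
Console.WriteLine(digest); // 900150983CD24FB0D6963F7D28E17F72

static class StringHashExtensions
{
    // Hypothetical stand-in for the CalculateMD5Hash extension used above:
    // MD5 of the UTF-8 bytes, rendered as uppercase hex.
    public static string CalculateMD5Hash(this string input)
    {
        using var md5 = MD5.Create();
        byte[] bytes = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
        return Convert.ToHexString(bytes);
    }
}
```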

I cannot use .GetHashCode() because it might generate the same hash code for objects with different contents.

It's quite normal for a hash code to have collisions. If your hash code has a fixed length (32 bits in the case of the standard .NET hash code), then you're bound to have collisions with any values whose range is bigger than this (e.g. 64 bits for a long; n*64 bits for an array of n longs, etc.).

In fact, for any hash code with a finite length of N bits, there will always be collisions among any set of more than 2^N distinct values -- that's the pigeonhole principle.

What you're asking for isn't feasible in the general case.

Would this extension method suit your purposes? If the object is a value type, it just returns its hash code. Otherwise, it recursively gets the value of each property and combines them into a single hash.

using System;
using System.Reflection;

public static class HashCode
{
    public static ulong CreateHashCode(this object obj)
    {
        // Null values reached through the recursive property walk hash to zero.
        if (obj == null)
            return 0;

        ulong hash = 0;
        Type objType = obj.GetType();

        if (objType.IsValueType || obj is string)
        {
            unchecked
            {
                hash = (uint)obj.GetHashCode() * 397;
            }

            return hash;
        }

        unchecked
        {
            foreach (PropertyInfo property in obj.GetType().GetProperties())
            {
                object value = property.GetValue(obj, null);
                hash ^= value.CreateHashCode();
            }
        }

        return hash;
    }
}

Generic Extension Method

using System.Collections;

public static class GenericExtensions
{
    public static int GetDeepHashCode<T>(this T obj)
    {
        if (obj == null)
            return 0;

        if (obj.GetType().IsValueType || obj is string)
            return obj.GetHashCode();

        // Collections: weight each element's hash by its position so that
        // order matters, and recurse into the elements.
        if (obj is IEnumerable enumerable)
        {
            var result = 0;
            var i = 1;

            foreach (var item in enumerable)
            {
                result += item.GetDeepHashCode() * i;
                i++;
            }

            return result;
        }

        // Everything else: recurse into the public properties.
        var hash = 0;

        foreach (var property in obj.GetType().GetProperties())
        {
            var value = property.GetValue(obj);

            hash += value.GetDeepHashCode();
        }

        return hash;
    }
}

Some of the answers here serialize to JSON and generate an MD5 hash from that. This works most of the time, except when collections contain the same items in a different order: the same object can then generate different hashes purely because of the ordering difference.

The solution I came up with is below, where I serialize to JSON (using Newtonsoft Json.NET) and order any child collections by hashing each of the items and sorting by that hash. This gives us a deterministic serialized representation we can generate a hash from.

There might be some scenarios I'm not fully accounting for, but this works for the nested collections of complex objects for most common scenarios.

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;

static class ObjectHashGenerator
{
    private static readonly OrderedPropertiesContractResolver ContractResolver = new();
    private static readonly OrderedCollectionConverter Converter = new();
    private static readonly IList<JsonConverter> Converters = new List<JsonConverter>(new[] { Converter });
    private static readonly JsonSerializerSettings Settings = new()
    {
        ContractResolver = ContractResolver,
        Converters = Converters
    };
    
    public static string GenerateHash(this object item)
    {
        var serializedItem = JsonConvert.SerializeObject(item, Settings);
        var hash = GenerateMd5(serializedItem);
        return hash;
    }

    public static string GenerateMd5(string input)
    {
        using var md5 = MD5.Create();
        var inputBytes = Encoding.UTF8.GetBytes(input);
        var hashBytes = md5.ComputeHash(inputBytes);
        return Convert.ToHexString(hashBytes);
    }
}

sealed class OrderedPropertiesContractResolver : DefaultContractResolver
{
    protected override IList<JsonProperty> CreateProperties(Type type, MemberSerialization memberSerialization)
    {
        var props = base.CreateProperties(type, memberSerialization);
        return props.OrderBy(p => p.PropertyName).ToList();
    }
}

sealed class OrderedCollectionConverter : JsonConverter
{
    public override bool CanConvert(Type type)
    {
        if (type == typeof(string)) 
            return false;
        
        return typeof(IEnumerable).IsAssignableFrom(type);
    }

    public override void WriteJson(JsonWriter writer, object? value, JsonSerializer serializer)
    {
        if (value is not IEnumerable enumerable) 
            return;
        
        var itemsJson = new List<string>();
        
        foreach (var item in enumerable)
        {
            var stringBuilder = new StringBuilder();
            using var stringWriter = new StringWriter(stringBuilder);
            serializer.Serialize(stringWriter, item); 
            
            var result = stringBuilder.ToString();
            itemsJson.Add(result);
            stringBuilder.Clear();
        }
        
        // We order each collection by hash of the item so the serialized JSON is deterministically 
        // created so the hash can be the same for objects regardless of collection order on the original.
        writer.WriteStartArray();
        foreach (var item in itemsJson.OrderBy(ObjectHashGenerator.GenerateMd5))
            writer.WriteRawValue(item);
        writer.WriteEndArray();
    }

    public override object ReadJson(JsonReader reader, Type type, object? existingValue, JsonSerializer serializer)
    {
        // This converter is only used for serialization in order to generate a hash
        throw new NotImplementedException();
    }
}
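The normalization step can be seen in isolation with a toy version that needs no Newtonsoft (the CollectionHash helper below is invented for the illustration): hash each element, sort the elements by that hash, then hash the joined result, so permuted collections produce the same value.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static string Md5Hex(string s)
{
    using var md5 = MD5.Create();
    return Convert.ToHexString(md5.ComputeHash(Encoding.UTF8.GetBytes(s)));
}

// Sort the elements by their own MD5, then hash the joined sequence,
// so the result is independent of the original element order.
static string CollectionHash(string[] items) =>
    Md5Hex(string.Join("|", items.OrderBy(Md5Hex, StringComparer.Ordinal)));

var h1 = CollectionHash(new[] { "a", "b", "c" });
var h2 = CollectionHash(new[] { "c", "a", "b" });
Console.WriteLine(h1 == h2); // True: order no longer matters
```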
