简体   繁体   中英

Generate integer based on any given string (without GetHashCode)

I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.

I tried using .GetHasCode() however this is very unreliable once I move the project to another machine, as GetHasCode() returns different values for the same string

It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.

The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.

The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully, however it currently relies on .GetHasCode() and doesn't work when multiple machines get involved.

Any insight at all would be greatly appreciated.

MD5 hashing returns a byte array which could be converted to an integer:

var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
var ivalue = BitConverter.ToInt32(hashed, 0);

Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.

If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.

If you do the math , you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 70,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.

As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32 bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is slightly larger--closer to 70,000--but the rule of thumb gets you in the ballpark.

Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.

That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.

此代码将任何字符串映射到0到100之间的int

int x= "ali".ToCharArray().Sum(x => x)%100;
using (MD5 md5 = MD5.Create())
{
    bigInteger = new BigInteger(md5.ComputeHash(Encoding.Default.GetBytes(myString)));
}

BigInteger requires Org.BouncyCastle.Math

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM