
Generate integer based on any given string (without GetHashCode)

I'm attempting to write a method to generate an integer based on any given string. When calling this method on two identical strings, I need it to generate exactly the same integer both times.

I tried using .GetHashCode(), but it is very unreliable once I move the project to another machine, because GetHashCode() returns different values for the same string.

It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.

The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.

The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully. However, the search currently relies on .GetHashCode() and doesn't work when multiple machines get involved.

Any insight at all would be greatly appreciated.

MD5 hashing returns a byte array which could be converted to an integer:

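using System;
using System.Security.Cryptography;
using System.Text;
var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
// Interpret the first 4 bytes of the 16-byte digest as a 32-bit integer.
var ivalue = BitConverter.ToInt32(hashed, 0);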

Of course, you are converting from a 128-bit hash to a 32-bit int, so some information is lost, which increases the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.
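Since the goal is the same integer on every machine, it may be worth noting that BitConverter reads bytes in the byte order of the machine it runs on. The sketch below assembles the four bytes by hand so the result does not depend on that. StableHash and Md5ToInt32 are hypothetical names, and the offset parameter plays the role of the second argument to ToInt32 mentioned above:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StableHash
{
    // Hypothetical helper: reads 4 bytes of the MD5 digest starting at
    // `offset` (0..12) and combines them in little-endian order by hand,
    // so the result does not depend on the machine's byte order.
    public static int Md5ToInt32(string text, int offset = 0)
    {
        if (offset < 0 || offset > 12)
            throw new ArgumentOutOfRangeException(nameof(offset));

        using (MD5 md5 = MD5.Create())
        {
            // UTF-8 is fixed explicitly so the input bytes are the same everywhere.
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            return digest[offset]
                 | (digest[offset + 1] << 8)
                 | (digest[offset + 2] << 16)
                 | (digest[offset + 3] << 24);
        }
    }

    public static void Main()
    {
        Console.WriteLine(Md5ToInt32("abcd"));     // bytes 0..3 of the digest
        Console.WriteLine(Md5ToInt32("abcd", 4));  // bytes 4..7 of the digest
    }
}
```

On a little-endian machine, Md5ToInt32(s, 0) should match BitConverter.ToInt32(hashed, 0) from the snippet above. This only makes the value reproducible; it does nothing to reduce collisions.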

If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.

If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 77,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.

As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32-bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is slightly larger, closer to 77,000, but the rule of thumb gets you in the ballpark.

Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.
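These rules of thumb can be checked with the usual birthday-problem approximation, p ≈ 1 - exp(-n(n-1) / (2N)), where n is the number of records and N the number of possible hash values. A minimal sketch, assuming the hash values are uniformly distributed (Birthday and CollisionProbability are hypothetical names):

```csharp
using System;

public static class Birthday
{
    // Approximate probability that n uniformly random values drawn from
    // `space` possible values contain at least one duplicate:
    // p ≈ 1 - exp(-n(n-1) / (2 * space)).
    public static double CollisionProbability(double n, double space)
    {
        return 1.0 - Math.Exp(-n * (n - 1.0) / (2.0 * space));
    }

    public static void Main()
    {
        double space = Math.Pow(2, 32); // number of distinct 32-bit hash codes
        foreach (double n in new[] { 65536.0, 77163.0, 262144.0, 1000000.0 })
        {
            Console.WriteLine($"{n,9:N0} records: {CollisionProbability(n, space):P2}");
        }
    }
}
```

Under this approximation, 2^16 records give a bit under 40%, the 50% point falls near 77,000 records, and 2^18 records give well over 99%.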

That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.
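If a 64-bit integer is acceptable (an assumption; the question only says "integer"), the same square-root rule puts the 50% point near 2^32, roughly 4 billion records. The sketch below uses the FNV-1a hash, which is not mentioned in this thread and is swapped in here only as a simple, deterministic example. DeterministicHash and Fnv1a64 are hypothetical names:

```csharp
using System;
using System.Text;

public static class DeterministicHash
{
    // 64-bit FNV-1a over the UTF-8 bytes of the string. The result depends
    // only on the input bytes, not on the machine or runtime.
    public static ulong Fnv1a64(string text)
    {
        const ulong offsetBasis = 14695981039346656037UL; // FNV-1a 64-bit offset basis
        const ulong prime = 1099511628211UL;              // FNV 64-bit prime

        ulong hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            hash ^= b;
            unchecked { hash *= prime; } // wrap-around on overflow is intended
        }
        return hash;
    }

    public static void Main()
    {
        Console.WriteLine(Fnv1a64("abcd"));
        Console.WriteLine(Fnv1a64("abcd") == Fnv1a64("abcd")); // same input, same value
    }
}
```

FNV-1a is not a cryptographic hash, and how evenly it spreads real data is something to measure rather than assume.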

This code maps any string to an int from 0 to 99:

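// Needs using System.Linq. The lambda parameter is renamed so it does not
// reuse the name of the local x, which older C# compilers reject.
int x = "ali".Sum(c => (int)c) % 100;

using (MD5 md5 = MD5.Create())
{
    // Encoding.UTF8 rather than Encoding.Default, whose code page can differ between machines.
    bigInteger = new BigInteger(md5.ComputeHash(Encoding.UTF8.GetBytes(myString)));
}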

BigInteger requires Org.BouncyCastle.Math.
