.NET C# String.GetHashCode() alternative

I have a problem comparing a lot of string data (CSV files). These files have a unique ID, but they are not sorted and they are quite big.

So I tried to create two dictionaries where the key is the unique ID from the file and the value is an int holding the GetHashCode() of the string I want to check for changes.

But here is a short example:

if ("30000100153:135933:Wuchterlova:335:2:Praha:16000".GetHashCode() == 
    "30000263338:158364:Radošovická:1323:10:Praha:10000".GetHashCode())
{
    Console.WriteLine("Hmm that's strange");
}

So is there any other way to do that?

I need as small a footprint as possible (because of the memory allocated for two dictionaries built from two CSV files containing about 3M rows). Thank you.

First of all, the documentation for string.GetHashCode specifically says not to use string hash codes for any application where they need to be stable over time, because they are not. You should be using string hash codes for one purpose only, and that is to put strings in a dictionary.

Second, hash codes are not unique. There are only about four billion possible hash codes (because the hash code is a 32 bit integer) but obviously there are more than four billion strings, so there must be many strings that have the same hash code. Collisions also show up far sooner than that number suggests: by the birthday problem, a collection of roughly 77,000 strings with well-distributed hash codes already has about a 50% chance of containing two strings with the same hash code, and at 3M strings a collision is practically certain. A graph of the probability is here:

http://blogs.msdn.com/b/ericlippert/archive/2010/03/22/socks-birthdays-and-hash-collisions.aspx
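To put rough numbers on the collision argument, here is a minimal sketch of the usual birthday-problem approximation. It assumes hash codes are spread uniformly over all 2^32 values, which real string hashes only approximate; the class and method names are made up for this illustration.

```csharp
using System;

public static class BirthdayEstimate
{
    // Approximate probability that at least two of n items share a hash code,
    // ASSUMING the hash codes are uniformly distributed over 2^bits values.
    // Uses the standard approximation p = 1 - exp(-n(n-1) / (2 * 2^bits)).
    public static double CollisionProbability(long n, int bits = 32)
    {
        double space = Math.Pow(2.0, bits);          // number of possible hash codes
        double pairs = (double)n * (n - 1) / 2.0;    // number of distinct pairs of items
        return 1.0 - Math.Exp(-pairs / space);
    }
}
```

Calling BirthdayEstimate.CollisionProbability(77163) gives a value very close to 0.5, and for 3,000,000 items the result is indistinguishable from 1. The estimate says nothing about which strings collide, only how likely it is that some pair does.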

So you might wonder how the dictionary works at all then, if it is using GetHashCode but there can be collisions. The answer is: when you put two things X and Y that have the same hash code in a dictionary, they go in the same "bucket". When you search for X, the dictionary goes to the right bucket using the hash code, and then does the expensive equality check on each element in the bucket until it finds the right one. Since each bucket is small, this check is still fast enough most of the time.
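A small sketch can make the bucket behaviour visible. AlwaysCollidingComparer below is a hypothetical comparer written only for this illustration: it forces every key to the same hash code, so every lookup has to fall back on the equality check. The dictionary still returns the right values; it just gets slower as the number of colliding keys grows.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer for illustration only: every string gets hash code 0,
// so all keys collide and equality checks do all the work.
public sealed class AlwaysCollidingComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y)
    {
        return string.Equals(x, y, StringComparison.Ordinal);
    }

    public int GetHashCode(string s)
    {
        return 0; // deliberately useless hash
    }
}

public static class BucketDemo
{
    // Builds a dictionary whose two keys have identical hash codes.
    public static Dictionary<string, int> Build()
    {
        var d = new Dictionary<string, int>(new AlwaysCollidingComparer());
        d["X"] = 1;
        d["Y"] = 2;
        return d;
    }
}
```

Looking up "X" and "Y" in the dictionary returned by BucketDemo.Build() yields 1 and 2 respectively, even though the hash code alone cannot tell the two keys apart. That is why a hash code by itself is not a reliable way to decide whether two strings are equal.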

I don't know how to solve your problem, but using a 32 bit hash is clearly not the right way to do it, so try something else. My suggestion would be to start using a database rather than CSV files if you have a lot of data to manage. That's what a database is for.

I have written many articles on string hashing that might interest you:

http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/

http://blogs.msdn.com/b/ericlippert/archive/2011/07/12/what-curious-property-does-this-string-have.aspx

http://blogs.msdn.com/b/ericlippert/archive/2005/10/24/do-not-use-string-hashes-for-security-purposes.aspx

http://blogs.msdn.com/b/ericlippert/archive/tags/hashing/

You don't actually want to use GetHashCode. You should just compare the strings directly. However, comparing each of 3M strings against each of another 3M strings is not going to finish in any reasonable time unless you sort the lists first.

My approach would be to sort each list first (how to do that depends on a number of things), then read the first record from each sorted list (let's call them A and B) and:

  1. if A = B, do whatever, read the next A and the next B, and repeat
  2. if A < B, do whatever, read the next A, and repeat
  3. if A > B, do whatever, read the next B, and repeat

.. where 'do whatever' means do whatever is required in that situation, and 'repeat' means go back to step 1.
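The three steps can be sketched in C# roughly as follows. This is only an illustration under some assumptions: both inputs are already sorted with the same ordinal string ordering, each element is the full text to compare, and the three callbacks (inBoth, onlyInA, onlyInB) are placeholder names standing in for 'do whatever'. The two trailing loops handle whatever is left once one list runs out, a case the steps above leave implicit.

```csharp
using System;
using System.Collections.Generic;

public static class SortedMerge
{
    // Walks two sequences that are ALREADY sorted in ordinal order and
    // reports each item as present in both, only in A, or only in B.
    public static void Compare(
        IEnumerable<string> sortedA,
        IEnumerable<string> sortedB,
        Action<string> inBoth,
        Action<string> onlyInA,
        Action<string> onlyInB)
    {
        using (IEnumerator<string> a = sortedA.GetEnumerator())
        using (IEnumerator<string> b = sortedB.GetEnumerator())
        {
            bool hasA = a.MoveNext();
            bool hasB = b.MoveNext();

            while (hasA && hasB)
            {
                int order = string.CompareOrdinal(a.Current, b.Current);
                if (order == 0)
                {
                    // Step 1: A = B, advance both lists.
                    inBoth(a.Current);
                    hasA = a.MoveNext();
                    hasB = b.MoveNext();
                }
                else if (order < 0)
                {
                    // Step 2: A < B, so A has no match in B; advance A.
                    onlyInA(a.Current);
                    hasA = a.MoveNext();
                }
                else
                {
                    // Step 3: A > B, so B has no match in A; advance B.
                    onlyInB(b.Current);
                    hasB = b.MoveNext();
                }
            }

            // One list is exhausted; everything left in the other is unmatched.
            while (hasA) { onlyInA(a.Current); hasA = a.MoveNext(); }
            while (hasB) { onlyInB(b.Current); hasB = b.MoveNext(); }
        }
    }
}
```

Because the method only ever holds one current item from each input, it can be fed from lazily read files (for example File.ReadLines) rather than from in-memory collections, provided the files were sorted beforehand with the same ordinal comparison.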

(This process is how mainframe computers used to merge stacks of cards, and it has a particular name, but I can't for the life of me remember it!)

Cheers -
