
Dictionary with two hash functions in C#?

I've got a huge (>>10m) list of entries. Each entry offers two hash functions:

  • Cheap: quickly computes a hash, but its distribution is terrible (it may put 99% of items in 1% of the hash space)
  • Expensive: takes a lot of time to compute, but its distribution is much better

An ordinary Dictionary lets me use only one of these hash functions. I'd like a Dictionary that uses the cheap hash function first, and checks the expensive one on collisions.

It seems like a good idea to use a dictionary inside a dictionary for this. I currently basically use this monstrosity:

Dictionary<int, Dictionary<int, List<Foo>>>;

I improved this design so the expensive hash gets called only if there are actually two items with the same cheap hash.

It fits perfectly and does a flawless job for me, but it looks like something that should have died 65 million years ago.

To my knowledge, this functionality is not included in the basic framework. I am about to write a DoubleHashedDictionary class, but I wanted to hear your opinions first.
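For illustration, a minimal sketch of that lazy two-level scheme could look like the following (TwoLevelIndex and its member names are invented here, not the planned DoubleHashedDictionary). Each cheap-hash slot holds a single entry until a collision forces promotion to an inner dictionary keyed by the expensive hash:

using System;
using System.Collections.Generic;

class TwoLevelIndex<T>
{
    private readonly Func<T, int> _cheap;
    private readonly Func<T, int> _expensive;

    // A slot is either a single boxed T (expensive hash never computed)
    // or an inner Dictionary keyed by the expensive hash.
    private readonly Dictionary<int, object> _outer = new Dictionary<int, object>();

    public TwoLevelIndex(Func<T, int> cheap, Func<T, int> expensive)
    {
        _cheap = cheap;
        _expensive = expensive;
    }

    public void Add(T item)
    {
        int c = _cheap(item);
        object slot;
        if (!_outer.TryGetValue(c, out slot))
        {
            _outer[c] = item; // first entry for this cheap hash: defer the expensive hash
            return;
        }

        var inner = slot as Dictionary<int, List<T>>;
        if (inner == null)
        {
            // First cheap-hash collision: promote the lone entry into an
            // inner dictionary, paying the expensive hash for it now.
            inner = new Dictionary<int, List<T>>();
            AddInner(inner, (T)slot);
            _outer[c] = inner;
        }
        AddInner(inner, item);
    }

    private void AddInner(Dictionary<int, List<T>> inner, T item)
    {
        int e = _expensive(item);
        List<T> bucket;
        if (!inner.TryGetValue(e, out bucket))
        {
            bucket = new List<T>();
            inner[e] = bucket;
        }
        bucket.Add(item);
    }
}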

As for my specific case:
  • First hash function = number of files in a file system directory (fast)
  • Second hash function = sum of the sizes of those files (slow)
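Under those assumptions, the two functions might look roughly like this (DirHashes and the method names are placeholders; how cheap FileInfo.Length is to read depends on the platform):

using System.IO;
using System.Linq;

static class DirHashes
{
    // Cheap: one directory listing, just count the entries.
    public static int CheapHash(DirectoryInfo dir)
    {
        return dir.GetFiles().Length;
    }

    // Expensive: touches each file's metadata to total the sizes;
    // truncating the long sum to int is fine for a hash code.
    public static int ExpensiveHash(DirectoryInfo dir)
    {
        return (int)dir.GetFiles().Sum(f => f.Length);
    }
}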

Edits:

  • Changed the title and added more information.
  • Added a quite important missing detail.

In your case, you are technically using a modified hash function (A|B), not double hashing. However, depending on how huge your "huge" list of entries is and on the characteristics of your data, consider the following:

  • A hash table that is 20% full with a not-so-good distribution can have a more than 80% chance of collision. This means your expected function cost could be (0.8 expensive + 0.2 cheap) + (cost of lookups). So if your table is more than 20% full, it may not be worth using the (A|B) scheme.

  • Coming up with a perfect hash function is possible, but it is O(n^3), which makes it impractical.

  • If performance is supremely important, you can build a hash table specifically tuned to your data by testing various hash functions on your key data, for example by measuring their collision rates on a sample, as in the sketch below.
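A crude way to run that comparison (a sketch only; HashTuning and CollisionRate are names made up here):

using System;
using System.Collections.Generic;
using System.Linq;

static class HashTuning
{
    // Fraction of sample keys whose hash collides with some other key's.
    public static double CollisionRate<T>(IList<T> keys, Func<T, int> hash)
    {
        int distinct = keys.Select(hash).Distinct().Count();
        return 1.0 - (double)distinct / keys.Count;
    }
}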

Have you taken a look at the Power Collections or C5 Collections libraries? The Power Collections library hasn't had much action recently, but the C5 stuff seems to be fairly up to date.

I'm not sure whether either library has exactly what you need, but they're pretty useful and they're open source, so they may provide a decent base implementation for you to extend with your desired functionality.

You're basically talking about a hash table of hash tables, each using a different GetHashCode implementation... While it's possible, I think you'd want to consider seriously whether you'll actually get a performance improvement over just doing one or the other...

Will there actually be a substantial number of objects that can be located via the quick-hash mechanism alone, without having to resort to the more expensive one to narrow things down further? Because if you can't locate a significant number purely off the first calculation, you really save nothing by doing it in two steps (not knowing the data, it's hard to predict whether this is the case).

If a significant number can be located in one step, then you'll probably have to do a fair bit of tuning to work out how many records to store at each hash location of the outer table before resorting to the inner "expensive" hashtable lookup rather than further processing of the hashed data. Under certain circumstances I can see how you'd get a performance gain from this (the circumstances would be few and far between, but they aren't inconceivable).

Edit

I just saw your amendment to the question - you plan to do both lookups regardless... I doubt you'll get any performance benefit from this that you couldn't get just by configuring the main hash table a bit better. Have you tried using a single dictionary with an appropriate capacity passed in the constructor, and perhaps an XOR of the two hash codes as your hash code?
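That suggestion might look something like this (XorComparer is a name invented here; the two delegates stand in for the entry's cheap and expensive hash functions):

using System;
using System.Collections.Generic;

// Comparer whose hash code is the XOR of the two per-entry hashes.
class XorComparer<T> : IEqualityComparer<T>
{
    private readonly Func<T, int> _h1;
    private readonly Func<T, int> _h2;

    public XorComparer(Func<T, int> h1, Func<T, int> h2)
    {
        _h1 = h1;
        _h2 = h2;
    }

    public bool Equals(T a, T b)
    {
        return EqualityComparer<T>.Default.Equals(a, b);
    }

    public int GetHashCode(T item)
    {
        return _h1(item) ^ _h2(item);
    }
}

// Usage, pre-sized so the >10m entries don't force repeated rehashing
// (Foo is the entry type from the question; CheapHash/ExpensiveHash are hypothetical):
// var table = new Dictionary<Foo, Foo>(12000000, new XorComparer<Foo>(CheapHash, ExpensiveHash));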

First off, I think you're on the right path in implementing your own hash table, if what you are describing is truly needed. But as a critic, I'd like to ask a few questions:

Have you considered using something more unique for each entry?

I am assuming that each entry is a piece of file system directory information; have you considered using its full path as the key, perhaps prefixed with the computer name/IP address?

On the other hand, if you're using the number of files as a hash key, are those directories never going to change? Because if the hash key/result changes, you will never be able to find the entry again.

While on this topic, if the directory content/size is never going to change, can you store that value somewhere to save the time of actually calculating it?
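A simple way to cache it (a sketch; CachedEntry is a made-up name) is to wrap the expensive computation in a Lazy<int>, so it runs at most once per entry:

using System;

// Entry that computes its expensive hash at most once and caches it.
class CachedEntry
{
    private readonly Lazy<int> _expensiveHash;

    public int CheapHash { get; private set; }

    public int ExpensiveHash
    {
        get { return _expensiveHash.Value; } // computed on first access only
    }

    public CachedEntry(int cheapHash, Func<int> computeExpensive)
    {
        CheapHash = cheapHash;
        _expensiveHash = new Lazy<int>(computeExpensive);
    }
}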

Just my two cents.
