
BST or Hash Table?

I have large text files upon which all kinds of operations need to be performed, mostly involving row-by-row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.

In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python-based web application could be handling simultaneous requests for large data sets.

My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string; otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger than that and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
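For concreteness, a minimal sketch of this kind of interning BST might look like the following (the node layout and the intern helper are illustrative assumptions, not the actual library code):

#include <stdlib.h>
#include <string.h>

typedef struct node {
    char *value;                    /* owned copy of the string */
    struct node *left, *right;
} node;

/* Return the canonical storage for s, allocating a copy only on first sight. */
const char *intern(node **root, const char *s)
{
    while (*root) {
        int cmp = strcmp(s, (*root)->value);
        if (cmp == 0)
            return (*root)->value;  /* seen before: share the existing storage */
        root = (cmp < 0) ? &(*root)->left : &(*root)->right;
    }
    node *n = malloc(sizeof *n);
    if (!n)
        return NULL;                /* allocation failure left to the caller */
    n->value = strdup(s);
    n->left = n->right = NULL;
    *root = n;
    return n->value;
}

Every unique value here carries two child pointers plus allocator overhead on top of the string itself, and every probe repeats strcmp against values that often share long prefixes, which is consistent with the memory growth and slowdown described above.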

This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be ten, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?

What are the best data structure candidates in a situation like this?

Thank you for your time.

Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportional to n; however, the actual time per insertion may vary wildly. In practice, one of the insertions will be very slow while the others will be very fast, but the average over all operations is small. The reason for this is that when you insert an extra element that forces the table to expand from, say, 1,000,000 to 1,500,000 elements, that insert will take a lot of time, but now you've bought yourself 500,000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
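To make the amortized argument concrete, here is a small self-contained simulation (illustrative only, not taken from any real hash table implementation): it grows a hypothetical capacity by 50% whenever it fills, counts the total rehashing work caused by ten million inserts, and reports the average cost per insert.

#include <stdio.h>

int main(void)
{
    unsigned long long capacity = 8, count = 0, rehash_work = 0;
    const unsigned long long n = 10000000ULL;    /* ten million inserts */

    for (unsigned long long i = 0; i < n; i++) {
        if (count == capacity) {
            rehash_work += count;                /* every stored entry moves once */
            capacity += capacity / 2;            /* grow by 50% */
        }
        count++;                                 /* the insert itself is O(1) */
    }
    printf("inserts: %llu  rehash work: %llu  average cost per insert: %.2f\n",
           n, rehash_work, 1.0 + (double)rehash_work / (double)n);
    return 0;
}

The total rehash work stays proportional to n, so the reported average cost per insert is a small constant even though the individual inserts that trigger a resize are much slower than the rest.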

You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. When the hash table is resized, the key size doubles (add an extra bit to the key), and in all the new buckets I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:

n = n * 2;                                      /* double the bucket count */
bucket = realloc(bucket, sizeof(*bucket) * n);  /* size of an element, not of the pointer */
for (i = 0, j = n / 2; j < n; i++, j++) {
    bucket[j] = bucket[i];                      /* each new bucket points back to its
                                                   counterpart; entries rehash lazily */
}

"library in C that I hope to make available as a Python module"

Python already has very efficient, finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed humps that you find, perhaps by using Cython.

setup code:

shared_table = {}
string_sharer = shared_table.setdefault

scrunching each input row:

for i, field in enumerate(fields):
    fields[i] = string_sharer(field, field)
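Here setdefault(field, field) returns the string already stored under an equal key if one exists, and otherwise stores field and returns it, so rows that repeat a value end up referencing a single str object instead of holding their own copies.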

You may of course find, after examining each column, that some columns don't compress well and should be excluded from "scrunching".
