
Simple, space-efficient implementations of an associative collection in C?

Original post: 2011-03-06 20:37:06. Tags: c, data-structures, memory-management, hashtable, associative-array

I am looking for an associative collection that supports both retrieval and insertion of values by key (deletion is not important) in O(log N) time or better, and that has very low memory overhead in terms of both code size and run-time memory consumption.

I am doing this for a small embedded application written in C, so I am trying to minimize the amount of code required, and the amount of memory consumed.

The Google sparse hash data structure would be a possibility if it weren't written in C++ and were simpler.

Most hash table implementations that I am aware of use a fair amount of extra space, requiring at least twice as much space as the total number of key/value pairs, or else requiring extra pointers per entry (e.g. hash tables with bucket chaining). In my structure, a key/value pair is just two pointers.

Currently I am using a sorted array of key/value pairs, but insertion is O(N). I can't help but think there must be a clever way to improve the amortized running time of insertion, for example by doing the insertions in groups, but I am not having any success.

I think this must be a relatively well-known problem in certain circles, so to keep this from being too subjective: what is the most common solution to the problem stated above?

[EDIT:]

Some additional information that could be relevant:

Keys are integers.
The number of values could be anywhere from 1 to 2^32.
Usage patterns are unpredictable.
I am hoping to keep memory consumption as low as possible (e.g. doubling the amount of memory required would not be ideal).

3 answers

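Look into binary search trees, and to overcome the worst case (where both search and insertion have O(n) complexity), use a balanced tree.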

You could use a hash table that doesn't use chaining, such as a linear probing or cuckoo hashing scheme. The backing store is just an array, and with a load factor of around 0.5 the overhead won't be too bad. The implementation complexity (at least for linear or quadratic probing) isn't too high either.
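As a rough illustration of the linear probing idea, here is a minimal sketch. It assumes 32-bit integer keys, non-NULL value pointers (a NULL value marks an empty slot), a caller-supplied array whose size is a power of two, and no deletion. The `lp_` names and the hash function are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot layout: just a key and a value pointer.
   A NULL value marks an empty slot, so stored values must be non-NULL. */
typedef struct { uint32_t key; void *value; } lp_slot;

typedef struct {
    lp_slot *slots;  /* caller-provided storage */
    size_t   cap;    /* number of slots, must be a power of two */
    size_t   used;   /* number of occupied slots */
} lp_table;

static void lp_init(lp_table *t, lp_slot *storage, size_t cap) {
    assert(cap != 0 && (cap & (cap - 1)) == 0);
    for (size_t i = 0; i < cap; i++) { storage[i].key = 0; storage[i].value = NULL; }
    t->slots = storage; t->cap = cap; t->used = 0;
}

/* Illustrative multiplicative hash; any decent integer hash would do. */
static size_t lp_hash(uint32_t k, size_t cap) {
    uint32_t h = k * 2654435761u;
    h ^= h >> 16;
    return (size_t)h & (cap - 1);
}

/* Returns 0 on success, -1 if the table is full or value is NULL. */
static int lp_put(lp_table *t, uint32_t key, void *value) {
    if (value == NULL) return -1;
    size_t i = lp_hash(key, t->cap);
    for (size_t n = 0; n < t->cap; n++) {
        lp_slot *s = &t->slots[i];
        if (s->value == NULL) {            /* empty slot: claim it */
            s->key = key; s->value = value; t->used++;
            return 0;
        }
        if (s->key == key) { s->value = value; return 0; }  /* overwrite */
        i = (i + 1) & (t->cap - 1);        /* step to the next slot, wrapping */
    }
    return -1;
}

/* Returns the stored value, or NULL if the key is absent. */
static void *lp_get(const lp_table *t, uint32_t key) {
    size_t i = lp_hash(key, t->cap);
    for (size_t n = 0; n < t->cap; n++) {
        const lp_slot *s = &t->slots[i];
        if (s->value == NULL) return NULL; /* hit a hole: key was never stored */
        if (s->key == key) return s->value;
        i = (i + 1) & (t->cap - 1);
    }
    return NULL;
}
```

The caller decides how full to let the table get by comparing `used` against `cap`. Growing the table (allocate a bigger array and re-insert everything) is left out to keep the sketch short.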

If you want a good implementation of a binary search tree that has excellent guarantees on performance and isn't too hard to code up, consider looking into splay trees. They guarantee amortized O(lg n) lookups, and require just two pointers per node. The balance step is also substantially easier than in most balanced BSTs.
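For a feel of how little code a splay tree needs, here is a sketch of the well-known top-down splay formulation. It assumes 32-bit integer keys and caller-allocated nodes. The `sp_` names are invented for this example, and deletion is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct sp_node {
    uint32_t key;
    void *value;
    struct sp_node *left, *right;  /* the only per-node overhead */
} sp_node;

/* Top-down splay: moves the node with `key` (or the last node visited
   on the search path) to the root and returns the new root. */
static sp_node *sp_splay(sp_node *t, uint32_t key) {
    sp_node N, *l = &N, *r = &N, *y;
    if (t == NULL) return NULL;
    N.left = N.right = NULL;
    for (;;) {
        if (key < t->key) {
            if (t->left == NULL) break;
            if (key < t->left->key) {          /* rotate right */
                y = t->left; t->left = y->right; y->right = t; t = y;
                if (t->left == NULL) break;
            }
            r->left = t; r = t; t = t->left;   /* link into right tree */
        } else if (key > t->key) {
            if (t->right == NULL) break;
            if (key > t->right->key) {         /* rotate left */
                y = t->right; t->right = y->left; y->left = t; t = y;
                if (t->right == NULL) break;
            }
            l->right = t; l = t; t = t->right; /* link into left tree */
        } else {
            break;
        }
    }
    /* Reassemble the left tree, the middle node, and the right tree. */
    l->right = t->left; r->left = t->right;
    t->left = N.right;  t->right = N.left;
    return t;
}

/* Inserts caller-allocated node n and returns the new root.
   If the key already exists, only its value is updated and n is unused. */
static sp_node *sp_insert(sp_node *root, sp_node *n) {
    assert(n != NULL);
    n->left = n->right = NULL;
    if (root == NULL) return n;
    root = sp_splay(root, n->key);
    if (n->key < root->key) {
        n->left = root->left; n->right = root; root->left = NULL;
    } else if (n->key > root->key) {
        n->right = root->right; n->left = root; root->right = NULL;
    } else {
        root->value = n->value;
        return root;
    }
    return n;
}

/* Lookup restructures the tree, so it takes the root by address. */
static void *sp_get(sp_node **root, uint32_t key) {
    if (*root == NULL) return NULL;
    *root = sp_splay(*root, key);
    return (*root)->key == key ? (*root)->value : NULL;
}
```

Note that every node carries two child pointers on top of the key and value, so per-entry memory is roughly double that of a plain array of pairs.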

I'd probably use a hash table with double hashing to resolve collisions. The general idea is to hash your original value, and if that collides, compute a second hash that gives a step value, which you then use to walk through the array until you find a free slot for the value. This makes quite good use of memory, as it has no overhead for pointers, and it retains reasonable efficiency at much higher load factors than linear probing.
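Here is a small sketch of that double hashing scheme. The assumptions are the same kind as before: 32-bit integer keys, non-NULL values (NULL marks an empty slot), a zero-initialized array with a power-of-two size, and no deletion. The `dh_` names and the mixing constants are illustrative only. The step is forced to be odd so that it is coprime with the power-of-two capacity and the probe sequence visits every slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot layout; a NULL value marks an empty slot. */
typedef struct { uint32_t key; void *value; } dh_slot;

/* Illustrative integer mixer; not tuned for any particular key set. */
static uint32_t dh_mix(uint32_t k) {
    k ^= k >> 16; k *= 0x45d9f3bu; k ^= k >> 16;
    return k;
}

/* First hash: where probing starts. */
static size_t dh_start(uint32_t k, size_t cap) {
    return (size_t)dh_mix(k) & (cap - 1);
}

/* Second hash: the step between probes, always odd for cap >= 2. */
static size_t dh_step(uint32_t k, size_t cap) {
    return (size_t)(dh_mix(k ^ 0x9e3779b9u) | 1u) & (cap - 1);
}

/* Returns the slot holding `key`, or the first empty slot on its probe
   path, or NULL if the table is full and the key is absent. */
static dh_slot *dh_find(dh_slot *slots, size_t cap, uint32_t key) {
    size_t i = dh_start(key, cap), step = dh_step(key, cap);
    for (size_t n = 0; n < cap; n++) {
        dh_slot *s = &slots[i];
        if (s->value == NULL || s->key == key) return s;
        i = (i + step) & (cap - 1);
    }
    return NULL;
}

/* Returns 0 on success, -1 if there is no room. */
static int dh_put(dh_slot *slots, size_t cap, uint32_t key, void *value) {
    assert(value != NULL && cap != 0 && (cap & (cap - 1)) == 0);
    dh_slot *s = dh_find(slots, cap, key);
    if (s == NULL) return -1;
    s->key = key; s->value = value;
    return 0;
}

/* Returns the stored value, or NULL if the key is absent. */
static void *dh_get(dh_slot *slots, size_t cap, uint32_t key) {
    dh_slot *s = dh_find(slots, cap, key);
    return s ? s->value : NULL;
}
```

Because each key gets its own step, keys that collide on the first slot usually diverge on the second probe, which is what lets the table stay usable at higher load factors than linear probing.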

Edit: If you want a variation of what you're doing right now, one possibility is to handle insertions in clusters: keep a sorted array, and a separate collection of new insertions. When the collection of new insertions gets too large, merge those items into the main collection.

For the secondary collection you have a couple of choices. You can just use an unsorted array and do a linear search, limiting its size to (say) log(M), where M is the size of the main array. In this case, an overall search remains O(log N), imposes no memory overhead, and keeps most insertions quite fast. When you do merge the collections together, you (normally) want to sort the secondary collection, then merge it with the primary. This lets you amortize the linear merge over the number of items in the secondary collection.
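A sketch of that sorted-array-plus-buffer idea follows, under some assumptions of mine: 32-bit integer keys, all inserted keys distinct, and a single caller-provided array whose sorted prefix is the main collection and whose unsorted tail is the secondary one. The `kv_` names, the limit of roughly log2(M), and the 32-entry scratch buffer are illustrative choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KV_PENDING_MAX 32  /* upper bound on the unsorted tail length */

typedef struct { uint32_t key; void *value; } kv;

typedef struct {
    kv    *items;    /* caller-provided storage of `cap` entries */
    size_t cap;
    size_t sorted;   /* items[0 .. sorted) is sorted by key */
    size_t pending;  /* items[sorted .. sorted + pending) is unsorted */
} kv_map;

/* Allowed tail length: roughly log2(sorted), at least 1. */
static size_t kv_limit(size_t sorted) {
    size_t lim = 1;
    while (sorted > 1 && lim < KV_PENDING_MAX) { sorted >>= 1; lim++; }
    return lim;
}

/* Sort the short tail, then merge it into the sorted prefix. */
static void kv_merge(kv_map *m) {
    kv tmp[KV_PENDING_MAX];
    size_t k = m->pending;
    const kv *tail = m->items + m->sorted;
    assert(k <= KV_PENDING_MAX);
    /* Insertion sort of the tail into the scratch buffer. */
    for (size_t i = 0; i < k; i++) {
        size_t j = i;
        while (j > 0 && tmp[j - 1].key > tail[i].key) { tmp[j] = tmp[j - 1]; j--; }
        tmp[j] = tail[i];
    }
    /* Merge from the back so each main entry moves at most once. */
    size_t i = m->sorted, w = m->sorted + k;
    while (k > 0) {
        if (i > 0 && m->items[i - 1].key > tmp[k - 1].key) m->items[--w] = m->items[--i];
        else m->items[--w] = tmp[--k];
    }
    m->sorted += m->pending;
    m->pending = 0;
}

/* Returns 0 on success, -1 if the array is full. Keys must be distinct. */
static int kv_put(kv_map *m, uint32_t key, void *value) {
    if (m->sorted + m->pending >= m->cap) return -1;
    m->items[m->sorted + m->pending] = (kv){ key, value };
    m->pending++;
    if (m->pending >= kv_limit(m->sorted)) kv_merge(m);
    return 0;
}

/* Binary search of the sorted prefix, then linear scan of the tail. */
static void *kv_get(const kv_map *m, uint32_t key) {
    size_t lo = 0, hi = m->sorted;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (m->items[mid].key < key) lo = mid + 1; else hi = mid;
    }
    if (lo < m->sorted && m->items[lo].key == key) return m->items[lo].value;
    for (size_t i = 0; i < m->pending; i++)
        if (m->items[m->sorted + i].key == key) return m->items[m->sorted + i].value;
    return NULL;
}
```

With this layout the only extra memory is the small scratch buffer on the stack during a merge. Each merge is still linear in M, but it happens only once per roughly log2(M) insertions, so the amortized insertion cost drops by about that factor rather than becoming logarithmic.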

Alternatively, you can use a tree for your secondary collection. This means newly inserted items use extra storage for pointers, but (again) keeping that collection small limits the overhead.