mmap-loadable data structure library for C++ (or C)

I have some large data structures (N > 10,000) that usually only need to be created once (at runtime) and can be reused many times afterwards, but they need to be loaded very quickly. (They are used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.

Are there any data structure libraries for C++ (or C) that support this? Something along the lines of:

ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.

Thank you!


Details:

I've written a similar class for a hash table myself, but I find it pretty hard to maintain, so I would like to see if there are existing solutions already. The library should:

  • Contain a creation routine that serializes the data structure into a file. This part doesn't need to be fast.
  • Contain a loading routine that mmaps a file into a read-only (or read-write) data structure that is usable within O(1) steps of processing.
  • Use O(N) disk/memory space with a small constant factor. (The device has serious memory constraints.)
  • Add only a small time overhead to accessors (i.e. their complexity isn't changed).

Assumptions:

  • The bit representation of the data (e.g. endianness, encoding of float, etc.) does not matter, since it is only used locally.
  • So far the possible types of data I need are integers, strings, and structs of them. Pointers do not appear.
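To make the requirements concrete, here is a minimal sketch of the kind of class being asked for, under the assumptions above (local use only, no pointers in the data). It is not taken from any existing library: the file layout, the names Header, Slot, mix and write_table, and the restriction to 32-bit integer keys and values are all assumptions made for illustration. The table is a flat open-addressing array, so the file can be used directly from the mapping without any parsing step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical file layout: one Header followed by `capacity` Slots.
struct Header { std::uint32_t capacity; };  // always a power of two
struct Slot   { std::uint32_t key; std::int32_t value; std::uint32_t used; };

// Simple integer mixer (an illustrative choice, not a tuned hash).
inline std::uint32_t mix(std::uint32_t x) {
    x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16;
    return x;
}

// Creation routine (need not be fast): build the table in memory, then dump it.
inline void write_table(const char* path,
                        const std::vector<std::pair<std::uint32_t, std::int32_t>>& items) {
    std::uint32_t cap = 8;
    while (cap < 2 * items.size()) cap *= 2;  // load factor <= 0.5, so empty slots exist
    std::vector<Slot> slots(cap, Slot{0, 0, 0});
    for (const auto& kv : items) {
        std::uint32_t i = mix(kv.first) & (cap - 1);
        while (slots[i].used && slots[i].key != kv.first) i = (i + 1) & (cap - 1);
        slots[i] = Slot{kv.first, kv.second, 1};
    }
    std::FILE* f = std::fopen(path, "wb");
    if (!f) throw std::runtime_error("cannot create table file");
    Header h{cap};
    bool ok = std::fwrite(&h, sizeof h, 1, f) == 1 &&
              std::fwrite(slots.data(), sizeof(Slot), cap, f) == cap;
    ok = (std::fclose(f) == 0) && ok;
    if (!ok) throw std::runtime_error("cannot write table file");
}

// Loading routine: mmap the file and probe it in place.
class ReadOnlyHashTable {
public:
    explicit ReadOnlyHashTable(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) throw std::runtime_error("cannot open table file");
        struct stat st;
        if (::fstat(fd, &st) != 0 || st.st_size < static_cast<off_t>(sizeof(Header))) {
            ::close(fd);
            throw std::runtime_error("table file too small");
        }
        size_ = static_cast<std::size_t>(st.st_size);
        base_ = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping remains valid after the descriptor is closed
        if (base_ == MAP_FAILED) throw std::runtime_error("mmap failed");
        std::uint32_t cap = static_cast<const Header*>(base_)->capacity;
        bool valid = cap != 0 && (cap & (cap - 1)) == 0 &&
                     size_ == sizeof(Header) + std::size_t(cap) * sizeof(Slot);
        if (!valid) { ::munmap(base_, size_); throw std::runtime_error("corrupt table file"); }
    }
    ~ReadOnlyHashTable() { ::munmap(base_, size_); }
    ReadOnlyHashTable(const ReadOnlyHashTable&) = delete;
    ReadOnlyHashTable& operator=(const ReadOnlyHashTable&) = delete;

    // Returns true and stores the value in `out` if the key is present.
    bool get(std::uint32_t key, std::int32_t& out) const {
        const Header* h = static_cast<const Header*>(base_);
        const Slot* slots = reinterpret_cast<const Slot*>(h + 1);
        const std::uint32_t mask = h->capacity - 1;
        for (std::uint32_t i = mix(key) & mask; slots[i].used; i = (i + 1) & mask)
            if (slots[i].key == key) { out = slots[i].value; return true; }
        return false;
    }
private:
    void* base_ = MAP_FAILED;
    std::size_t size_ = 0;
};
```

Loading costs one open, one fstat and one mmap regardless of N, and pages are only read in as lookups touch them. The part that is hard to maintain by hand is everything this sketch leaves out: variable-length strings, nested structs, and versioning of the layout.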

P.S. Can Boost.Intrusive help?

You could try to create a memory-mapped file and then create an STL map with a custom allocator. Your custom allocator then simply takes the beginning of the memory-mapped region and increments its pointer according to the requested size. In the end, all the allocated memory should lie within the memory-mapped file and should be reloadable later.

You will have to check whether memory is freed by the STL map. If it is, your custom allocator will lose some of the memory in the memory-mapped file, but if the loss is limited you can probably live with it.
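The allocator described above can be sketched roughly as follows. This is a sketch under stated assumptions, not a complete solution: Arena and BumpAllocator are hypothetical names, and the arena here is an anonymous mapping, whereas persistence would need a file-backed MAP_SHARED mapping. Note two caveats: std::map nodes hold raw pointers, so the contents are only valid again if the file is later mapped at the same address, and the map object itself (including its root bookkeeping) lives outside the arena.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <new>
#include <utility>
#include <sys/mman.h>

// Hypothetical arena: one contiguous mapped region handed out front to back.
struct Arena {
    char* base = nullptr;
    std::size_t size = 0;
    std::size_t used = 0;

    explicit Arena(std::size_t bytes) : size(bytes) {
        void* p = ::mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED) throw std::bad_alloc();
        base = static_cast<char*>(p);
    }
    ~Arena() { ::munmap(base, size); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    bool contains(const void* p) const {
        const char* c = static_cast<const char*>(p);
        return c >= base && c < base + size;
    }
};

// Minimal allocator: bumps a cursor inside the arena and never reuses memory.
template <class T>
struct BumpAllocator {
    using value_type = T;
    Arena* arena;

    explicit BumpAllocator(Arena* a) : arena(a) {}
    template <class U>
    BumpAllocator(const BumpAllocator<U>& other) : arena(other.arena) {}

    T* allocate(std::size_t n) {
        // Round the cursor up to T's alignment (the base is page-aligned).
        std::size_t start = (arena->used + alignof(T) - 1) & ~(alignof(T) - 1);
        std::size_t bytes = n * sizeof(T);
        if (start + bytes > arena->size) throw std::bad_alloc();
        arena->used = start + bytes;
        return reinterpret_cast<T*>(arena->base + start);
    }
    // Freed nodes are simply leaked, which is the loss mentioned above.
    void deallocate(T*, std::size_t) {}
};

template <class T, class U>
bool operator==(const BumpAllocator<T>& a, const BumpAllocator<U>& b) {
    return a.arena == b.arena;
}
template <class T, class U>
bool operator!=(const BumpAllocator<T>& a, const BumpAllocator<U>& b) {
    return !(a == b);
}

using ArenaMap = std::map<int, int, std::less<int>,
                          BumpAllocator<std::pair<const int, int>>>;
```

An ArenaMap is constructed by passing a BumpAllocator that points at a live Arena; the Arena must outlive the map. Because of the raw-pointer caveat, this approach fits "build once, keep mapped" better than "write to disk, reload anywhere".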

Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time optimising the hash function for the particular data, so that there are no hash collisions and (for minimal perfect hash functions) no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.

CMPH claims to cope with large numbers of keys. However, I have never used it.

There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now: maintaining at least some of the code yourself.
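To illustrate the idea only (this is not how CMPH works internally, and it does not use the CMPH API), a toy minimal perfect hash for a handful of keys can be found by brute force: keep trying seeds until every key lands in its own slot of an n-slot table. The names seeded_hash and find_perfect_seed are hypothetical, and the FNV-1a-style hash is an arbitrary choice. Brute force only scales to very small key sets; real tools use far smarter constructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a-style string hash with a seed mixed into the initial state.
inline std::uint32_t seeded_hash(const std::string& s, std::uint32_t seed) {
    std::uint32_t h = 2166136261u ^ seed;
    for (unsigned char c : s) { h ^= c; h *= 16777619u; }
    return h;
}

// Try seeds 0..max_tries-1 until all keys map to distinct slots in [0, n).
// Returns true and stores the seed on success, false if none was found.
inline bool find_perfect_seed(const std::vector<std::string>& keys,
                              std::uint32_t max_tries, std::uint32_t& seed) {
    const std::size_t n = keys.size();
    if (n == 0) { seed = 0; return true; }
    for (std::uint32_t s = 0; s < max_tries; ++s) {
        std::vector<bool> taken(n, false);
        bool ok = true;
        for (const auto& k : keys) {
            std::size_t slot = seeded_hash(k, s) % n;
            if (taken[slot]) { ok = false; break; }  // collision: try next seed
            taken[slot] = true;
        }
        if (ok) { seed = s; return true; }
    }
    return false;
}
```

Once a seed is found, the value for a key lives at index seeded_hash(key, seed) % n in a plain array, with no collision handling and no empty slots. A key that was not in the original set still maps to some slot, so lookups are only trustworthy if the key set is known in advance or the keys are stored alongside the values for verification.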

GVDB (GVariant Database), the core of dconf, is exactly this.

See git.gnome.org/browse/gvdb, dconf and bv
and developer.gnome.org/glib/2.30/glib-GVariant.html

Just thought of another option: Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.

WRT Boost.Intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.

I thought this section looked particularly relevant.

If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
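The offset-from-base idea can be shown without Boost at all. The sketch below is an assumption-laden illustration, not Boost.Intrusive code: Node, kNull, build_list and sum_list are hypothetical names, and the layout (a 4-byte head offset followed by the nodes) is made up for the example. Each link is a byte offset from the start of the buffer, so the structure keeps working wherever the buffer happens to be mapped or copied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical node: `next` is a byte offset from the buffer start, not an address.
struct Node { std::int32_t value; std::uint32_t next; };

// Offset 0 holds the head field, so it can double as the "null" link.
constexpr std::uint32_t kNull = 0;

// Build a singly linked list inside a flat byte buffer.
// Layout: [uint32 head offset][Node 0][Node 1]...
inline std::vector<unsigned char> build_list(const std::vector<std::int32_t>& values) {
    std::vector<unsigned char> buf(sizeof(std::uint32_t) + values.size() * sizeof(Node));
    std::uint32_t head = kNull;
    // Lay nodes out back to front so each one can point at its successor.
    for (std::size_t i = values.size(); i-- > 0;) {
        std::uint32_t off =
            static_cast<std::uint32_t>(sizeof(std::uint32_t) + i * sizeof(Node));
        Node n{values[i], head};
        std::memcpy(buf.data() + off, &n, sizeof n);
        head = off;
    }
    std::memcpy(buf.data(), &head, sizeof head);
    return buf;
}

// Walk the list starting from whatever address the buffer currently has.
inline std::int64_t sum_list(const unsigned char* base) {
    std::uint32_t off;
    std::memcpy(&off, base, sizeof off);
    std::int64_t total = 0;
    while (off != kNull) {
        Node n;
        std::memcpy(&n, base + off, sizeof n);  // memcpy avoids alignment assumptions
        total += n.value;
        off = n.next;
    }
    return total;
}
```

Copying the buffer to a different address (or writing it to a file and mapping it later) leaves every link valid, which is exactly what raw pointers cannot offer. An array subscript is the same trick with the element size factored out.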

There's certainly unordered set/multiset support (C++ code for hash tables).

Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, and to add a layer of collision resolution on top of it if the universe of queried keys is not known beforehand. If you know all keys beforehand, then it is the way to go, since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.

Probably the best option is to use Google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.

http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
