Multiple indexing with big data set of small data: space inefficient?

Original question: 2011-07-22 15:43:36 6 3 c++/ database-design/ boost-multi-index

I am not at all an expert in database design, so I will put my need in plain words before I try to translate it into CS terms: I am trying to find the right way to iterate quickly over large subsets (say ~100 MB of doubles) of data in a potentially very large dataset (say several GB). I have objects that basically consist of 4 integers (the keys) and a value, which is a simple struct (1 double, 1 short). Since my keys can take only a small number of values (a couple of hundred), I thought it would make sense to store my data as a tree (one level of depth per key, with the values as leaves, much like XML's XPath, in my naive view at least).

I want to be able to iterate through subsets of leaves based on key values or on a function of those key values. Which key combination to filter on will vary. I think this is called a transversal search?
So, to avoid comparing the same keys n times, ideally I would need the data structure to be indexed by each permutation of the keys (12 possibilities: 4!/2!). This seems to be what boost::multi_index is for but, unless I'm overlooking something, this would mean actually constructing those 12 tree structures, each storing pointers to my value nodes as leaves. I guess this would be extremely space inefficient, considering the small size of my values compared to the keys.

Any suggestions regarding the design / data structure I should use, or pointers to concise educational materials on these topics, would be much appreciated.

3 answers

With Boost.MultiIndex, you don't need as many as 12 indices to cover all queries comprising a particular subset of the 4 keys (BTW, the number of permutations of 4 elements is 4! = 24, not 12): thanks to the use of composite keys, and with a little ingenuity, 6 indices suffice.

By some happy coincidence, some years ago I provided an example in my blog showing how to do this in a manner that almost exactly matches your particular scenario:

Multiattribute querying with Boost.MultiIndex

Source code is provided that you can hopefully use with little modification to suit your needs. The theoretical justification of the construct is also provided in a series of articles in the same blog.

The maths behind this is not trivial and you can safely ignore it; if you need assistance understanding it, though, do not hesitate to comment on the blog articles.

How much memory does this container use? On a typical 32-bit computer, the size of your objects is 4*sizeof(int)+sizeof(double)+sizeof(short)+padding, which typically yields 32 bytes (checked with Visual Studio on Win32). To this Boost.MultiIndex adds an overhead of 3 words (12 bytes) per index, so for each element of the container you've got

32+6*12 = 104 bytes + padding.

Again, I checked with Visual Studio on Win32 and the size obtained was 128 bytes per element. If you have 1 billion (10^9) elements, then 32 bits is not enough: going to a 64-bit OS will most likely double the size of objects, so the memory needed would amount to 256 GB, which is quite a powerful beast (I don't know whether you are using something as huge as this).

B-Tree indexes and bitmap indexes are two of the major index types in use, but they aren't the only ones. You should explore them. Something to get you started:

Article evaluating when to use B-Tree and when to use Bitmap

It depends on the algorithm accessing it, honestly. If this structure needs to be resident, and you can afford the memory consumption, then just do it. multi_index is fine, though it will destroy your compile times if it's in a header.

If you just need a one-time traversal, then building the structure will be kind of a waste. Something like next_permutation may be a good place to start.