
Alternative to vectors for large data sets? C++

I am looking for a data structure that holds data in the order it is inserted (like a vector) and that can hold millions of unsigned longs. The key is that it needs a lookup that's better than O(log n), because it will be searched against a similar vector of the same size. Does something like this exist?

If I insert 10, 20, 30 and then iterate over the set, I need to guarantee the order 10, 20, 30. Each element is a string that I converted into an unsigned long to reduce memory use; the conversion is reversible.

EDIT: Since people are asking, I am comparing two vectors against each other (both very large) to get the difference.

Small example:

vector 1: 10 20 30 40 50 60

vector 2: 11 24 30 40 55 70 90

result:   30 40

I have never used it myself, and it might be out of date compared to recent C++ features (the last update is from 2011), but STXXL is meant to be a set of containers and algorithms built for very large amounts of data. It might fit your need.

The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the closeness to the STL supports ease of use and compatibility with existing applications, another design priority is high performance.

The key features of STXXL are:

  • Transparent support of parallel disks. The library provides implementations of basic parallel disk algorithms. STXXL is the only external memory algorithm library supporting parallel disks.
  • The library is able to handle problems of very large size (tested to up to dozens of terabytes).
  • Improved utilization of computer resources. STXXL implementations of external memory algorithms and data structures benefit from overlapping of I/O and computation.
  • Small constant factors in I/O volume. A unique library feature called "pipelining" can save more than half the number of I/Os, by streaming data between algorithmic components, instead of temporarily storing them on disk. A development branch supports asynchronous execution of the algorithmic components, enabling high-level task parallelism.
  • Shorter development times due to well known STL-compatible interfaces for external memory algorithms and data structures.
  • STL algorithms can be directly applied to STXXL containers; moreover, the I/O complexity of the algorithms remains optimal in most of the cases.

For internal computation, parallel algorithms from the MCSTL or the libstdc++ parallel mode are optionally utilized, making the algorithms inherently benefit from multi-core parallelism.

A hash map is one way to get faster lookup than a sorted vector. You need C++11 support to use it.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
To preserve the order of the data, the only way would be to maintain a vector beside it that stores the integers as well.
Before you jump to using it, you should consider how you are going to use this data structure (the access pattern). Also consider what the data you will be getting is likely to look like.
Here is Boost's version of the same thing: http://www.boost.org/doc/libs/1_53_0/doc/html/unordered.html
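
For the small example in the question, the idea could be sketched roughly like this. This is only a sketch under assumptions: the function name `common_in_order` is made up for illustration, it uses `std::unordered_set` rather than `unordered_map` because only membership is needed, and it returns the values present in both vectors (which is what the `30 40` result in the question shows), in the order they appear in the first vector.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical helper, not from the question: returns the elements of
// `first` that also occur in `second`, in the order they appear in `first`.
std::vector<unsigned long> common_in_order(const std::vector<unsigned long>& first,
                                           const std::vector<unsigned long>& second) {
    // Build the hash-based index once; each lookup is O(1) on average.
    std::unordered_set<unsigned long> lookup(second.begin(), second.end());

    std::vector<unsigned long> result;
    for (unsigned long value : first) {
        // Walking `first` front to back keeps its insertion order.
        if (lookup.count(value) != 0) {
            result.push_back(value);
        }
    }
    return result;
}
```

The vector keeps the order and the set is only there for lookup, so the cost is one extra copy of the second data set in the hash table.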

I think what you should use is an unordered_map, combined with maybe a doubly-linked list to keep the order.

So every time you add a new item to your database, you first add it to the front (or the end) of the linked list, and then add it to the hashmap, where the key is the value (the unsigned int) and the "value" (from the key/value pair) is the pointer to the object in the linked list. So now if you want a fast lookup you look in the hashmap, and if you want to iterate in order you use the linked list. Of course, when you want to remove an object you have to remove it from both, but complexity-wise it's the same (O(1) amortized for everything).

This will of course increase your memory use by a factor of 2 or 3 compared to just using a hashmap.
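
A minimal sketch of that combination, assuming the values are unique (duplicates are simply rejected here). The class name `InsertionOrderedSet` and its member names are made up for illustration, and the map stores `std::list` iterators instead of raw pointers, which stay valid until the element they refer to is erased.

```cpp
#include <iterator>
#include <list>
#include <unordered_map>

// Hypothetical container: a linked list keeps insertion order, and a hash map
// from value to list iterator gives average O(1) lookup and removal.
class InsertionOrderedSet {
public:
    // Appends `value` at the end; returns false if it is already stored.
    bool insert(unsigned long value) {
        if (index_.count(value) != 0) {
            return false;
        }
        order_.push_back(value);
        index_.emplace(value, std::prev(order_.end()));
        return true;
    }

    // Fast lookup goes through the hash map only.
    bool contains(unsigned long value) const {
        return index_.count(value) != 0;
    }

    // Removes `value` from both structures; returns false if it was absent.
    bool erase(unsigned long value) {
        auto found = index_.find(value);
        if (found == index_.end()) {
            return false;
        }
        order_.erase(found->second);  // unlink the list node via its iterator
        index_.erase(found);
        return true;
    }

    // Iterate over this to visit the values in insertion order.
    const std::list<unsigned long>& items() const {
        return order_;
    }

private:
    std::list<unsigned long> order_;
    std::unordered_map<unsigned long, std::list<unsigned long>::iterator> index_;
};
```

Inserting 10, 20, 30 and then iterating over `items()` visits them in that order, while `contains` never touches the list.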
