
Large Data Set for Processing, need to maintain original data set

Here's my problem definition: I have an array of seven million indices, each containing a label. For simplicity, here's an example of the kind of array I'm dealing with: [1 2 3 3 3 5 2 1 7].

I need to go through this array and, every time I come across a label, record that label's location in a "set" together with all other locations of the same label. With the array being so large, I want to access only a specific label's locations at any given point. So let's say I want to access only the locations of 3, process those locations, and change them to 5's; but I want to do more than just one operation, and I want to do it on all labels, separately. In a small array like my example, it seems trivial to just stick with the array. However, with an array of seven million points, searching for each label and then operating on it is much more time-expensive.

To clear up confusion, using my example, I want the example array to give me the following:

  • 1 mapped to a set containing 0 and 7
  • 2 mapped to a set containing 1 and 6
  • 3 mapped to a set containing 2, 3, and 4
  • 5 mapped to a set containing 5
  • 7 mapped to a set containing 8

Originally, I did my processing on the original array and simply operated on it directly. This took roughly ~30 seconds just to determine the number of corresponding indices for each label (so I was able to determine that the count of 1 was two, the count of 2 was two, the count of 3 was three, etc.), but it did not produce the locations of the labels. Therefore, additional time was spent throughout the rest of my processing finding the locations of each label as well, although this was sped up by terminating the search once all indices of the referenced label had been found.

Next, I used a map<int,set<int>>, but this increased the initial pass to ~100 seconds. It decreased processing time later down the road, but not enough to justify the heavy up-front increase.

I haven't implemented it yet, but as an additional step I am planning to try initializing an array of sets, with the array indices corresponding to the labels, and to do it that way.

I have also tried hash_maps, to no avail. unordered_set and unordered_map are not included in the STL shipped with Visual Studio 2005, so I have not implemented the above with those two structures.

Key points: I have pre-processed the array so that I know the maximum label, and all labels are consecutive (there are no gaps between the minimum label and the maximum). However, they are randomly dispersed in the original array. This may prove useful in initializing a fixed-size data structure. The order of the indices corresponding to a label is not important, and the order of the labels in their data structure is also not important.

Edit: For background, the array corresponds to a binary image. I implemented binary sequential labeling to output an array of the same size as the binary image, as UINT16, with all binary blobs labeled. What I want to do now is obtain a map of the points that make up each blob as efficiently as possible.

Why use such complicated data structures for this task? Just create a vector of vectors to store all the positions of each label, and that's it. You can also avoid repeated vector memory allocation by pre-computing how much space you need for each label. Something like this:

// Sizing the per-label structures by maxLabel + 1 (known from the
// pre-processing step) rather than by N avoids allocating millions
// of unused inner vectors.
vector<int> count(maxLabel + 1);
for (size_t i = 0; i < N; ++i)
    ++count[dataArray[i]];            // occurrences of each label

vector< vector<int> > labels(maxLabel + 1);
vector<int> last(maxLabel + 1);       // next free slot for each label
for (int lab = 0; lab <= maxLabel; ++lab)
    labels[lab].resize(count[lab]);   // pre-allocate exact space

for (size_t i = 0; i < N; ++i) {
    labels[dataArray[i]][last[dataArray[i]]] = i;  // record position i
    ++last[dataArray[i]];
}

This runs in O(N) time, which should be on the order of a second for your seven million integers.

I wouldn't necessarily use general-purpose maps (or hash tables) for this.

My initial gut feeling is that I'd create a second array "positions" of seven million (or whatever N is) locations, and a third array "last_position_for_index" covering the range [min_label, max_label]. Note that this will almost certainly take less storage than any kind of map.

Initialize all the entries of last_position_for_index to some reserved value, and then you can just loop through your array with something like (untested):

for (std::size_t k = 0; k<N; ++k) {
  IndexType index = indices[k];
  // Each position stores the previous position of the same label,
  // threading a per-label linked list through the "positions" array.
  positions[k] = last_position_for_index[index-min_label];
  last_position_for_index[index-min_label] = k;
}

