Memory optimization with huge data sets

Question

Dear all, I have implemented some functions and would like to ask some basic questions, as I do not have a sound fundamental knowledge of C++. I hope you will be kind enough to tell me what a good approach would be, so that I can learn from you. (This is not homework, and I do not have any experts around me to ask.)

What I did is this: I read the input x, y, z point data (around a 3GB data set) from a file, then compute one single value for each point and store it inside a vector (result). That vector is used in the next loop. After that, the vector is not used anymore and I need to get its memory back, as it holds a huge data set. I think I can do this in two ways. (1) By just declaring a vector and later erasing it (see code-1). (2) By allocating the vector dynamically and later de-allocating it (see code-2). I heard that this de-allocation is inefficient because de-allocation itself costs memory, or maybe I misunderstood.

Q1) I would like to know which way is best in terms of memory and efficiency.

Q2) Also, I would like to know whether returning by reference from a function is a good way of giving output. (Please look at code03.)

code-1

int main(){

    //read input data (my_data)

    vector<double> result;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){

         // do some stuff and calculate a "double" value (say value)
         //using each point coordinate 

         result.push_back(value);
    }

    // do some other stuff

    //loop over result and use each value for some other stuff
    for (size_t i=0; i<result.size(); i++){

        //do some stuff
    }

    //result will not be used anymore and thus erase data
    result.clear();
}

code-2

int main(){

    //read input data

    vector<double> *result = new vector<double>;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){

         // do some stuff and calculate a "double" value (say value)
         //using each point coordinate 

         result->push_back(value);
    }

    // do some other stuff

    //loop over result and use each value for some other stuff
    for (size_t i=0; i<result->size(); i++){

        //do some stuff
    }

    //de-allocate memory
    delete result;
    result = 0;
}

code03

vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
{
  vector<Position3D> *points_at_grid_cutting = new vector<Position3D>;
  vector<Position3D>::const_iterator  point;  // const member function: begin() yields a const_iterator

  for (point=begin(); point!=end(); point++) {

       //do some stuff         

  }
  return (*points_at_grid_cutting);
}

Answer 1

For such huge data sets I would avoid using std containers at all and make use of memory-mapped files.

If you prefer to go on with std::vector, swap it with an empty temporary, std::vector<double>().swap(result), to free the allocated memory. Note that vector::clear() on its own only resets the size and keeps the capacity.

Answer 2

erase will not free the memory used for the vector (and neither will clear()). It reduces the size but not the capacity, so the vector still holds enough memory for all those doubles.

The best way to make the memory available again is like your code-1, but let the vector go out of scope:

int main() {
    {
        vector<double> result;
        // populate result
        // use results for something
    }
    // do something else - the memory for the vector has been freed
}

Failing that, the idiomatic way to clear a vector and free the memory is:

vector<double>().swap(result);

This creates an empty temporary vector, then exchanges its contents with result (so result is empty and has a small capacity, while the temporary has all the data and the large capacity). Finally, it destroys the temporary, taking the large buffer with it.
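The difference between clear() and the swap idiom can be observed through capacity(). The following is a minimal sketch; the names CapacityReport and release_demo are made up for illustration, and the exact capacities reported depend on the standard library implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type: capacity of the vector at three points in time.
struct CapacityReport {
    std::size_t filled;       // after filling with n doubles
    std::size_t after_clear;  // after result.clear()
    std::size_t after_swap;   // after swapping with an empty temporary
};

CapacityReport release_demo(std::size_t n) {
    std::vector<double> result(n, 1.0);
    CapacityReport report;
    report.filled = result.capacity();

    result.clear();                      // size becomes 0, the buffer is kept
    report.after_clear = result.capacity();

    std::vector<double>().swap(result);  // the temporary takes the buffer and frees it
    report.after_swap = result.capacity();
    return report;
}
```

On common implementations after_clear equals filled and after_swap is zero. In C++11 and later, result.shrink_to_fit() after clear() expresses the same intent, but it is only a non-binding request.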

Regarding code03: it's not good style to return a dynamically-allocated object by reference, since it gives the caller little reminder that they are responsible for freeing it. Often the best thing to do is return a local variable by value:

vector<Position3D> ReturnLabel(VoxelGrid grid, int segment) const
{
  vector<Position3D> points_at_grid_cutting;
  // do whatever to populate the vector
  return points_at_grid_cutting;
}

The reason is that, provided the caller uses a call to this function as the initialization for their own vector, something called "named return value optimization" (NRVO) kicks in and ensures that although you're returning by value, no copy of the value is made.
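One way to check this claim is to count element copies. The sketch below assumes C++11 or later; TrackedPoint and make_points are hypothetical stand-ins for Position3D and ReturnLabel, which are not shown in full in the question.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Position3D that counts how often it is copied.
struct TrackedPoint {
    static int copies;
    double x, y, z;
    TrackedPoint(double x_, double y_, double z_) : x(x_), y(y_), z(z_) {}
    TrackedPoint(const TrackedPoint& other)
        : x(other.x), y(other.y), z(other.z) { ++copies; }
};
int TrackedPoint::copies = 0;

// Builds a local vector and returns it by value.
std::vector<TrackedPoint> make_points(std::size_t n) {
    std::vector<TrackedPoint> points;
    points.reserve(n);  // no reallocation, so no element copies while filling
    for (std::size_t i = 0; i < n; ++i) {
        const double v = static_cast<double>(i);
        points.emplace_back(v, v, v);  // constructed in place
    }
    return points;  // elided (NRVO) or moved; the elements are not copied
}
```

Initializing a vector from the call, as in std::vector<TrackedPoint> pts = make_points(1000);, leaves TrackedPoint::copies at 0. Under C++11 this holds even if the compiler skips NRVO, because the vector is then moved rather than copied.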

A compiler that doesn't implement NRVO is a bad compiler, and will probably have all sorts of other surprising performance failures. There are some cases where NRVO doesn't apply, though, most importantly when the caller assigns the value to an existing variable instead of using it in an initialization. There are three fixes for this:

1) C++11 introduces move semantics, which basically sorts it out by ensuring that assignment from a temporary is cheap.

2) In C++03, the caller can play a trick called "swaptimization". Instead of:

vector<Position3D> foo;
// some other use of foo
foo = ReturnLabel();

write:

vector<Position3D> foo;
// some other use of foo
ReturnLabel().swap(foo);

3) You write a function with a more complicated signature, such as taking a vector by non-const reference and filling the values into that, or taking an OutputIterator as a template parameter. The latter also gives the caller more flexibility: they need not use a vector to store the results, they could use some other container, or even process the results one at a time without storing the whole lot at once.
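The OutputIterator variant of fix 3 might look roughly like this. It is a sketch under assumptions: Point3D is a hypothetical stand-in for Position3D, and the selection rule (keep points with z at or above a threshold) is invented, since the real body of ReturnLabel is not shown.

```cpp
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical point type; the original Position3D is not shown.
struct Point3D { double x, y, z; };

// Writes every point in [first, last) whose z is at least min_z to `out`.
// The caller decides where the results go by choosing the output iterator.
template <typename InputIt, typename OutputIt>
OutputIt copy_points_above(InputIt first, InputIt last, double min_z, OutputIt out) {
    for (; first != last; ++first) {
        if (first->z >= min_z) {
            *out++ = *first;
        }
    }
    return out;  // one past the last element written, as std::copy does
}
```

A caller that wants a vector passes std::back_inserter(selected); a caller that only wants to print or accumulate can pass a different output iterator and never hold all the results in memory.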

Answer 3

From your code, it seems that each value computed in the first loop is used in the second loop independently of the others. In other words, once you have computed the double value in the first loop, you could act on it immediately, without any need to store all values at once.

If that's the case, you should implement it that way. No worries about large allocations, storage or anything. Better cache performance. Happiness.

Answer 4

    vector<double> result;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){

         // do some stuff and calculate a "double" value (say value)
         //using each point coordinate 

         result.push_back(value);
    }

If the "result" vector ends up holding thousands of values or more, this will result in many reallocations. It would be best to give it a large enough capacity up front with the reserve function (a sized constructor would instead create that many elements, and push_back would then append after them):

vector<double> result;
result.reserve(someSuitableNumber); // e.g. my_data.size()

This will reduce the number of reallocations, and possibly optimize your code further.
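The effect of reserve can be made visible by counting capacity changes while appending. This is a small sketch; count_reallocations is a made-up helper, and the count without reserve depends on the growth policy of the standard library in use.

```cpp
#include <cstddef>
#include <vector>

// Appends n values and counts how often the capacity changed, i.e. how
// often push_back had to reallocate. With reserve_first set, the space
// is requested once up front.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<double> result;
    if (reserve_first) {
        result.reserve(n);
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = result.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        result.push_back(static_cast<double>(i));
        if (result.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = result.capacity();
        }
    }
    return reallocations;
}
```

With reserve the function returns 0. Without it, the vector grows geometrically, so the count is small but each reallocation copies everything stored so far and briefly needs both the old and the new buffer, which matters for a data set of this size.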

Also, instead of:

vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const

I would write it like this:

void vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment, vector<Position3D> & myVec_out) const //myVec_out is populated inside func

Your idea of using a reference is correct, since you want to avoid copying.

Answer 5

Destructors in C++ must not fail. Therefore deallocation does not allocate memory, because memory can't be allocated with the no-throw guarantee.

As an aside: instead of looping multiple times, it is probably better to do the operations in an integrated manner. That is, instead of loading the whole dataset and then reducing the whole dataset, just read in the points one by one and apply the reduction directly. So instead of

load_my_data()
for_each (p : my_data)
    result.push_back(p)

for_each (p : result)
    reduction.push_back (reduce (p))

Just do

file f ("file")
while (f)
    Point p = read_point (f)
    reduction.push_back (reduce (p))

If you don't need to store those reductions, simply output them sequentially:

file f ("file")
while (f)
    Point p = read_point (f)
    cout << reduce (p)
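In real C++ the pseudocode above could look something like the following sketch. It assumes the input is whitespace-separated "x y z" text, and reduce_point (distance from the origin) is a placeholder for whatever is actually computed per point; neither detail comes from the question.

```cpp
#include <cmath>
#include <istream>
#include <sstream>
#include <string>

// Placeholder per-point computation: distance from the origin.
double reduce_point(double x, double y, double z) {
    return std::sqrt(x * x + y * y + z * z);
}

// Reads "x y z" triples one at a time and folds each reduced value into a
// running sum, so only a single point is held in memory at any moment.
double sum_reduced(std::istream& in) {
    double x, y, z;
    double total = 0.0;
    while (in >> x >> y >> z) {
        total += reduce_point(x, y, z);
    }
    return total;
}

// Convenience wrapper for trying the function on an in-memory string.
double sum_reduced_from_text(const std::string& text) {
    std::istringstream in(text);
    return sum_reduced(in);
}
```

For the real data set, the same sum_reduced can be handed a std::ifstream opened on the input file. Memory use then stays constant regardless of the file size.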

Answer 6

code-1 will work fine and is almost the same as code-2, with no major advantages or disadvantages.

code03: Somebody else should answer that, but I believe the difference between a pointer and a reference in this case would be marginal. I do prefer pointers, though.

That being said, I think you might be approaching the optimization from the wrong angle. Do you really need all points to compute the output of a point in your first loop? Or can you rewrite your algorithm to read only one point, compute the value as you would in your first loop, and then use it immediately the way you want to? Maybe not with single points, but with batches of points. That could potentially cut back on your memory requirements quite a bit with only a small increase in processing time.
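A batched version of that idea might be sketched as follows. The text input format, the Point3D type, and the per-batch work (summing z) are all assumptions made for illustration; the point is only that the buffer never grows beyond batch_size points.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a file
#include <vector>

// Hypothetical point type; the original Position3D is not shown.
struct Point3D { double x, y, z; };

// Refills `batch` with up to batch_size points read as "x y z" triples.
// clear() keeps the capacity, so the buffer is reused between batches.
// Returns false once no more points could be read.
bool read_batch(std::istream& in, std::size_t batch_size, std::vector<Point3D>& batch) {
    batch.clear();
    Point3D p;
    while (batch.size() < batch_size && in >> p.x >> p.y >> p.z) {
        batch.push_back(p);
    }
    return !batch.empty();
}

struct BatchStats {
    std::size_t points;   // total points processed
    std::size_t batches;  // number of batches read
    double z_sum;         // placeholder result of the per-point work
};

BatchStats process_in_batches(std::istream& in, std::size_t batch_size) {
    std::vector<Point3D> batch;
    batch.reserve(batch_size);  // the only large allocation
    BatchStats stats = {0, 0, 0.0};
    while (read_batch(in, batch_size, batch)) {
        ++stats.batches;
        for (std::size_t i = 0; i < batch.size(); ++i) {
            stats.z_sum += batch[i].z;  // stand-in for the real computation
            ++stats.points;
        }
    }
    return stats;
}
```

Here the clear() behaviour criticized elsewhere in this thread is an advantage: keeping the capacity means the buffer is allocated once and reused for every batch.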
