Unexplained expense related to getting data out of a function in optimized and timed C++ code

Question

I have written optimized code for an algorithm that computes a vector of quantities. I have timed it before and after various attempts at getting the data computed in the function out of the function. I think that the specific nature of the computation or the nature of the vector of quantities is not relevant. An outline of the code, timings, and details follows.

All code was compiled with the following flags:

g++ -Wall -Wextra -Werror -std=c++11 -pedantic -O3

I have a class like this:

#ifndef C_H
#define C_H

#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>
#include <vector>
class C {
    public:
        void doWork( int param1, int param2 ) const {
            std::array<unsigned long,40> counts = {{0}};
            // LOTS of branches and inexpensive operations:
            // additions, subtractions, incrementations, and dereferences
            for( /* loop 1 */ ) {
                // LOTS MORE branches and inexpensive operations
                counts[ /* data dependent */ ] += /* data dependent */;
                for( /* loop 2 */ ) {
                    // YET MORE branches and inexpensive operations
                    counts[ /* data dependent */ ] += /* data dependent */;
                }
            }
            counts[ /* data dependent */ ] = /* data dependent */;
            /* exclude for profiling
            std::copy( counts.begin(), counts.end(), std::ostream_iterator<unsigned long>( std::cout, "," ) );
            std::cout << "\n";
            */
        }
    private:
        // there is private data here that is processed above
        // the results get added into the array/vector as they are computed
};

#endif

And a main like this:

#include <iostream>
#include "c.h"
int main( int argc, char * argv[] ) {
    C c( /* set the private data of c by passing data in */ );
    int param1;
    int param2;
    while( std::cin >> param1 >> param2 ) {
        c.doWork( param1, param2 );
    }
}

Here are some relevant details about the data:

20 million pairs read at standard input (redirected from a file)
20 million calls to c.doWork
60 million TOTAL iterations through the outer loop in c.doWork
180 million TOTAL iterations through the inner loop in c.doWork

All of this requires exactly 5 minutes and 48 seconds to run. Naturally I can print the array within the class function, and that is what I have been doing, but I am going to release the code publicly, and some use cases may include wanting to do something other than printing the vector. In that case, I need to change the function signature to actually get the data to the user. This is where the problem arises. Things that I have tried:

Creating a vector in main and passing it in by reference:
```
std::vector<unsigned long> counts( 40 );
while( std::cin >> param1 >> param2 ) {
    c.doWork( param1, param2, counts );
    std::fill( counts.begin(), counts.end(), 0 );
}
```
This requires 7 minutes 30 seconds. Removing the call to std::fill only reduces this by 15 seconds, so that doesn't account for the discrepancy.

Creating a vector within the doWork function and returning it, taking advantage of move semantics. Since this requires a dynamic allocation for each result, I didn't expect this to be fast. Strangely, it's not a lot slower: 7 minutes 40 seconds.

Returning the std::array currently in doWork by value. Naturally this has to copy the data upon return since the stack array does not support move semantics. 7 minutes 30 seconds.

Passing a std::array in by reference:
```
while( std::cin >> param1 >> param2 ) {
    std::array<unsigned long,40> counts = {{0}};
    c.doWork( param1, param2, counts );
}
```
I would expect this to be roughly equivalent to the original. The data is placed on the stack in the main function, and it is passed by reference to doWork, which fills it. 7 minutes 20 seconds. This one really stymies me.

I have not tried passing pointers in to doWork, because this should be equivalent to passing by reference.

One solution is naturally to have two versions of the function: one that prints locally and one that returns. The roadblock is that I would have to duplicate ALL code, because the entire issue here is that I cannot efficiently get the results out of a function.
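One way to avoid duplicating the body is to let the caller decide what happens to the finished counts. The sketch below is a hypothetical reshaping of doWork, not the original code: the class name `Counter`, the `data_` member, and the toy loop are assumptions made only so that the example compiles. The array stays a local variable while the loops run and is handed to a caller-supplied callable once at the end.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the class in the question.
class Counter {
public:
    using Counts = std::array<unsigned long, 40>;

    explicit Counter(std::vector<int> data) : data_(std::move(data)) {}

    // The counting body is written once. `sink` is any callable that
    // accepts a const Counts&; it decides whether to print, copy, etc.
    template <class Sink>
    void doWork(int param1, int param2, Sink&& sink) const {
        Counts counts = {{0}};  // local accumulator, as in the original
        // Toy placeholder for the real data-dependent loops.
        for (std::size_t i = 0; i < data_.size(); ++i) {
            const std::size_t slot =
                static_cast<std::size_t>(data_[i] + param1) % counts.size();
            assert(slot < counts.size());
            counts[slot] += static_cast<unsigned long>(param2);
        }
        sink(counts);  // hand the finished result over exactly once
    }

private:
    std::vector<int> data_;  // placeholder for the private data
};
```

A printing caller could pass a lambda that writes the array to std::cout, and a collecting caller could pass one that copies it into its own storage. Whether this keeps the original timing would still need to be measured.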

So I am mystified. I understand that any of these solutions requires an extra dereference for every access to the array/vector inside doWork, but these extra dereferences are trivial compared to the huge number of other fast operations and the more troublesome data-dependent branches.

I welcome any ideas to explain this. My only thought is that the compiler is optimizing the code so that some otherwise necessary parts of the computation are omitted in the original case, because the compiler realizes that they are not needed. But this seems to be contradicted on several counts:

Making changes to the code inside the loops does change the timings.
The original timing is 5 minutes 50 seconds, whereas just reading the pairs from the file takes 12 seconds, so a lot is being done.
Maybe only operations involving counts are being optimized away, but that seems like a strangely selective optimization, given that if those are being optimized away the compiler could realize that the supporting computations in doWork are also unnecessary.
If operations involving counts ARE being optimized away, why are they not optimized away in the other cases? I am not actually using them in main.

Is it the case that doWork is compiled and optimized independently of main, and thus if the function has any obligation to return the data in any form, it cannot be certain whether the data will be used or not?

Is my method of profiling without printing flawed? I left the printing out to avoid its cost and to emphasize the relative differences between the various methods.
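If the concern is that unused results allow the optimizer to drop work, one common benchmarking approach is to fold every result into a cheap checksum and print that single number after the loop. The results are then observable without paying for formatted output on every call. A minimal sketch, assuming a hypothetical `computeCounts` stand-in for doWork and a hypothetical helper `foldIntoChecksum` (neither appears in the question):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

using Counts = std::array<unsigned long, 40>;

// Hypothetical stand-in for doWork: fills a local array and returns it.
inline Counts computeCounts(int param1, int param2) {
    Counts counts = {{0}};
    // Map param1 (possibly negative) to a slot in [0, 40).
    const std::size_t slot = static_cast<std::size_t>((param1 % 40 + 40) % 40);
    assert(slot < counts.size());
    counts[slot] += static_cast<unsigned long>(param2);
    return counts;
}

// Adds every element of one result to a running checksum, so the
// result is actually consumed by the timing loop.
inline unsigned long foldIntoChecksum(unsigned long checksum, const Counts& counts) {
    return std::accumulate(counts.begin(), counts.end(), checksum);
}
```

In the timing loop this would be used as `checksum = foldIntoChecksum(checksum, computeCounts(param1, param2));`, with `checksum` printed once after the loop ends. The summation adds a small, constant cost to every variant being compared.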

I am grateful for any light you can shed.

Answer 1

What I would do is pause it a few times and see what it's doing most of the time. Looking at your code, I would suspect most of the time is going into either 1) the innermost loop, especially the index calculation, or 2) the allocation of the std::array.

If the size of counts is always 40, I would just do

  long counts[40];
  memset(counts, 0, sizeof(counts));

That allocates on the stack, which takes no time, and memset takes no time compared to whatever else you're doing.

If the size varies at runtime, then what I do is some static allocation, like this:

void myRoutine(){
  /* this does not claim to be pretty. it claims to be efficient */
  static int nAlloc = 0;
  static long* counts = NULL;
  /* this initially allocates the array, and makes it bigger if necessary */
  if (nAlloc < /* size I need */){
    if (counts) delete[] counts; /* array form, because counts comes from new[] */
    nAlloc = /* size I need */;
    counts = new long[nAlloc];
  }
  memset(counts, 0, sizeof(long)*nAlloc);
  /* do the rest of the stuff */
}

This way, counts is always big enough, and the point is to 1) do new as few times as possible, and 2) keep the indexing into counts as simple as possible.
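The same idea, allocating rarely and zeroing on every call, can also be sketched with a std::vector that the caller keeps alive between calls. This is an alternative to the static raw pointer above rather than the answerer's code, and `resetCounts` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Resizes `counts` to `needed` elements and sets every element to zero.
// A new allocation is only required when `needed` exceeds the current
// capacity; typical implementations reuse the existing storage otherwise.
inline void resetCounts(std::vector<long>& counts, std::size_t needed) {
    counts.assign(needed, 0L);
    assert(counts.size() == needed);
}
```

The caller would declare one `std::vector<long>` outside the loop and call `resetCounts` at the start of each iteration, which avoids both the manual delete and the function-local static state.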

But first I would do the pauses, just to be sure. After fixing it, I would do that again to see what's the next thing I could fix.

Answer 2

Compiler optimizations are one place to look, but there is one more. Changes that you make in the code can disturb the cache layout. If the memory allocated to the array is in a different part of memory each time, the number of cache misses in your system can increase, which in turn degrades performance. You can look at the hardware performance counters on your CPU to make a better guess about it.

Answer 3

There are times when unorthodox solutions are applicable, and this may be one. Have you considered making the array a global?

Still, the one crucial benefit that local variables have is that the optimizer can find all accesses to them using information from the function only. That makes register assignment a whole lot easier.

A static variable inside the function is almost the same, but in your case the address of that stack array would escape, beating the optimizer once again.
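One way to act on this observation while still returning data is to accumulate into a local array and copy it into the caller's storage once, after the loops have finished. The sketch below uses a hypothetical free function `doWorkInto` with a toy loop body, neither of which comes from the question, and whether it restores the original timing has to be measured.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Counts = std::array<unsigned long, 40>;

// Hypothetical variant of doWork that delivers its result through `out`.
inline void doWorkInto(const std::vector<int>& data, int param, Counts& out) {
    Counts local = {{0}};  // never passed to other code while the loop runs
    for (std::size_t i = 0; i < data.size(); ++i) {
        const std::size_t slot =
            static_cast<std::size_t>(data[i] + param) % local.size();
        assert(slot < local.size());
        local[slot] += 1;  // all hot-loop writes go to the local array
    }
    out = local;  // single 40-element copy into the caller's storage
}
```

The copy costs 40 element assignments per call, which is small next to the loop work described in the question, and the caller's array is overwritten rather than accumulated into, so it needs no separate reset.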


Answer 1
0, accepted, 2013-04-28 22:13:20

Answer 2
0, 2013-04-28 22:14:21

Answer 3
0, 2013-04-29 00:04:35
