Optimizing C++ code for performance

Question

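Can you think of some ways to optimize this piece of code? It is meant to run on an ARMv7 processor (iPhone 3GS). The percentages in the left margin are the profiler's share of time per line: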

4.0%  inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) 
      {
0.7%    float *data = (float *) img->imageData;
1.4%    int step = img->widthStep/sizeof(float);

        // The subtraction by one for row/col is because row/col is inclusive.
1.1%    int r1 = std::min(row,          img->height) - 1;
1.0%    int c1 = std::min(col,          img->width)  - 1;
2.7%    int r2 = std::min(row + rows,   img->height) - 1;
3.7%    int c2 = std::min(col + cols,   img->width)  - 1;

        float A(0.0f), B(0.0f), C(0.0f), D(0.0f);
8.5%    if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
11.7%   if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
7.6%    if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
9.2%    if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];

21.9%   return std::max(0.f, A - B - C + D);
3.8%  }

All of this code comes from the OpenSURF library. Here is the context in which the function is called (some people asked for it):

//! Calculate DoH responses for supplied layer
void FastHessian::buildResponseLayer(ResponseLayer *rl)
{
  float *responses = rl->responses;         // response storage
  unsigned char *laplacian = rl->laplacian; // laplacian sign storage
  int step = rl->step;                      // step size for this filter
  int b = (rl->filter - 1) * 0.5 + 1;         // border for this filter
  int l = rl->filter / 3;                   // lobe for this filter (filter size / 3)
  int w = rl->filter;                       // filter size
  float inverse_area = 1.f/(w*w);           // normalisation factor
  float Dxx, Dyy, Dxy;

  for(int r, c, ar = 0, index = 0; ar < rl->height; ++ar) 
  {
    for(int ac = 0; ac < rl->width; ++ac, index++) 
    {
      // get the image coordinates
      r = ar * step;
      c = ac * step; 

      // Compute response components
      Dxx = BoxIntegral(img, r - l + 1, c - b, 2*l - 1, w)
          - BoxIntegral(img, r - l + 1, c - l * 0.5, 2*l - 1, l)*3;
      Dyy = BoxIntegral(img, r - b, c - l + 1, w, 2*l - 1)
          - BoxIntegral(img, r - l * 0.5, c - l + 1, l, 2*l - 1)*3;
      Dxy = + BoxIntegral(img, r - l, c + 1, l, l)
            + BoxIntegral(img, r + 1, c - l, l, l)
            - BoxIntegral(img, r - l, c - l, l, l)
            - BoxIntegral(img, r + 1, c + 1, l, l);

      // Normalise the filter responses with respect to their size
      Dxx *= inverse_area;
      Dyy *= inverse_area;
      Dxy *= inverse_area;

      // Get the determinant of hessian response & laplacian sign
      responses[index] = (Dxx * Dyy - 0.81f * Dxy * Dxy);
      laplacian[index] = (Dxx + Dyy >= 0 ? 1 : 0);

#ifdef RL_DEBUG
      // create list of the image coords for each response
      rl->coords.push_back(std::make_pair(r, c));
#endif
    }
  }
}

Some questions:
Is it a good idea for the function to be inline? Would using inline assembly provide a significant speedup?

Answer 1

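Specialize for the edges so that you don't need to check for them in every row and column. I'm assuming this call sits in a nested loop and is called a lot. The function would become: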

inline float BoxIntegralNonEdge(IplImage *img, int row, int col, int rows, int cols) 
{
  float *data = (float *) img->imageData;
  int step = img->widthStep/sizeof(float);

  // The subtraction by one for row/col is because row/col is inclusive.
  int r1 = row - 1;
  int c1 = col - 1;
  int r2 = row + rows - 1;
  int c2 = col + cols - 1;

  float A(data[r1 * step + c1]), B(data[r1 * step + c2]), C(data[r2 * step + c1]), D(data[r2 * step + c2]);

  return std::max(0.f, A - B - C + D);
}

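This gets rid of the conditional and branch inside each min, and of the two conditions and the branch in each if. You may only call this function when the preconditions already hold; check them once in the caller for the whole row instead of once per pixel.

I wrote up some tips for optimizing image processing when you have to do work on every pixel:

http://www.atalasoft.com/cs/blogs/loufranco/archive/2006/04/28/9985.aspx

Other points from the blog post:

1. You are recalculating a position in the image data with 2 multiplications (indexing is a multiplication); you should be incrementing a pointer instead.
2. Instead of passing in img, row, rows, col and cols, pass pointers to the exact pixels to process, which you obtain by incrementing pointers rather than by indexing.
3. If you don't do the above, step is the same for all pixels, so calculate it in the caller and pass it in. If you do 1 and 2, you won't need step at all.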
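
The points above can be sketched as follows. This is only an illustration under assumptions: `Image` is a hypothetical stand-in for `IplImage` (so the snippet builds without OpenCV), and `IsInterior`, `BoxIntegralPtr` and `BoxRow` are made-up names, not part of OpenSURF. The caller checks the bounds once for a whole run of boxes and then advances a pointer instead of re-indexing.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for IplImage: a row-major float integral image.
struct Image {
  int width;
  int height;
  int step;                 // floats per row (widthStep / sizeof(float))
  std::vector<float> data;
};

// True when all four corners read by the box sum lie inside the image,
// so neither the min() clamping nor the >= 0 checks are needed.
inline bool IsInterior(const Image &img, int row, int col, int rows, int cols) {
  return rows >= 0 && cols >= 0 && row >= 1 && col >= 1 &&
         row + rows <= img.height && col + cols <= img.width;
}

// Box sum given a pointer to corner A = data[(row - 1) * step + (col - 1)].
// rowOffset (= rows * step) and cols are computed once by the caller.
inline float BoxIntegralPtr(const float *a, int rowOffset, int cols) {
  const float A = a[0];
  const float B = a[cols];
  const float C = a[rowOffset];
  const float D = a[rowOffset + cols];
  return std::max(0.f, A - B - C + D);
}

// Sums `count` horizontally adjacent boxes of equal size. The bounds are
// checked once for the whole run; the loop only increments a pointer.
inline std::vector<float> BoxRow(const Image &img, int row, int col,
                                 int rows, int cols, int count) {
  assert(count >= 1);
  assert(IsInterior(img, row, col, rows, cols));
  assert(IsInterior(img, row, col + count - 1, rows, cols));
  std::vector<float> out;
  out.reserve(count);
  const float *a = img.data.data() + (row - 1) * img.step + (col - 1);
  const int rowOffset = rows * img.step;
  for (int i = 0; i < count; ++i, ++a) {
    out.push_back(BoxIntegralPtr(a, rowOffset, cols));
  }
  return out;
}
```

Boxes that fail `IsInterior` would still go through the original, fully checked `BoxIntegral`; whether the split pays off on the device has to be measured.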

Answer 2

There are a few places where temporary variables can be reused, but whether that improves performance has to be measured, as dirkgently stated:

Change

  if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1]; 
  if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2]; 
  if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1]; 
  if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];

to

  if (r1 >= 0) {
    int r1Step = r1 * step;
    if (c1 >= 0) A = data[r1Step + c1]; 
    if (c2 >= 0) B = data[r1Step + c2]; 
  }
  if (r2 >= 0) {
    int r2Step = r2 * step;
    if (c1 >= 0) C = data[r2Step + c1]; 
    if (c2 >= 0) D = data[r2Step + c2]; 
  }

This may actually end up doing the temporary multiplications more often if the if conditions rarely evaluate to true.

Answer 3

You aren't interested in the four variables A, B, C and D individually, only in the combination A - B - C + D.

Try

float result(0.0f);
if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];

if (result > 0.0f) return result;
return 0.0f;
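
For completeness, here is a self-contained sketch of the accumulator variant as a whole function. It assumes a hypothetical `Image` struct in place of `IplImage` and a made-up name, `BoxIntegralAccum`; the clamping and edge checks are copied from the question's code, so it should return the same values as the original.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for IplImage so the sketch builds without OpenCV.
struct Image {
  int width;
  int height;
  int step;                 // floats per row
  std::vector<float> data;  // row-major integral image
};

// Same clamping and edge checks as the original BoxIntegral, but the four
// corner reads are folded directly into a single accumulator.
inline float BoxIntegralAccum(const Image &img, int row, int col,
                              int rows, int cols) {
  assert(img.step >= img.width);
  const float *data = img.data.data();
  const int step = img.step;

  // The subtraction by one is because row/col is inclusive.
  const int r1 = std::min(row,        img.height) - 1;
  const int c1 = std::min(col,        img.width)  - 1;
  const int r2 = std::min(row + rows, img.height) - 1;
  const int c2 = std::min(col + cols, img.width)  - 1;

  float result = 0.0f;
  if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];  // + A
  if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];  // - B
  if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];  // - C
  if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];  // + D

  return result > 0.0f ? result : 0.0f;
}
```

The additions and subtractions happen in the same order as in A - B - C + D, so the floating-point result is not expected to differ from the original.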

Answer 4

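The compiler will probably handle inlining automatically where appropriate.

Without any knowledge of the context: is the if (r1 >= 0 && c1 >= 0) check necessary?

Isn't it required that the row and col parameters are > 0?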

float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) 
{
  assert(row > 0 && col > 0);
  float *data = reinterpret_cast<float*>(img->imageData); // Don't use C-style casts
  int step = img->widthStep/sizeof(float);

  // Is the min check really necessary?
  int r1 = std::min(row,          img->height) - 1;
  int c1 = std::min(col,          img->width)  - 1;
  int r2 = std::min(row + rows,   img->height) - 1;
  int c2 = std::min(col + cols,   img->width)  - 1;

  int r1_step = r1 * step;
  int r2_step = r2 * step;

  float A = data[r1_step + c1];
  float B = data[r1_step + c2];
  float C = data[r2_step + c1];
  float D = data[r2_step + c2];

  return std::max(0.0f, A - B - C + D);
}

Answer 5

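I'm not sure whether your problem lends itself to SIMD, but it could let you perform multiple operations on your image at once and give you a good performance improvement. I'm assuming you are inlining and optimizing because you perform the operation many times. Take a look at:

The compiler does have some support for NEON if the correct flags are enabled, but you will probably need to roll some of it yourself.

Edit: To get compiler support for NEON you will need to use the compiler flag -mfpu=neon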
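
As a rough idea of what a hand-rolled NEON version could look like, the sketch below computes max(0, A - B - C + D) for four boxes at once. It rests on a strong assumption: the four boxes are shifted by one column each, so that each corner's four values are contiguous in memory. That is not how buildResponseLayer walks the image when its step is larger than 1. `BoxIntegral4` is a made-up name, edge handling is left out, and a scalar fallback is used when NEON is not available.

```cpp
#include <algorithm>
#include <cassert>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Computes max(0, A - B - C + D) for four horizontally adjacent boxes.
// a, b, c and d each point at four consecutive floats (one corner value per
// box); out receives the four clamped box sums.
inline void BoxIntegral4(const float *a, const float *b, const float *c,
                         const float *d, float *out) {
  assert(a && b && c && d && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  const float32x4_t va = vld1q_f32(a);
  const float32x4_t vb = vld1q_f32(b);
  const float32x4_t vc = vld1q_f32(c);
  const float32x4_t vd = vld1q_f32(d);
  // ((A - B) - C) + D, lane by lane.
  float32x4_t r = vaddq_f32(vsubq_f32(vsubq_f32(va, vb), vc), vd);
  // Clamp negative sums to zero, like std::max(0.f, ...).
  r = vmaxq_f32(r, vdupq_n_f32(0.0f));
  vst1q_f32(out, r);
#else
  // Scalar fallback for targets without NEON.
  for (int i = 0; i < 4; ++i) {
    out[i] = std::max(0.0f, a[i] - b[i] - c[i] + d[i]);
  }
#endif
}
```

Whether this beats the scalar code on the device depends on how much time goes into gathering the corner values, so it needs profiling like every other suggestion here.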

Answer 6

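Some of the examples initialize A, B, C and D directly and skip the initialization with 0, but that is functionally different from your original code in some respects. I would do this instead: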

inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)  {

    const float *data = (float *) img->imageData;
    const int step = img->widthStep/sizeof(float);

    // The subtraction by one for row/col is because row/col is inclusive.
    const int r1 = std::min(row,          img->height) - 1;
    const int r2 = std::min(row + rows,   img->height) - 1;
    const int c1 = std::min(col,          img->width)  - 1;
    const int c2 = std::min(col + cols,   img->width)  - 1;

    const float A = (r1 >= 0 && c1 >= 0) ? data[r1 * step + c1] : 0.0f;
    const float B = (r1 >= 0 && c2 >= 0) ? data[r1 * step + c2] : 0.0f;
    const float C = (r2 >= 0 && c1 >= 0) ? data[r2 * step + c1] : 0.0f;
    const float D = (r2 >= 0 && c2 >= 0) ? data[r2 * step + c2] : 0.0f;

    return std::max(0.f, A - B - C + D);
}

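As in your original code, this gives A, B, C and D a value from data[] if the condition is true and 0.0f if it is false. I would also (as shown) use const wherever appropriate. Many compilers can't improve code much based on const-ness, but it certainly can't hurt to give the compiler more information about the data it is operating on. Finally, I reordered the r1/r2/c1/c2 variables to encourage reuse of the fetched width and height.

Obviously, you will need to profile to determine whether any of this is an improvement.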
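
A coarse first check can be done with a small wall-clock harness like the sketch below. `TimeMs` is a hypothetical helper, not part of any library mentioned here, and a real profiler on the target device remains the better tool.

```cpp
#include <cassert>
#include <chrono>

// Calls fn(i) for i in [0, iterations) and returns the elapsed wall-clock
// time in milliseconds. The results are accumulated into `sink` so the calls
// are less likely to be optimized away.
template <typename Fn>
double TimeMs(Fn fn, int iterations, float &sink) {
  assert(iterations > 0);
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    sink += fn(i);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the original BoxIntegral and each variant in a lambda, running both over the same coordinates and comparing the timings across several runs gives a first impression of whether a change helps.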

6 answers

Answer 1: 8 (accepted), 2010-09-08 14:12:25
Answer 2: 1, 2010-09-08 14:11:46
Answer 3: 1, 2010-09-08 14:36:02
Answer 4: 0, 2010-09-08 14:12:33
Answer 5: 0, 2010-09-08 14:29:10
Answer 6: 0, 2010-09-08 14:37:01
