
Optimization of a code in C

I am trying to optimize a piece of C code, specifically a critical loop which takes almost 99.99% of total execution time. Here is that loop:

#pragma omp parallel shared(NTOT,i) num_threads(4)
{
  # pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
  for(j = 1; j <= NTOT; j++){
    if(j == i) continue;
    dx = (X[j][0]-X[i][0])*a;
    dy = (X[j][1]-X[i][1])*a;
    d = sqrt(dx*dx+dy*dy);
    V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
    E = dS[0]*dx+dS[1]*dy;
    F = spin[2*j-2]*dx+spin[2*j-1]*dy;
    G = -3*(D/(d*d*d*d*d))*E*F;
    dU += (V+G);
  }
}

All variables are local. The loop takes 0.7 seconds for NTOT = 3600, which is a large amount of time, especially since I have to do this 500,000 times in the whole program, resulting in 97 hours spent in this loop. My question is: are there other things to be optimized in this loop?

My computer's processor is an Intel Core i5 with 4 CPUs (4 × 1600 MHz) and 3072K of L3 cache.

Optimize for hardware or software?

Software:

Get rid of time-consuming exceptions such as divide-by-zero:

d = sqrt(dx*dx+dy*dy   + 0.001f   );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);

You could also try John Carmack, Terje Mathisen and Gary Tarolli's "fast inverse square root" for the

   D/(d*d*d)

part. You get rid of the division too:

  float qrsqrt = Q_rsqrt(dx*dx + dy*dy + easing);
  qrsqrt = qrsqrt*qrsqrt*qrsqrt * D;   /* D/(d*d*d) without a division */

at the cost of some precision.

There is another division to be gotten rid of as well:

(D/(d*d*d*d*d))

which can be replaced by something like:

 qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D

Here is the fast inverse sqrt:

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what ?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
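
For illustration, here is a sketch of how the loop body could look with Q_rsqrt replacing both the sqrt and the divisions (the easing constant is the assumed small offset from the snippet above):

float inv  = Q_rsqrt(dx*dx + dy*dy + easing);   /* approx. 1/d   */
float inv3 = inv*inv*inv;                       /* approx. 1/d^3 */
V  = D*inv3*(dS[0]*spin[2*j-2] + dS[1]*spin[2*j-1]);
E  = dS[0]*dx + dS[1]*dy;
F  = spin[2*j-2]*dx + spin[2*j-1]*dy;
G  = -3.0f*D*inv3*inv*inv*E*F;                  /* D/(d^5) = D*inv^5 */
dU += V + G;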

To overcome big arrays' non-caching behaviour, you can do the computation in smaller patches/groups, especially when it is a many-to-many O(N*N) algorithm. Such as:

get 256 particles.
compute 256 x 256 relations.
save the 256 results in variables.

select another 256 particles as the target (keeping the first 256 group in place).
do the same calculations, but this time group 1 vs group 2.

save the first 256 results again.

move to the 3rd group.

repeat.

do the same until all particles have been tested against the first 256 particles.

now get the second group of 256.

iterate until all 256-particle groups are complete.

Your CPU has a big cache, so you can try 32k particles versus 32k particles directly. But L1 may not be big, so if I were you I would stick with 512 vs 512 (or 500 vs 500 to avoid cache-line effects; this is going to be architecture-dependent).
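
A minimal sketch of the blocking idea, assuming a per-particle accumulator dU[] and a helper interact() that stands in for the original loop body (TILE is illustrative):

#define TILE 512
#define MIN(a,b) ((a) < (b) ? (a) : (b))

for (int ib = 1; ib <= NTOT; ib += TILE) {
    for (int jb = 1; jb <= NTOT; jb += TILE) {
        /* this TILE x TILE block of X[] and spin[] stays hot in cache */
        for (int i = ib; i <= MIN(ib + TILE - 1, NTOT); i++) {
            for (int j = jb; j <= MIN(jb + TILE - 1, NTOT); j++) {
                if (i == j) continue;
                dU[i] += interact(i, j);   /* the original loop body */
            }
        }
    }
}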

Hardware:

SSE, AVX, GPGPU, FPGA, ...

As @harold commented, SSE should be the starting point for comparison, and you should vectorize, or at least parallelize, through 4-wide packed vector instructions, which have the advantage of optimal memory fetching and pipelining. When you need 3x-10x more performance (on top of an SSE version using all cores), you will need an OpenCL/CUDA-capable GPU (equally priced as the i5) and the OpenCL (or CUDA) API; you could also learn OpenGL, but that seems harder (maybe DirectX is easier).

Trying SSE is easiest and should give about 3x the speed of the fast inverse square root I mentioned above. An equally priced GPU should give another 3x over SSE, at least for thousands of particles. At or above 100k particles, a whole GPU can achieve 80x the performance of a single CPU core for this type of algorithm, if you optimize it enough (making it less dependent on main memory). OpenCL gives you the ability to address the cache to hold your arrays, so you can use terabytes per second of bandwidth there.

I would always do random pausing to pin down exactly which lines are most costly. Then, after fixing something, I would do it again to find the next fix, and so on.

That said, some things look suspicious. People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.

  • X[i], X[j] and spin[2*j-1] (and spin[2*j-2]) look like candidates for pointers. There is no need to do this index calculation inside the loop and then hope the optimizer can remove it.

  • You could define a variable d2 = dx*dx+dy*dy and then say d = sqrt(d2). Then wherever you have d*d you can instead write d2.

  • I suspect a lot of samples will land in the sqrt function, so I would try to figure out a way around using it.

  • I do wonder if some of these quantities, like (dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]), could be calculated in a separate unrolled loop outside this one. In some cases two loops can be faster than one if the compiler can save some registers. A sketch combining these suggestions follows.
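
For illustration, a sketch that combines these suggestions, assuming the arrays hold doubles: X[i] is loaded once, spin[] is walked with a pointer, and d2 replaces the repeated d*d:

const double xi0 = X[i][0], xi1 = X[i][1];   /* hoisted out of the loop */
const double *sp = &spin[0];   /* sp[0], sp[1] == spin[2*j-2], spin[2*j-1] */
for (j = 1; j <= NTOT; j++, sp += 2) {
    if (j == i) continue;
    double dx = (X[j][0] - xi0) * a;
    double dy = (X[j][1] - xi1) * a;
    double d2 = dx*dx + dy*dy;               /* reused instead of d*d */
    double d  = sqrt(d2);
    double V  = (D / (d2 * d)) * (dS[0]*sp[0] + dS[1]*sp[1]);
    double E  = dS[0]*dx + dS[1]*dy;
    double F  = sp[0]*dx + sp[1]*dy;
    double G  = -3.0 * (D / (d2 * d2 * d)) * E * F;
    dU += V + G;
}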

I cannot believe that 3600 iterations of an O(1) loop body can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I suggest checking whether optimization is enabled, and how long thread spawning takes.

General

Your inner loop is very simple and contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions and multiplications. You are doing three of them, so most of the time is eaten by them.

First of all, you can compute the reciprocal square root in one operation instead of computing the square root and then taking its reciprocal. Second, you should save the result and reuse it when necessary (right now you divide by d twice). This results in one problematic operation per iteration instead of three:

invD = rsqrt(dx*dx+dy*dy);   /* one reciprocal square root per iteration */
V = (D * (invD*invD*invD))*(...);
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;
dU += (V+G);

In order to further reduce the time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on the size of your arguments and the desired precision of the result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.

Indeed, you can go further and vectorize the whole loop with SSE (or even AVX); that is not hard.
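
As a minimal sketch, SSE's _mm_rsqrt_ps computes four approximate reciprocal square roots at once; one Newton-Raphson step, shown here, is the usual way to refine its roughly 12-bit estimate:

#include <xmmintrin.h>

/* y' = 0.5 * y * (3 - d2*y*y) refines the hardware estimate */
static inline __m128 rsqrt4(__m128 d2)
{
    __m128 y  = _mm_rsqrt_ps(d2);
    __m128 yy = _mm_mul_ps(y, y);
    __m128 t  = _mm_sub_ps(_mm_set1_ps(3.0f), _mm_mul_ps(d2, yy));
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), y), t);
}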

OpenCL

If you are ready to use some big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it to OpenCL (except for some initial adaptation to OpenCL itself).

Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them automatically use multithreading. Also, they are likely to vectorize your loop automatically (e.g. see this article). Finally, there is a chance that they will find a good implementation of rsqrt automatically, if you use the native_rsqrt function or something like that.

Also, you would be able to run your code on a GPU. If you use single precision, it may result in a significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow at double precision, because they lack the necessary hardware.
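
As a sketch of what the port might look like, here is a hypothetical OpenCL C kernel for this loop; the kernel name, the float2 packing of X and spin, and the per-j partial[] array (reduced on the host afterwards) are all assumptions:

__kernel void du_terms(__global const float2 *X,
                       __global const float2 *spin,
                       const float2 dS, const float D, const float a,
                       const int i, __global float *partial)
{
    int j = get_global_id(0);
    if (j == i) { partial[j] = 0.0f; return; }
    float dx   = (X[j].x - X[i].x) * a;
    float dy   = (X[j].y - X[i].y) * a;
    float inv  = native_rsqrt(dx*dx + dy*dy);   /* fast 1/d */
    float inv3 = inv*inv*inv;
    float V = D*inv3*(dS.x*spin[j].x + dS.y*spin[j].y);
    float E = dS.x*dx + dS.y*dy;
    float F = spin[j].x*dx + spin[j].y*dy;
    partial[j] = V - 3.0f*D*inv3*inv*inv*E*F;   /* V + G */
}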

Minor optimisations:

  • (d * d * d) is calculated twice. Store d*d and use it for d^3 and d^5.
  • Replace 2 * x with x << 1. A small sketch of both points follows.
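
A small sketch of both points; d3, d5 and k are illustrative names, and the shift applies only to the integer index arithmetic, not to floating-point values:

double d2 = dx*dx + dy*dy;
double d  = sqrt(d2);
double d3 = d2 * d;            /* d^3, used by V */
double d5 = d2 * d2 * d;       /* d^5, used by G */
int    k  = (j << 1) - 2;      /* 2*j - 2: first spin index for particle j */
V = (D / d3) * (dS[0]*spin[k] + dS[1]*spin[k+1]);
G = -3 * (D / d5) * E * F;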
