Optimization of C code

For an assignment in a course called High Performance Computing, I was required to optimize the following code fragment:

int foobar(int a, int b, int N)
{
    int i, j, k, x, y;
    x = 0;
    y = 0;
    k = 256;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*k);
            if (i > j){
               y = y + 8*(i-j);
            }else{
               y = y + 8*(j-i);
            }
        }
    }
    return x;
}

Using some recommendations, I managed to optimize the code (or at least I think so), applying techniques such as:

  1. Constant Propagation
  2. Algebraic Simplification
  3. Copy Propagation
  4. Common Subexpression Elimination
  5. Dead Code Elimination
  6. Loop Invariant Removal
  7. Bitwise shifts instead of multiplication, as they are less expensive.

Here's my code:

int foobar(int a, int b, int N) {

    int i, j, x, y, t;
    x = 0;
    y = 0;
    for (i = 0; i <= N; i++) {
        t = i + 512;
        for (j = i + 1; j <= N; j++) {
            x = x + ((i<<3) + (j<<2))*t;
        }
    }
    return x;
}

According to my instructor, well-optimized code should have fewer or less costly instructions at the assembly language level, and must therefore run in less time than the original code. That is, calculations are made with:

execution time = instruction count * cycles per instruction
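As a hedged aside, the relation above yields a cycle count; the usual textbook form also multiplies by the clock period. The symbols $IC$, $CPI$, and $T_{\text{clk}}$ and all numbers below are illustrative assumptions, not values from the assignment.

```latex
T_{\text{exec}} = IC \times CPI \times T_{\text{clk}}

% Hypothetical example: IC = 10^9 instructions, average CPI = 1.5, T_clk = 0.5 ns
T_{\text{exec}} = 10^{9} \times 1.5 \times 0.5\,\text{ns} = 0.75\,\text{s}
```

Because CPI differs between instructions, the number of lines in an assembly listing is not by itself a measure of execution time.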

When I generate assembly code using the command gcc -o code_opt.s -S foobar.c,

the generated code has many more lines than the original despite the optimizations I made, and the run time is lower, but not by much compared with the original code. What am I doing wrong?

I am not pasting the assembly code, as both listings are very long. I call the function foobar from main and measure the execution time using the time command in Linux:

int main () {
    int a,b,N;

    scanf ("%d %d %d",&a,&b,&N);
    printf ("%d\n",foobar (a,b,N));
    return 0;
}
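The time command measures the whole process, including startup and the scanf call. A minimal sketch of timing only the function call from inside the program is shown here; the helper name seconds_per_call and the repetition count are assumptions made for illustration, and N is expected to be small enough that x does not overflow int. Results also depend on the optimization level passed to gcc (for example -O0 versus -O2).

```c
#include <time.h>

/* Function under test: the question's double loop, with the unused
   a, b, and y removed. */
int foobar(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++)
        for (int j = i + 1; j <= N; j++)
            x += 4 * (2 * i + j) * (i + 512);
    return x;
}

/* Hypothetical helper: average CPU seconds per call over `reps` calls. */
double seconds_per_call(int N, int reps)
{
    volatile unsigned sink = 0;   /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += (unsigned)foobar(N);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling seconds_per_call(100, 100000) for each variant of foobar gives per-call figures that can be compared directly.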

y does not affect the final result of the code, so it can be removed:

int foobar(int a, int b, int N)
{
    int i, j, k, x, y;
    x = 0;
    //y = 0;
    k = 256;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*k);
            //if (i > j){
            //   y = y + 8*(i-j);
            //}else{
            //   y = y + 8*(j-i);
            //}
        }
    }
    return x;
}

k is simply a constant:

int foobar(int a, int b, int N)
{
    int i, j, x;
    x = 0;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*256);
        }
    }
    return x;
}

The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. The inner loop runs over N-i values of j, and those values sum to (N-i)*(N+i+1)/2, so use math to push everything to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
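A minimal sketch of that single-loop form follows. The function name foobar_single_loop and the temporaries cnt and s are introduced here for illustration only.

```c
/* Single-loop version: the inner sum over j = i+1 .. N is replaced by
   its closed form, so only the loop over i remains. */
int foobar_single_loop(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int cnt = N - i;        /* number of j values in i+1 .. N      */
        int s   = N + i + 1;    /* cnt * s equals twice the sum of j   */
        x += 8 * i * i * cnt + 4096 * i * cnt
           + 2 * i * cnt * s + 1024 * cnt * s;
    }
    return x;
}
```

For N = 2 this returns 14352, the same value the doubly nested loop produces.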

You can expand the above expression and apply the sum of squares and sum of cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise to you. As a result, i and j will also be removed.

a and b should also be removed if possible, since they are supplied as arguments but never used in your code.

Sum of squares and sum of cubes formulas:

  • Sum(x^2, x = 1..n) = n(n + 1)(2n + 1)/6
  • Sum(x^3, x = 1..n) = n^2 (n + 1)^2 / 4
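The two formulas can be written as small helpers and checked against a plain loop. This is only a sketch: the names sum_squares, sum_cubes, and formulas_match are hypothetical, and long long is chosen here to delay overflow.

```c
/* 1^2 + 2^2 + ... + n^2, for n >= 0 */
long long sum_squares(long long n)
{
    return n * (n + 1) * (2 * n + 1) / 6;
}

/* 1^3 + 2^3 + ... + n^3, for n >= 0 */
long long sum_cubes(long long n)
{
    return n * n * (n + 1) * (n + 1) / 4;
}

/* Returns 1 if both formulas agree with direct summation up to n. */
int formulas_match(long long n)
{
    long long sq = 0, cu = 0;
    for (long long x = 1; x <= n; x++) {
        sq += x * x;
        cu += x * x * x;
    }
    return sq == sum_squares(n) && cu == sum_cubes(n);
}
```

Both divisions are exact: n(n + 1)(2n + 1) is always divisible by 6, and n^2 (n + 1)^2 is always divisible by 4.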

Initially:

for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
        if (i > j){
           y = y + 8*(i-j);
        }else{
           y = y + 8*(j-i);
        }
    }
}

Removing the y calculations:

for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
    }
}

Splitting the terms in i, j, and k:

for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 8*i*i + 16*i*k ;                // multiple of  1  (no j)
        x = x + (4*i + 8*k)*j ;                 // multiple of  j
    }
}

Moving them outside (and removing the inner loop, which runs N-i times):

for (i = 0; i <= N; i++) {
    x = x + (8*i*i + 16*i*k) * (N-i) ;
    x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}

Rewriting, grouped by powers of i (here -1/2 denotes the exact fraction, not C integer division, which would give 0):

for (i = 0; i <= N; i++) {
    x = x +         ( 8*k*(N*N+N)/2 ) ;
    x = x +   i   * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
    x = x +  i*i  * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
    x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}

Rewriting and recalculating:

for (i = 0; i <= N; i++) {
    x = x + 4*k*(N*N+N) ;                            // multiple of 1
    x = x +   i   * ( 16*k*N + 2*(N*N+N) - 4*k ) ;   // multiple of i
    x = x +  i*i  * ( 8*N - 20*k - 2 ) ;             // multiple of i^2
    x = x + i*i*i * ( -10 ) ;                        // multiple of i^3
}
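The coefficient collection in the two rewriting steps can be checked by expanding both products by hand. This is only a worked restatement of the algebra above; the shorthand $S = (N^2 + N)/2$ is introduced here and does not appear in the original answer.

```latex
\begin{aligned}
(8i^2 + 16ik)(N - i)
  &= -8\,i^3 + (8N - 16k)\,i^2 + 16kN\,i \\
(4i + 8k)\left(S - \tfrac{i^2 + i}{2}\right)
  &= -2\,i^3 - (2 + 4k)\,i^2 + (4S - 4k)\,i + 8kS \\
\text{sum}
  &= -10\,i^3 + (8N - 20k - 2)\,i^2
     + \bigl(16kN + 2(N^2 + N) - 4k\bigr)\,i + 4k(N^2 + N)
\end{aligned}
```

The last line matches the four statements in the loop term by term, using $4S = 2(N^2 + N)$ and $8kS = 4k(N^2 + N)$.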

Moving outside once more (and removing the i loop):

x = x + ( 4*k*(N*N+N) )              * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 )           * ((N*(N+1)*(2*N+1))/6);
x = x + (-10)                        * ((N*N*(N+1)*(N+1))/4) ;

Both of the above loop removals use the summation formulas:

Sum(1, i = 0..n) = n + 1
Sum(i^1, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2 (n + 1)^2 / 4

This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:

x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;

To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:

sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N

Here is the direct link to the solution. Think before coding. Sometimes your brain can optimize code better than any compiler.
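A hedged sketch comparing the closed form with the nested loops is given here. The names foobar_loops and foobar_closed are hypothetical, the long long return type is an assumption made to delay overflow, and the guard for N <= 0 is added because the loops return 0 for negative N while the polynomial generally does not.

```c
/* Reference: the question's double loop with the dead y code removed. */
long long foobar_loops(int N)
{
    long long x = 0;
    for (int i = 0; i <= N; i++)
        for (int j = i + 1; j <= N; j++)
            x += 4LL * (2 * i + j) * (i + 512);
    return x;
}

/* Closed form: N(N + 1)(N(7N + 8187) - 2050) / 6 for N >= 1. */
long long foobar_closed(int N)
{
    if (N <= 0)
        return 0;           /* the inner loop body never runs in this case */
    long long n = N;        /* widen before multiplying */
    return n * (n + 1) * (n * (7 * n + 8187) - 2050) / 6;
}
```

For N = 1, 2, and 3 both functions return 2048, 14352, and 45148 respectively.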

Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).

What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (i.e., i * 8) remains in the inner loop, and can be removed.

Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration, increment j1 by 4 * t (which is a pre-calculated constant) before using it, since j starts at i + 1. Replace your inner expression with x = x + i1 + j1;.
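One way to read that description is sketched here; this transformation is often called strength reduction. The names i1 and j1 come from the answer, while foobar_strength_reduced and step are hypothetical, and incrementing j1 before the addition is an assumption made so that the first term corresponds to j = i + 1.

```c
int foobar_strength_reduced(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t    = i + 512;
        int i1   = (i << 3) * t;   /* part that does not depend on j   */
        int j1   = (i << 2) * t;   /* will hold 4*j*t after each step  */
        int step = t << 2;         /* 4 * t, fixed for this value of i */
        for (int j = i + 1; j <= N; j++) {
            j1 += step;            /* advance first: j starts at i + 1 */
            x += i1 + j1;          /* no multiplication in the inner loop */
        }
    }
    return x;
}
```

The inner loop now contains only additions; the multiplications happen once per outer iteration.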

One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.

A few other things I can see. You don't need y, so you can remove its declaration and initialisation.

Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.

Also, rather than adding i to 512 each time through, you can note that t starts at 512 and increments by 1 each iteration.

int foobar(int a, int b, int N) {
    int i, j;
    a = 0;
    b = 512;
    for (i = 0; i <= N; i++, b++) {
        for (j = i + 1; j <= N; j++) {
            a = a + ((i<<3) + (j<<2))*b;
        }
    }
    return a;
}

Once you get to this point you can also observe that, aside from initialising j, i and j are each only used in a single multiple: i<<3 and j<<2. We can code this directly in the loop logic, thus:

int foobar(int a, int b, int N) {
    int i, j, iLimit, jLimit;
    a = 0;
    b = 512;
    iLimit = N << 3;
    jLimit = N << 2;
    for (i = 0; i <= iLimit; i+=8) {
        for (j = (i >> 1) + 4; j <= jLimit; j+=4) {  // parentheses needed: + binds tighter than >>
            a = a + (i + j)*b;
        }
        b++;
    }
    return a;
}
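A note on the inner loop's start value: in C, additive operators bind tighter than shift operators, so an initializer written without parentheses around the shift is grouped differently from what the scaling argument needs. The two helper names in this sketch are hypothetical and exist only to show the two groupings.

```c
/* How C groups  i >> 1 + 4  : the addition happens first. */
int start_as_parsed(int i)
{
    return i >> (1 + 4);    /* i / 32 for non-negative i */
}

/* The start value the scaled loop needs: with i holding 8 * i0,
   j must begin at 4 * (i0 + 1) = i / 2 + 4. */
int start_intended(int i)
{
    return (i >> 1) + 4;
}
```

With i = 16 (that is, an original index of 2), start_intended gives 12 = 4 * 3, while start_as_parsed gives 0.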

OK... so here is my solution, along with inline comments to explain what I did and how.

int foobar(int N)
{ // We eliminate unused arguments 
    int x = 0, i = 0, i2 = 0, j, k, z;

    // We only iterate while i < N on the outer loop, since the
    // last iteration (i == N) doesn't do anything useful. Also we keep
    // track of '2*i' (which is used throughout the code) by a 
    // second variable 'i2' which we increment by two in every
    // iteration, essentially converting multiplication into addition.
    while(i < N) 
    {           
        // We hoist the calculation '4 * (i+2*k)' out of the loop
        // since k is a literal constant and 'i' is a constant during
        // the inner loop. We could convert the multiplication by 2
        // into a left shift, but hey, let's not go *crazy*! 
        //
        //  (4 * (i+2*k))         <=>
        //  (4 * i) + (4 * 2 * k) <=>
        //  (2 * i2) + (8 * k)    <=>
        //  (2 * i2) + (8 * 256)  <=>
        //  (2 * i2) + 2048

        k = (2 * i2) + 2048;

        // We have now converted the expression:
        //      x = x + 4*(2*i+j)*(i+2*k);
        //
        // into the expression:
        //      x = x + (i2 + j) * k;
        //
        // Counterintuitively we now *expand* the formula into:
        //      x = x + (i2 * k) + (j * k);
        //
        // Now observe that (i2 * k) is a constant inside the inner
        // loop which we can calculate only once here. Also observe
        // that it is simply added into x a total of (N - i) times, so
        // we take advantage of the abelian nature of addition
        // to hoist it completely out of the loop

        x = x + (i2 * k) * (N - i);

        // Observe that inside this loop we calculate (j * k) repeatedly, 
        // and that j is just an increasing counter. So now instead of
        // doing numerous multiplications, let's break the operation into
        // two parts: a multiplication, which we hoist out of the inner 
        // loop and additions which we continue performing in the inner 
        // loop.

        z = i * k;

        for (j = i + 1; j <= N; j++) 
        {
            z = z + k;          
            x = x + z;      
        }

        i++;
        i2 += 2;
    }   

    return x;
}

The code, without any of the explanations, boils down to this:

int foobar(int N)
{
    int x = 0, i = 0, i2 = 0, j, k, z;

    while(i < N) 
    {                   
        k = (2 * i2) + 2048;

        x = x + (i2 * k) * (N - i);

        z = i * k;

        for (j = i + 1; j <= N; j++) 
        {
            z = z + k;          
            x = x + z;      
        }

        i++;
        i2 += 2;
    }   

    return x;
}

I hope this helps.

int foobar(int N)   // Drop the unused arguments; assumes N >= 0
{
    int i, j, x = 0;   // Remove unused variables and operations to save stack space and machine cycles

    for (i = N; i--; )               // Count down to zero to avoid an explicit comparison
        for (j = N + 1; --j > i; )
            x += (((i<<1)+j)*(i+512)<<2);  // Save machine cycles: shift instead of multiply

    return x;
}

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site's URL or the original address.