Optimization of C code
For an assignment in a course called High Performance Computing, I was required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
return x;
}
Following some recommendations, I managed to optimize the code (or at least I think so). Here's my code:
int foobar(int a, int b, int N) {
int i, j, x, y, t;
x = 0;
y = 0;
for (i = 0; i <= N; i++) {
t = i + 512;
for (j = i + 1; j <= N; j++) {
x = x + ((i<<3) + (j<<2))*t;
}
}
return x;
}
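One way to gain confidence that a rewrite like this preserves the result is to compare it against the original over a range of inputs. The sketch below is only an illustration: the names foobar_ref, foobar_opt and same_result are invented here, the unused parameters a and b are dropped, and N is assumed small enough that x does not overflow int.

```c
/* Reference: the original double loop, minus the unused y. */
int foobar_ref(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++)
        for (int j = i + 1; j <= N; j++)
            x = x + 4 * (2 * i + j) * (i + 2 * 256);
    return x;
}

/* Rewrite under test: shifts instead of multiplications, t hoisted. */
int foobar_opt(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t = i + 512;
        for (int j = i + 1; j <= N; j++)
            x = x + ((i << 3) + (j << 2)) * t;
    }
    return x;
}

/* Returns 1 if both versions agree for every N in 0..max_n, else 0. */
int same_result(int max_n)
{
    for (int n = 0; n <= max_n; n++)
        if (foobar_ref(n) != foobar_opt(n))
            return 0;
    return 1;
}
```

A call such as same_result(50) stays well inside the int range for this particular expression.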
According to my instructor, well-optimized code should have fewer, or less costly, instructions at the assembly-language level, and should therefore run in less time than the original code. That is, the calculation is made with:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command gcc -o code_opt.s -S foobar.c, the generated code has many more lines than the original despite the optimizations I made, and the run time is lower, but not by much. What am I doing wrong?
I have not pasted the assembly code, as both listings are very long. I call the function foobar from main and measure the execution time with the time command on Linux:
#include <stdio.h>

int main(void) {
    int a, b, N;
    scanf("%d %d %d", &a, &b, &N);
    printf("%d\n", foobar(a, b, N));
    return 0;
}
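The time command measures the whole process, including start-up and the scanf/printf calls, so a single call to foobar can be hard to see in the total. A rough alternative, sketched here under the assumption that calling the function repeatedly inside the program is acceptable, uses the standard clock() function. The names foobar_loop and time_foobar are invented for this sketch.

```c
#include <time.h>

/* Same double loop as the question's foobar, without the unused y. */
static int foobar_loop(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++)
        for (int j = i + 1; j <= N; j++)
            x = x + 4 * (2 * i + j) * (i + 2 * 256);
    return x;
}

/* Calls foobar_loop(N) reps times and returns the CPU seconds used. */
double time_foobar(int N, int reps)
{
    volatile int n = N;      /* re-read every call so calls are not merged */
    volatile int sink = 0;   /* keeps the results from being discarded */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink = foobar_loop(n);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing the returned value by reps gives an approximate per-call cost. Building both versions with the same compiler options keeps the comparison fair.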
y does not affect the final result of the code, so it can be removed:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
//y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
//if (i > j){
// y = y + 8*(i-j);
//}else{
// y = y + 8*(j-i);
//}
}
}
return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
int i, j, x;
x = 0;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*256);
}
}
return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of the terms to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum-of-squares and sum-of-cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise to you. As a result, i and j will also be removed.
a and b should also be removed if possible, since a and b are supplied as arguments but never used in your code.
Sum of squares and sum of cubes formulas: 1^2 + 2^2 + ... + n^2 = n(n+1)(2n+1)/6 and 1^3 + 2^3 + ... + n^3 = n^2(n+1)^2/4.
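As an intermediate step before the full closed form, the version with only the outer loop can be sketched as follows. The function name foobar_outer and the use of long long are assumptions of this sketch (the original works in int); the per-iteration expression is the one given above, with the sum of j over i+1..N written as (N-i)*(N+i+1)/2.

```c
/* One loop over i; the inner sum over j = i+1..N is replaced by
 * its closed form. long long is used here only to delay overflow. */
long long foobar_outer(int N)
{
    long long x = 0;
    for (long long i = 0; i <= N; i++) {
        long long cnt = N - i;             /* number of j values */
        long long s = cnt * (N + i + 1);   /* twice the sum of j, j = i+1..N */
        x += 8 * i * i * cnt + 4096 * i * cnt + 2 * i * s + 1024 * s;
    }
    return x;
}
```

For small inputs this agrees with the nested loops, for example foobar_outer(2) gives 14352.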
Initially:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
Removing the y calculations:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
}
}
Splitting the expression into its i, j, and k terms:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 8*i*i + 16*i*k ; // multiple of 1 (no j)
x = x + (4*i + 8*k)*j ; // multiple of j
}
}
Moving them outside the inner loop (and removing that loop, which runs N-i times):
for (i = 0; i <= N; i++) {
x = x + (8*i*i + 16*i*k) * (N-i) ;
x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting, grouped by powers of i (the -1/2 factors are meant as exact fractions here, not C integer division):
for (i = 0; i <= N; i++) {
x = x + ( 8*k*(N*N+N)/2 ) ;
x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting and recalculating:
for (i = 0; i <= N; i++) {
x = x + 4*k*(N*N+N) ; // multiple of 1
x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
x = x + i*i * ( 8*N - 20*k - 2 ) ; // multiple of i^2
x = x + i*i*i * ( -10 ) ; // multiple of i^3
}
Another move outside (and removal of the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
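Wrapped into a function, the four terms above might look like the sketch below. The name foobar_sums, the long long type, and the explicit guard for negative N are assumptions added here; the original loops simply do nothing when N is negative, whereas the formulas are only valid for N >= 0.

```c
/* Closed form assembled from the four terms above, with k = 256. */
long long foobar_sums(int N)
{
    const long long k = 256;
    long long n = N;
    long long x = 0;
    if (N < 0)
        return 0;   /* the loops never run for negative N */
    x += (4 * k * (n * n + n)) * (n + 1);
    x += (16 * k * n + 2 * (n * n + n) - 4 * k) * ((n * (n + 1)) / 2);
    x += (8 * n - 20 * k - 2) * ((n * (n + 1) * (2 * n + 1)) / 6);
    x += (-10) * ((n * n * (n + 1) * (n + 1)) / 4);
    return x;
}
```

Each of the three divisions is exact for non-negative n, so integer division loses nothing here.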
Both of the loop removals above use the summation formulas:
Sum(1, i = 0..n) = n + 1
Sum(i^1, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2 (n + 1)^2 / 4
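These four formulas can be checked by brute force for small n. The helper names power_sum_loop and power_sum_formula below are hypothetical and exist only for this sketch, which assumes n >= 0 and an exponent p between 0 and 3.

```c
/* Brute-force sum of i^p for i = 0..n (p = 0 counts the n + 1 terms). */
long long power_sum_loop(int n, int p)
{
    long long total = 0;
    for (long long i = 0; i <= n; i++) {
        long long term = 1;
        for (int e = 0; e < p; e++)
            term *= i;
        total += term;
    }
    return total;
}

/* Closed forms listed above, valid for n >= 0 and p in 0..3. */
long long power_sum_formula(int n, int p)
{
    long long m = n;
    switch (p) {
    case 0:  return m + 1;
    case 1:  return m * (m + 1) / 2;
    case 2:  return m * (m + 1) * (2 * m + 1) / 6;
    case 3:  return m * m * (m + 1) * (m + 1) / 4;
    default: return -1;   /* not covered by this sketch */
    }
}
```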
This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Think before coding. Sometimes your brain can optimize code better than any compiler.
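As a function, the closed form might be written as in the sketch below. Two things here are assumptions rather than part of the answer: the name foobar_closed, and the switch to long long. The latter is worth considering because the intermediate product N*(N+1)*(...) outgrows a 32-bit int before the final result does (at N = 100 the product is about 8.9e9 while the result is about 1.5e9).

```c
/* Closed form of the nested loops; no loop remains. */
long long foobar_closed(int N)
{
    long long n = N;
    if (n < 0)
        return 0;   /* the original loops do nothing for negative N */
    return n * (n + 1) * (n * (7 * n + 8187) - 2050) / 6;
}
```

The guard for negative N mirrors the loop version, which returns 0 in that case; the polynomial itself is only equivalent for N >= 0.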
Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (i.e., i * 8) remains in the inner loop, and can be removed.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration, first increment j1 by 4 * t (a constant pre-calculated outside the inner loop), so that j1 equals (j<<2) * t, and replace your inner expression with x = x + i1 + j1;.
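The stride idea described above could be sketched like this. The function name foobar_stride and the variable step are invented for the illustration, and the increment is placed before the addition because j starts at i + 1; N is assumed small enough that x fits in int.

```c
/* Strength-reduced inner loop: i1 is hoisted, j1 advances by a fixed step. */
int foobar_stride(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t = i + 512;
        int i1 = (i << 3) * t;   /* loop-invariant part */
        int step = 4 * t;        /* stride of the j-dependent part */
        int j1 = (i << 2) * t;   /* becomes (j << 2) * t after each increment */
        for (int j = i + 1; j <= N; j++) {
            j1 += step;
            x = x + i1 + j1;
        }
    }
    return x;
}
```

The inner loop now contains only additions; the multiplications happen once per outer iteration.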
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through, you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
int i, j;
a = 0;
b = 512;
for (i = 0; i <= N; i++, b++) {
for (j = i + 1; j <= N; j++) {
a = a + ((i<<3) + (j<<2))*b;
}
}
return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are each only used in a single multiple: i<<3 and j<<2. We can code this directly in the loop logic, thus:
int foobar(int a, int b, int N) {
int i, j, iLimit, jLimit;
a = 0;
b = 512;
iLimit = N << 3;
jLimit = N << 2;
for (i = 0; i <= iLimit; i+=8) {
for (j = (i >> 1) + 4; j <= jLimit; j+=4) {
a = a + (i + j)*b;
}
b++;
}
return a;
}
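One detail to watch in a loop header like this is operator precedence: in C, + binds more tightly than >>, so an unparenthesized i >> 1 + 4 means i >> 5, not (i >> 1) + 4. The sketch below spells out the intended start value and can be compared against the nested-loop result for small N. The name foobar_scaled is hypothetical, a local x replaces the reused parameter a, and N is assumed non-negative and small enough to avoid int overflow.

```c
/* Scaled counters: i runs over 8*i0 and j over 4*j0, so the inner
 * loop must start at 4*(i0 + 1), which is (i >> 1) + 4. */
int foobar_scaled(int N)
{
    int x = 0;
    int b = 512;
    int iLimit = N << 3;
    int jLimit = N << 2;
    for (int i = 0; i <= iLimit; i += 8) {
        for (int j = (i >> 1) + 4; j <= jLimit; j += 4)
            x = x + (i + j) * b;
        b++;
    }
    return x;
}
```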
OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{ // We eliminate unused arguments
int x = 0, i = 0, i2 = 0, j, k, z;
// The outer loop only runs while i < N, since the iteration
// with i == N doesn't do anything useful. Also we keep
// track of '2*i' (which is used throughout the code) by a
// second variable 'i2' which we increment by two in every
// iteration, essentially converting multiplication into addition.
while(i < N)
{
// We hoist the calculation '4 * (i+2*k)' out of the loop
// since k is a literal constant and 'i' is a constant during
// the inner loop. We could convert the multiplication by 2
// into a left shift, but hey, let's not go *crazy*!
//
// (4 * (i+2*k)) <=>
// (4 * i) + (4 * 2 * k) <=>
// (2 * i2) + (8 * k) <=>
// (2 * i2) + (8 * 256) <=>
// (2 * i2) + 2048
k = (2 * i2) + 2048;
// We have now converted the expression:
// x = x + 4*(2*i+j)*(i+2*k);
//
// into the expression:
// x = x + (i2 + j) * k;
//
// Counterintuitively we now *expand* the formula into:
// x = x + (i2 * k) + (j * k);
//
// Now observe that (i2 * k) is a constant inside the inner
// loop which we can calculate only once here. Also observe
// that it is simply added into x a total of (N - i) times, so
// we take advantage of the abelian nature of addition
// to hoist it completely out of the loop
x = x + (i2 * k) * (N - i);
// Observe that inside this loop we calculate (j * k) repeatedly,
// and that j is just an increasing counter. So now instead of
// doing numerous multiplications, let's break the operation into
// two parts: a multiplication, which we hoist out of the inner
// loop and additions which we continue performing in the inner
// loop.
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
The code, without any of the explanations, boils down to this:
int foobar(int N)
{
int x = 0, i = 0, i2 = 0, j, k, z;
while(i < N)
{
k = (2 * i2) + 2048;
x = x + (i2 * k) * (N - i);
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
I hope this helps.
int foobar(int N)   // drop the unused arguments a and b
{
    int i, j, x = 0;   // drop unused variables and operations to save stack space and machine cycles
    for (i = N; i--; )   // count down to zero so no separate comparison is needed (assumes N >= 0)
        for (j = N + 1; --j > i; )
            x += (((i << 1) + j) * (i + 512) << 2);   // save machine cycles: shifts instead of multiplications
    return x;
}