
C optimization question

I'm wondering about the fastest way I can write some code. I have a loop which executes an add on some ints. The loop will be executed many, many times, so I've thought of adding comparisons to check whether any operands are zero, so that they can be left out of the addition, as follows:

if (work1 == 0)
{
    if (work2 == 0)
        tempAnswer = toCarry;
    else
        tempAnswer = work2 + toCarry; 
}
else if (work2 == 0)
    tempAnswer = work1 + toCarry;
else
    tempAnswer = work1 + work2 + toCarry;

I believe the nested IF at the top is already an optimisation, in that it is faster than writing a series of comparisons with &&'s, as I would be checking (work1 == 0) more than once.

Sadly, I wouldn't be able to say just how frequently work1 and work2 would be zero, so assume it's likely to be a balanced distribution of each possible outcome of the IF statement.

So, in light of that, is the above code faster than just writing tempAnswer = work1 + work2 + toCarry, or would all the comparisons possibly cause a lot of drag?

Thanks

That is nonsense.

  • Comparing two integers takes just as long as adding two integers.
  • Doing a branch takes much, much longer than the add (on many, admittedly older, CPUs; see the comments)
  • On more modern architectures, the bottleneck is accessing values from memory, so this scheme is still not helping where it's needed.

Also, think about this logically: why single out zero as the one value you treat as a special case? Why not also check for one, and use tempAnswer++? When you consider all the possibilities, you can see it's a pointless exercise.

The answer, as always, is: profile your code. Write it both ways, time it, and see which is faster.

That said, my money would be on the straight addition being faster than a bunch of comparisons. Every comparison implies a potential branch, and branches can wreak havoc on pipelining in your processor.
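
For instance, here is a minimal timing sketch along those lines (the iteration count, input patterns, and function names are all illustrative assumptions, not anyone's actual workload):

/* Build and run; compare the reported times for the two variants. */
#include <stdio.h>
#include <time.h>

int add_plain(int work1, int work2, int toCarry)
{
    return work1 + work2 + toCarry;
}

int add_branchy(int work1, int work2, int toCarry)
{
    /* Same logic as the question's nested ifs. */
    if (work1 == 0)
        return work2 == 0 ? toCarry : work2 + toCarry;
    if (work2 == 0)
        return work1 + toCarry;
    return work1 + work2 + toCarry;
}

int main(void)
{
    volatile int sink = 0;   /* volatile keeps the compiler from deleting the loops */
    clock_t start = clock();
    for (int i = 0; i < 100000000; i++)
        sink = add_plain(i & 3, (i >> 2) & 3, sink & 1);
    printf("plain:   %.3fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    start = clock();
    for (int i = 0; i < 100000000; i++)
        sink = add_branchy(i & 3, (i >> 2) & 3, sink & 1);
    printf("branchy: %.3fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}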

Branching is most likely going to be slower than adding, so this is probably counterproductive. In any case, it's much harder to read. You really shouldn't try to optimize at this level until you have concrete evidence that you need it; the negative effects on your code are generally not worth it.

No, it's not faster. Branch misprediction is a lot more painful than an add.

The only situation where conditionally checking before performing an addition will save time is if one can avoid an "expensive" write operation. For example, something like:

if (var1 != 0)
    someobject.property1 += var1;

may save time if writing to property1 would be slow, especially if the property doesn't already optimize away writes of a value that is already there. On rare occasions one might benefit from:

if (var1 != 0)
    volatilevar2 += var1;

if multiple processors are all frequently re-reading volatilevar2, and var1 is usually zero. It's doubtful that a situation where the comparison is actually helpful would ever occur "naturally", though one could be contrived. A slightly less contrived version:

if (var1 != 0)
    Threading.Interlocked.Add(volatilevar2, var1);

might be beneficial in some naturally-occurring scenarios.
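
In C, the corresponding idea might look like the following sketch using C11 atomics (the variable and function names here are hypothetical):

#include <stdatomic.h>

atomic_int shared_total;   /* heavily contended across threads */

void accumulate(int var1)
{
    /* Skip the expensive read-modify-write when there is nothing to add. */
    if (var1 != 0)
        atomic_fetch_add(&shared_total, var1);
}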

Of course, if the destination of the addition is a local temp variable that won't be shared with other processors, the possibility of a time savings is essentially nil.

Aside from the fact that a comparison is typically about as fast as an addition (so you'd have more operations, on the average), and the fact that on many architectures branching is expensive if the CPU can't guess which way it'll go, there's also locality of code.

Modern processors keep as much as possible in caches on the processor, or perhaps on the motherboard. Hitting main memory is relatively slow, and reading in a memory page is comparatively very slow. There's a hierarchy from fast and small to slow and big, and one important thing for performance is to try to stay on the "fast and small" side of that hierarchy.

Your code will be in a loop. If that loop fits in one or two cache lines, you're in great shape, since the CPU can execute the loop with absolutely minimal time to fetch instructions, and without kicking other pieces of memory out of the cache.

Therefore, when micro-optimizing, you should try to have inner loops contain small code, which typically means simple and short. In your case, you've got three comparisons and several adds when you could have no comparisons and two adds. This code is much more likely to cause a cache miss than the simpler tempAnswer = work1 + work2 + toCarry; .

Fastest is a relative term. What platform is this for? Does it have a cache? If it has a cache, it is likely a platform that can execute the add in a single clock cycle, so there is no need to optimize out the addition.

The next problem is that a compare is a subtract, and subtract and add go through the same ALU and take the same time as an addition. So for most platforms, old and new, trading compares (subtractions) for additions won't save you anything; instead you end up paying the branch cost, pipeline flush, etc. Even on the ARM platform you still burn a nop or a few.

The first thing you have to do for optimizations like this is look at the compiler output: what instructions is the compiler choosing? (Assuming this is the compiler everyone compiling this code will use, with the same compiler options, etc.) For example, on a chip where add/sub take more than a clock, or a significant number of clocks, XOR, AND, or OR operations may take fewer clocks, and on some processors you can do a compare with zero using a bitwise operation, saving clocks. Did the compiler figure that out and use that faster operation?
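
For example, with gcc you can look at the generated code like this (standard gcc/objdump usage; the file name is illustrative):

$ gcc -O2 -S loop.c                         # writes the generated assembly to loop.s
$ gcc -O2 -c loop.c && objdump -d loop.o    # or disassemble the object file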

As a general-purpose answer to your question, based on the processors out there and the odds of which ones you are or are not using, the single line:

tempAnswer = work1 + work2 + toCarry;

is the most optimized code. The compiler will turn it into two or three instructions on most processors, or at least on the processors I am guessing you are likely to be using.

Your bigger worry is not the add or the comparisons or the branches or branch prediction; your biggest worry is whether these variables are kept in registers. If they all have to go back and forth to the stack/RAM, that will slow your loop, even with a cache. The other code in the loop will determine this, and there are things you can do in your code to minimize register pressure, hopefully allowing these values to stay in registers. Again, disassemble your code to see what the compiler is doing.
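
As an illustration, here is a hypothetical inner loop where the working values are plain locals the compiler can keep in registers (the function and the loop are made up for this sketch):

/* Sketch: the hot values live in locals, so an optimizing compiler can
   keep them in registers for the whole loop instead of spilling to the stack. */
int sum_pairs(const int *a, const int *b, int n)
{
    int toCarry = 0;                /* carried in a register across iterations */
    for (int i = 0; i < n; i++) {
        int work1 = a[i];           /* loaded straight into registers */
        int work2 = b[i];
        toCarry = work1 + work2 + toCarry;
    }
    return toCarry;
}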

I agree with the general tenor of the other comments: the 'optimization' is actually a 'pessimization' that makes the code harder to write, read, and maintain.

Further, the 'optimized' code is bigger than the simple code.

Example functions

$ cat yy.c
int optimizable(int work1, int work2, int toCarry)
{
    int tempAnswer;
    if (work1 == 0)
    {
        if (work2 == 0)
            tempAnswer = toCarry;
        else
            tempAnswer = work2 + toCarry; 
    }
    else if (work2 == 0)
        tempAnswer = work1 + toCarry;
    else
        tempAnswer = work1 + work2 + toCarry;

    return tempAnswer;
}
$ cat xx.c
int optimizable(int work1, int work2, int toCarry)
{
    int tempAnswer;
    tempAnswer = work1 + work2 + toCarry;
    return tempAnswer;
}
$

Compiler

$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Object file sizes with different levels of optimization

$ gcc -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     86       0       0      86      56 xx.o
    134       0       0     134      86 yy.o
$ gcc -O -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     71       0       0      71      47 yy.o
$ gcc -O1 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     71       0       0      71      47 yy.o
$ gcc -O2 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     70       0       0      70      46 yy.o
$ gcc -O3 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     70       0       0      70      46 yy.o
$ gcc -O4 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     70       0       0      70      46 yy.o
$

The code is compiled for 64-bit RedHat Linux on AMD x86-64.

The two functions carry the same infrastructure baggage (3 parameters, 1 local, 1 return). At best, the 'optimized' function is 16 bytes longer than the unoptimized one. Reading the extra code into memory is a performance penalty, and the extra time taken to execute that code is another.

Here comes the classic admonishment: "avoid premature optimization".

Is the function really that critical? Is it called so many times that you have to optimize it?

Now, let's look at Jonathan's answer and think about the "technical debt", i.e., maintainability. Think about your particular environment: in one or two years somebody will look at your code and will find it more difficult to understand, or, even worse, will misunderstand it!

On top of that, compare xx.c and yy.c: which piece of code has higher chances of having a bug?

Good luck!
