
C optimization question

I'm wondering what is the fastest way I can write some code. I have a loop which executes an add on some ints. The loop will be executed many, many times, so I've thought of making comparisons to check whether any of the operands are zero, so that they can be skipped in the addition, as follows:

if (work1 == 0)
{
    if (work2 == 0)
        tempAnswer = toCarry;
    else
        tempAnswer = work2 + toCarry; 
}
else if (work2 == 0)
    tempAnswer = work1 + toCarry;
else
    tempAnswer = work1 + work2 + toCarry;

I believe the nested IF at the top is already an optimisation, in that it is faster than writing a series of comparisons joined with &&'s, since that way I would be checking (work1 == 0) more than once.
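For reference, here is a sketch of the flattened &&-style version I'm comparing against; note that (work1 == 0) ends up being tested twice:

if (work1 == 0 && work2 == 0)
    tempAnswer = toCarry;
else if (work1 == 0)
    tempAnswer = work2 + toCarry;
else if (work2 == 0)
    tempAnswer = work1 + toCarry;
else
    tempAnswer = work1 + work2 + toCarry;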

Sadly, I wouldn't be able to say just how frequently work1 and work2 would be zero, so assume a roughly even distribution across the possible outcomes of the IF statement.

So, in light of that, is the above code faster than just writing tempAnswer = work1 + work2 + toCarry, or would all the comparisons cause a lot of drag?

Thanks

That is nonsense.

  • Comparing two integers takes just as long as adding two integers.
  • Doing a branch takes much, much longer than the add (on many, admittedly older, CPUs; see comments).
  • On more modern architectures, the bottleneck is accessing values from memory, so this scheme still doesn't help where help is needed.

    Also, think about this logically: why single out zero as the one value you treat as a special case? Why not also check for one, and use tempAnswer++? When you consider all the possibilities, you can see it's a pointless exercise.

The answer, as always, is: profile your code. Write it both ways, time it, and see which is faster.
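To make that concrete, here is a minimal sketch of such a comparison, assuming a C99 compiler and using clock() as a crude timer. The iteration count and the synthetic operands (i & 3, i & 7) are arbitrary choices; a real measurement should use representative data and a proper benchmarking setup.

#include <stdio.h>
#include <time.h>

/* Version with the zero checks. */
static int add_checked(int work1, int work2, int toCarry)
{
    if (work1 == 0)
        return (work2 == 0) ? toCarry : work2 + toCarry;
    else if (work2 == 0)
        return work1 + toCarry;
    else
        return work1 + work2 + toCarry;
}

/* Straight addition. */
static int add_plain(int work1, int work2, int toCarry)
{
    return work1 + work2 + toCarry;
}

int main(void)
{
    enum { N = 10000000 };      /* arbitrary iteration count */
    volatile int sink = 0;      /* keeps the compiler from deleting the loops */
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)
        sink += add_checked(i & 3, i & 7, i);   /* operands are sometimes zero */
    t1 = clock();
    printf("checked: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int i = 0; i < N; i++)
        sink += add_plain(i & 3, i & 7, i);
    t1 = clock();
    printf("plain:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}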

That said, my money would be on the straight addition being faster than a bunch of comparisons. Every comparison implies a potential branch, and branches can wreak havoc on pipelining in your processor.

Branching is most likely going to be slower than adding, so this is probably counter-productive. In any case, it's much harder to read. You really shouldn't try to optimize to this level until you have concrete evidence that you need it. The negative effects on your code are generally not worth it.

No, it's not faster. A branch misprediction is a lot more painful than an add.

The only situation where conditionally checking before performing an addition will save time is if one can avoid an "expensive" write operation. For example, something like:

if (var1 != 0)
    someobject.property1 += var1;

may save time if writing to property1 would be slow, especially if the property doesn't already optimize away writing a value that's already there. On rare occasions one might benefit from:

if (var1 != 0)
    volatilevar2 += var1;

if multiple processors are all frequently re-reading volatilevar2 and var1 is usually zero. It's doubtful that a situation where the comparison was helpful would ever occur "naturally", though one could be contrived. A slightly-less-contrived version:

if (var1 != 0)
    Threading.Interlocked.Add(volatilevar2, var1);

might be beneficial in some naturally-occurring scenarios.
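The Threading.Interlocked.Add call above is pseudocode borrowed from another environment; a rough sketch of the same idea in C, assuming a C11 compiler and using <stdatomic.h> (the names here are just placeholders), might look like:

#include <stdatomic.h>

atomic_int volatilevar2;    /* shared counter, contended by several threads */

void add_if_nonzero(int var1)
{
    /* Skip the relatively expensive atomic read-modify-write when
       there is nothing to add. */
    if (var1 != 0)
        atomic_fetch_add(&volatilevar2, var1);
}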

Of course, if the destination of the addition is a local temporary variable that isn't shared with other processors, the possibility of a time saving is essentially nil.

Aside from the fact that a comparison is typically about as fast as an addition (so you'd have more operations, on average), and the fact that on many architectures branching is expensive if the CPU can't guess which way it'll go, there's also locality of code.

Modern processors keep as much as possible in caches on the processor, or perhaps on the motherboard. Hitting main memory is relatively slow, and reading in a memory page is very slow by comparison. There's a hierarchy from fast and small to slow and big. One important thing for performance is to try to stay on the "fast and small" side of that hierarchy.

Your code will be in a loop. If that loop fits in one or two cache lines, you're in great shape, since the CPU can execute the loop with minimal time spent fetching instructions, and without kicking other pieces of memory out of the cache.

Therefore, when micro-optimizing, you should try to keep inner loops small, which typically means simple and short. In your case, you've got three comparisons and several adds where you could have no comparisons and two adds. This code is much more likely to cause a cache miss than the simpler tempAnswer = work1 + work2 + toCarry;.

Fastest is a relative term. What platform is this for? Does it have a cache? If it has a cache, it is likely a platform that can execute the add in a single clock cycle, so there is no need to optimize out the addition. The next problem is that a compare is a subtract; subtract and add go through the same ALU and take the same time, so for most platforms, old and new, trading compares (subtractions) for an addition won't save you anything, and you end up paying for the branch cost, pipeline flushes, etc. Even on the ARM platform you still burn a nop or a few.

The first thing you have to do for optimizations like this is look at the compiler output: what instructions is the compiler choosing? (Assuming this is the compiler everyone compiling this code is using, with the same compiler options, etc.) For example, on a chip where add/sub takes more than one clock, or a significant number of clocks, xor, and, or or operations may take fewer clocks. On some processors you can do a compare with zero using a bitwise operation, saving clocks. Did the compiler figure that out and use that faster operation?

As a general-purpose answer to your question, based on the processors out there and the odds of which ones you are or are not using, the single line:

tempAnswer = work1 + work2 + toCarry;

is the most optimized code. The compiler will turn that into two or three instructions for most processors, or at least the processors I am guessing you are likely using.

Your bigger worry is not the add or the comparisons or the branches or branch prediction; your biggest worry is whether these variables are kept in registers. If they all have to go back and forth to the stack/RAM, that will slow your loop, even with a cache. The other code in the loop will determine this, and there are things you can do in your code to minimize register pressure, hopefully allowing these values to stay in registers. Again, disassemble your code to see what the compiler is doing.
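As a rough sketch of how to inspect that with gcc (where loop.c stands in for whatever file contains your loop; the optimization level is just an example):

$ gcc -O2 -S loop.c      # writes the generated assembly to loop.s
$ gcc -O2 -c loop.c
$ objdump -d loop.o      # disassemble the compiled object file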

I agree with the general tenor of the other comments: the 'optimization' is actually a 'pessimization' that makes the code harder to write, read, and maintain.

Further, the 'optimized' code is bigger than the simple code.

Example functions

$ cat yy.c
int optimizable(int work1, int work2, int toCarry)
{
    int tempAnswer;
    if (work1 == 0)
    {
        if (work2 == 0)
            tempAnswer = toCarry;
        else
            tempAnswer = work2 + toCarry; 
    }
    else if (work2 == 0)
        tempAnswer = work1 + toCarry;
    else
        tempAnswer = work1 + work2 + toCarry;

    return tempAnswer;
}
$ cat xx.c
int optimizable(int work1, int work2, int toCarry)
{
    int tempAnswer;
    tempAnswer = work1 + work2 + toCarry;
    return tempAnswer;
}
$

Compiler

$ gcc --version
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-44)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Object file sizes with different levels of optimization

$ gcc -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     86       0       0      86      56 xx.o
    134       0       0     134      86 yy.o
$ gcc -O -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     71       0       0      71      47 yy.o
$ gcc -O1 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     71       0       0      71      47 yy.o
$ gcc -O2 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     70       0       0      70      46 yy.o
$ gcc -O3 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     70       0       0      70      46 yy.o
$ gcc -O4 -c yy.c xx.c
$ size xx.o yy.o
   text    data     bss     dec     hex filename
     54       0       0      54      36 xx.o
     70       0       0      70      46 yy.o
$

The code is compiled for 64-bit RedHat Linux on AMD x86-64.

The two functions carry the same infrastructural baggage (3 parameters, 1 local, 1 return value). At best, the 'optimized' function is 16 bytes longer than the unoptimized one. Reading the extra code into memory is one performance penalty, and the extra time taken to execute that code is another.

Here comes the classic admonishment: "avoid premature optimization".

Is the function really that critical? Is it called so many times that you have to optimize it?

Now, let's look at Jonathan's answer and think about the "technical debt", i.e., maintainability. Think about your particular environment: in one or two years somebody will look at your code and find it harder to understand, or, even worse, will misunderstand it!

On top of that, compare xx.c and yy.c: which piece of code has the higher chance of containing a bug?

Good luck!
