
Is < faster than <=?

Is if (a < 901) faster than if (a <= 900)?

Not exactly, as in this simple example, but there can be slight performance differences in complex loop code. I suppose it has something to do with the generated machine code, if it's even true.

No, it will not be faster on most architectures. You didn't specify, but on x86, all of the integral comparisons will typically be implemented in two machine instructions:

  • A test or cmp instruction, which sets EFLAGS
  • And a Jcc (jump) instruction, depending on the comparison type (and code layout):
      • jne - Jump if not equal --> ZF = 0
      • jz - Jump if zero (equal) --> ZF = 1
      • jg - Jump if greater --> ZF = 0 and SF = OF
      • (etc...)

Example (edited for brevity), compiled with $ gcc -m32 -S -masm=intel test.c:

    if (a < b) {
        // Do something 1
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jge     .L2                          ; jump if a is >= b
    ; Do something 1
.L2:

And

    if (a <= b) {
        // Do something 2
    }

Compiles to:

    mov     eax, DWORD PTR [esp+24]      ; a
    cmp     eax, DWORD PTR [esp+28]      ; b
    jg      .L5                          ; jump if a is > b
    ; Do something 2
.L5:

So the only difference between the two is a jg versus a jge instruction. The two will take the same amount of time.


I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: in the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made in the Optimization Reference Manual, in Appendix C, "Latency and Throughput."

Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.

Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.

The values for Jcc are:

      Latency   Throughput
Jcc     N/A        0.5

with the following footnote on Jcc:

  1. Selection of conditional jump instructions should be based on the recommendation of Section 3.4.1, "Branch Prediction Optimization," to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.

So, nothing in the Intel docs ever treats one Jcc instruction any differently from the others.

If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then no reason that an instruction testing two bits should take any more or less time than one testing only one (ignoring gate propagation delay, which is much less than the clock period).
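That flag logic can be sketched in C. This is an illustrative model of the boolean combinations only, not Intel's actual circuitry; the point is that a two-bit condition (jle) is the same amount of combinational logic as a one-bit one (jge):

```c
#include <stdbool.h>

/* Illustrative model of how the signed Jcc conditions derive from
   EFLAGS bits after a CMP -- just the boolean logic, not real circuitry. */
typedef struct { bool zf, sf, of; } eflags;

static bool cond_jl (eflags f) { return f.sf != f.of; }             /* a <  b */
static bool cond_jle(eflags f) { return f.zf || (f.sf != f.of); }   /* a <= b */
static bool cond_jg (eflags f) { return !f.zf && (f.sf == f.of); }  /* a >  b */
static bool cond_jge(eflags f) { return f.sf == f.of; }             /* a >= b */

/* The flags a "cmp a, b" would set, modeled with a widened subtraction. */
static eflags cmp_flags(int a, int b) {
    long long d = (long long)a - b;
    eflags f = { d == 0, (int)d < 0, d != (int)d };
    return f;
}
```

Each condition is at most one extra gate over its neighbor, so there is no cycle-count difference between them.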


Edit: Floating Point

This holds true for x87 floating point as well: (pretty much the same code as above, but with double instead of int.)

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; Compare ST(0) and ST(1), and set CF, PF, ZF in EFLAGS
        fstp    st(0)
        seta    al                     ; Set al if above (CF=0 and ZF=0).
        test    al, al
        je      .L2
        ; Do something 1
.L2:

        fld     QWORD PTR [esp+32]
        fld     QWORD PTR [esp+40]
        fucomip st, st(1)              ; (same thing as above)
        fstp    st(0)
        setae   al                     ; Set al if above or equal (CF=0).
        test    al, al
        je      .L5
        ; Do something 2
.L5:
        leave
        ret

Historically (we're talking the 1980s and early 1990s), there were some architectures on which this was true. The root issue is that integer comparison is inherently implemented via integer subtraction. This gives rise to the following cases.

Comparison     Subtraction
----------     -----------
A < B      --> A - B < 0
A = B      --> A - B = 0
A > B      --> A - B > 0

Now, when A < B the subtraction has to borrow a high bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero, which implied equality.

There were usually at least two conditional branch instructions, one to branch on the carry bit and one on the zero bit.

Now, to get to the heart of the matter, let's expand the previous table to include the carry and zero bit results.

Comparison     Subtraction  Carry Bit  Zero Bit
----------     -----------  ---------  --------
A < B      --> A - B < 0    0          0
A = B      --> A - B = 0    1          1
A > B      --> A - B > 0    1          0

So, implementing a branch for A < B can be done in one instruction, because the carry bit is clear only in this case, that is,

;; Implementation of "if (A < B) goto address;"
cmp  A, B          ;; compare A to B
bcz  address       ;; Branch if Carry is Zero to the new address

But if we want to do a less-than-or-equal comparison, we need to do an additional check of the zero flag to catch the case of equality.

;; Implementation of "if (A <= B) goto address;"
cmp A, B           ;; compare A to B
bcz address        ;; branch if A < B
bzs address        ;; also, Branch if the Zero bit is Set

So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speeds and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.
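The two branch sequences can be sketched in C by modeling that hypothetical flag register. The bcz/bzs instructions and the carry-clear-on-borrow convention come from the description above, not from any specific real ISA:

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the hypothetical machine above: after "cmp A, B", the carry
   bit is CLEAR when the subtraction had to borrow (A < B), and the
   zero bit is set when A == B. */
typedef struct { bool carry, zero; } cond_bits;

static cond_bits cmp_ab(uint8_t a, uint8_t b) {
    cond_bits c = { a >= b /* no borrow needed */, a == b };
    return c;
}

/* "if (A < B) goto ...": a single test of the carry bit (one bcz). */
static bool take_branch_lt(uint8_t a, uint8_t b) {
    return !cmp_ab(a, b).carry;
}

/* "if (A <= B) goto ...": tests carry, then zero (bcz + bzs). */
static bool take_branch_le(uint8_t a, uint8_t b) {
    cond_bits c = cmp_ab(a, b);
    return !c.carry || c.zero;
}
```

The <= version needs the extra zero-bit test, which is exactly the extra instruction described above.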

Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.

If there were some platform where < was faster than <= for simple integer types, the compiler should always convert <= to < for constants. Any compiler that didn't would just be a bad compiler (for that platform).

I see that neither is faster. The compiler generates the same machine code in each condition, just with a different value.

if(a < 901)
cmpl  $900, -4(%rbp)
jg .L2

if(a <=901)
cmpl  $901, -4(%rbp)
jg .L3

My example if is from GCC on the x86_64 platform on Linux.

Compiler writers are pretty smart people, and they think of these things and many others most of us take for granted.

I noticed that if it is not a constant, then the same machine code is generated in either case.

int b;
if(a < b)
cmpl  -4(%rbp), %eax
jge   .L2

if(a <=b)
cmpl  -4(%rbp), %eax
jg .L3

For floating point code, the <= comparison may indeed be slower (by one instruction) even on modern architectures. Here's the first function:

int compare_strict(double a, double b) { return a < b; }

On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.

Now consider this function instead:

int compare_loose(double a, double b) { return a <= b; }

This requires the same work as compare_strict above, but now there are two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.

You might think that the compiler could optimize the second function like so:

int compare_loose(double a, double b) { return ! (a > b); }

However, this would incorrectly handle NaNs. NaN1 <= NaN2 and NaN1 > NaN2 both need to evaluate to false.
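That claim is easy to check in standard C (using the NAN macro from <math.h>): with a NaN operand, a <= b and !(a > b) disagree, which is exactly why the compiler can't do that rewrite.

```c
#include <math.h>
#include <stdbool.h>

/* With a NaN operand, every ordered comparison is false, so (a <= b)
   and !(a > b) are NOT equivalent -- they only agree when neither
   operand is NaN. */
static bool leq(double a, double b)    { return a <= b; }
static bool not_gt(double a, double b) { return !(a > b); }
```

For ordinary values the two functions agree, but leq(NAN, 1.0) is false while not_gt(NAN, 1.0) is true.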

Maybe the author of that unnamed book has read that a > 0 runs faster than a >= 1 and thinks that is true universally.

But that is because a 0 is involved (since CMP can, depending on the architecture, be replaced e.g. with OR) and not because of the <.

At the very least, if this were true, a compiler could trivially optimize a <= b to !(a > b), so even if the comparison itself were actually slower, you would notice no difference with anything but the most naive compilers.

TL;DR answer

For most combinations of architecture, compiler and language, < will not be faster than <=.

Full answer

Other answers have concentrated on the x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture-specific, and is as likely to be an anti-optimisation as it is to be an optimisation.

As such, I would suggest that this sort of micro-optimisation is an example of cargo cult programming rather than best software engineering practice.

Counterexample

There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.

Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case, though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single-byte instruction which took a single cycle to run, so it had the smallest overhead possible.

Summary

Whether a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture-specific micro-optimisations; otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.

They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because to do this, the CPU will do a subtraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.

This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.

That would be pretty unusual though, as the compiler could work around it, making it irrelevant.

You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition hold unless you're going to use some magic constants, which is a very bad practice by all means.

When I wrote the first version of this answer, I was only looking at the title question about < vs. <= in general, not the specific example of a constant a < 901 vs. a <= 900. Many compilers always shrink the magnitude of constants by converting between < and <=, e.g. because x86 immediate operands have a shorter 1-byte encoding for -128..127.

For ARM, being able to encode as an immediate depends on being able to rotate a narrow field into any position in a word. So cmp r0, #0x00f000 would be encodeable, while cmp r0, #0x00efff would not be. So the make-it-smaller rule for comparison against a compile-time constant doesn't always apply for ARM. AArch64 is either shift-by-12 or not, instead of an arbitrary rotation, for instructions like cmp and cmn, unlike 32-bit ARM and Thumb modes.


< vs. <= in general, including for runtime-variable conditions

In assembly language on most machines, a comparison for <= has the same cost as a comparison for <. This applies whether you're branching on it, booleanizing it to create a 0/1 integer, or using it as a predicate for a branchless select operation (like x86 CMOV). The other answers have only addressed this part of the question.

But this question is about the C++ operators, the input to the optimizer. Normally they're both equally efficient; the advice from the book sounds totally bogus because compilers can always transform the comparison that they implement in asm. But there is at least one exception where using <= can accidentally create something the compiler can't optimize.

As a loop condition, there are cases where <= is qualitatively different from <: when it stops the compiler from proving that a loop is not infinite. This can make a big difference, disabling auto-vectorization.

Unsigned overflow is well-defined as base-2 wrap-around, unlike signed overflow (UB). Signed loop counters are generally safe from this with compilers that optimize based on signed-overflow UB not happening: ++i <= size will always eventually become false. (What Every C Programmer Should Know About Undefined Behavior)

void foo(unsigned size) {
    unsigned upper_bound = size - 1;  // or any calculation that could produce UINT_MAX
    for(unsigned i=0 ; i <= upper_bound ; i++)
        ...
}

Compilers can only optimize in ways that preserve the (defined and legally observable) behaviour of the C++ source for all possible input values, except ones that lead to undefined behaviour.

(A simple i <= size would create the problem too, but I thought calculating an upper bound was a more realistic example of accidentally introducing the possibility of an infinite loop for an input you don't care about, but which the compiler must consider.)

In this case, size=0 leads to upper_bound=UINT_MAX, and i <= UINT_MAX is always true. So this loop is infinite for size=0, and the compiler has to respect that, even though you as the programmer probably never intend to pass size=0. If the compiler can inline this function into a caller where it can prove that size=0 is impossible, then great, it can optimize like it could for i < size.
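The hazard can be seen without running the infinite loop itself, just by checking where the upper bound lands (a minimal sketch, using UINT_MAX from <limits.h>):

```c
#include <limits.h>

/* The size = 0 case from foo() above: the upper bound wraps around to
   UINT_MAX, and no unsigned i can ever exceed it, so i <= upper_bound
   can never become false and the loop never terminates. */
static unsigned wrapped_upper_bound(unsigned size) {
    return size - 1;   /* well-defined base-2 wraparound for unsigned */
}
```

For any nonzero size the bound is the expected size - 1; only size = 0 produces the pathological UINT_MAX.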

Asm like if(!size) skip the loop; do{...}while(--size); is one normally-efficient way to optimize a for( i<size ) loop, if the actual value of i isn't needed inside the loop (Why are loops always compiled into "do...while" style (tail jump)?).

But that do{}while can't be infinite: if entered with size==0, we get 2^n iterations. (Iterating over all unsigned integers in a for loop C makes it possible to express a loop over all unsigned integers including zero, but it's not easy without a carry flag the way it is in asm.)

With wraparound of the loop counter being a possibility, modern compilers often just "give up", and don't optimize nearly as aggressively.

Example: sum of integers from 1 to n

Using unsigned i <= n defeats clang's idiom-recognition, which optimizes sum(1..n) loops with a closed form based on Gauss's n * (n+1) / 2 formula.

unsigned sum_1_to_n_finite(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0 ; i < n+1 ; ++i)
        total += i;
    return total;
}

x86-64 asm from clang7.0 and gcc8.2 on the Godbolt compiler explorer

 # clang7.0 -O3 closed-form
    cmp     edi, -1       # n passed in EDI: x86-64 System V calling convention
    je      .LBB1_1       # if (n == UINT_MAX) return 0;  // C++ loop runs 0 times
          # else fall through into the closed-form calc
    mov     ecx, edi         # zero-extend n into RCX
    lea     eax, [rdi - 1]   # n-1
    imul    rax, rcx         # n * (n-1)             # 64-bit
    shr     rax              # n * (n-1) / 2
    add     eax, edi         # n + (stuff / 2) = n * (n+1) / 2   # truncated to 32-bit
    ret          # computed without possible overflow of the product before right shifting
.LBB1_1:
    xor     eax, eax
    ret

But for the naive version, we just get a dumb loop from clang.

unsigned sum_1_to_n_naive(unsigned n) {
    unsigned total = 0;
    for (unsigned i = 0 ; i<=n ; ++i)
        total += i;
    return total;
}

# clang7.0 -O3
sum_1_to_n(unsigned int):
    xor     ecx, ecx           # i = 0
    xor     eax, eax           # retval = 0
.LBB0_1:                       # do {
    add     eax, ecx             # retval += i
    add     ecx, 1               # ++i
    cmp     ecx, edi
    jbe     .LBB0_1            # } while( i<=n );
    ret

GCC doesn't use a closed form either way, so the choice of loop condition doesn't really hurt it; it auto-vectorizes with SIMD integer addition, running 4 i values in parallel in the elements of an XMM register.

# "naive" inner loop
.L3:
    add     eax, 1       # do {
    paddd   xmm0, xmm1    # vect_total_4.6, vect_vec_iv_.5
    paddd   xmm1, xmm2    # vect_vec_iv_.5, tmp114
    cmp     edx, eax      # bnd.1, ivtmp.14     # bound and induction-variable tmp, I think.
    ja      .L3 #,       # }while( n > i )

 # "finite" inner loop
  # before the loop:
  # xmm0 = 0 = totals
  # xmm1 = {0,1,2,3} = i
  # xmm2 = set1_epi32(4)
 .L13:                # do {
    add     eax, 1       # i++
    paddd   xmm0, xmm1    # total[0..3] += i[0..3]
    paddd   xmm1, xmm2    # i[0..3] += 4
    cmp     eax, edx
    jne     .L13      # }while( i != upper_limit );

     # then horizontal sum xmm0
     # and peeled cleanup for the last n%4 iterations, or something.
     

It also has a plain scalar loop which I think it uses for very small n, and/or for the infinite-loop case.

BTW, both of these loops waste an instruction (and a uop on Sandybridge-family CPUs) on loop overhead. sub eax,1 / jnz instead of add eax,1 / cmp / jcc would be more efficient: 1 uop instead of 2 (after macro-fusion of sub/jcc or cmp/jcc). The code after both loops writes EAX unconditionally, so it's not using the final value of the loop counter.

You could say that line is correct in most scripting languages, since the extra character results in slightly slower code processing. However, as the top answer pointed out, it should have no effect in C++, and anything being done with a scripting language probably isn't that concerned about optimization.

Only if the people who created the computers were bad with boolean logic. Which they shouldn't be.

Every comparison (>=, <=, >, <) can be done at the same speed.

What every comparison is, is just a subtraction (the difference) and seeing whether it's positive/negative.
(If the msb is set, the number is negative.)

How to check a >= b? Subtract: a - b >= 0. Check if a - b is positive.
How to check a <= b? Subtract: 0 <= b - a. Check if b - a is positive.
How to check a < b? Subtract: a - b < 0. Check if a - b is negative.
How to check a > b? Subtract: 0 > b - a. Check if b - a is negative.

Simply put, the computer can just do this under the hood for the given op:

a >= b == msb(a-b) == 0
a <= b == msb(b-a) == 0
a > b  == msb(b-a) == 1
a < b  == msb(a-b) == 1

And of course the computer wouldn't actually need to do the ==0 or ==1 either; for the ==0 it could just invert the msb from the circuit.

Anyway, they most certainly wouldn't have made a >= b be calculated as a>b || a==b, lol.
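The sign-bit trick above can be sketched in C. This is a toy model: it is only valid when the subtraction cannot overflow, so the difference is widened to 64 bits here, and the helper names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* msb(x): the sign bit of a 64-bit value. */
static bool msb(int64_t x) { return (uint64_t)x >> 63; }

/* All four comparisons as one subtraction plus one sign-bit test,
   matching the table above. Widening avoids overflow of a - b. */
static bool lt_(int32_t a, int32_t b) { return  msb((int64_t)a - b); } /* a <  b */
static bool ge_(int32_t a, int32_t b) { return !msb((int64_t)a - b); } /* a >= b */
static bool gt_(int32_t a, int32_t b) { return  msb((int64_t)b - a); } /* a >  b */
static bool le_(int32_t a, int32_t b) { return !msb((int64_t)b - a); } /* a <= b */
```

Every operator is the same amount of work: one subtraction, one bit test.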

In C and C++, an important rule for the compiler is the "as-if" rule: if doing X has the exact same behavior as doing Y, then the compiler is free to choose which one it uses.

In your case, "a < 901" and "a <= 900" always have the same result, so the compiler is free to compile either version. If one version were faster, for whatever reason, then any quality compiler would produce code for the version that is faster. So unless your compiler produced exceptionally bad code, both versions would run at equal speed.
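The semantic equivalence is easy to sanity-check around the boundary (forms_agree is just an illustrative helper, not anything from the question):

```c
#include <stdbool.h>

/* a < 901 and a <= 900 are the same predicate on every int, which is
   why the as-if rule lets the compiler emit either encoding. */
static bool forms_agree(int lo, int hi) {
    for (int a = lo; a <= hi; a++)
        if ((a < 901) != (a <= 900))
            return false;
    return true;
}
```

The two forms agree on every value, including 900 and 901 themselves.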

Now if you had a situation where two bits of code will always produce the same result, but it is hard to prove to the compiler, and/or it is hard for the compiler to prove which version, if any, is faster, then you might get different code running at different speeds.

PS: The original example might run at different speeds if the processor supports single-byte constants (faster) and multi-byte constants (slower), so comparing against 255 (1 byte) might be faster than comparing against 256 (two bytes). I'd expect the compiler to do whatever is faster.

Only if the computation path depends on the data:

int a[] = {1, 1, 1, 1, 1000, 1, 1, 1, 1};
int i = 0;
while (i <= 4)
{
    for (int j = 0; j < a[i]; j++) { do_work(); }
    i++;
}

will compute 250 times more than while (i < 4).
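Counting the iterations makes that concrete (total_work is a hypothetical helper standing in for the do_work() calls; a_vals mirrors the array above):

```c
/* Count how many times do_work() would run for the first n_outer
   elements of the array. */
static long total_work(const int *a, int n_outer) {
    long calls = 0;
    for (int i = 0; i < n_outer; i++)
        for (int j = 0; j < a[i]; j++)
            calls++;                 /* stands in for do_work() */
    return calls;
}

static const int a_vals[] = {1, 1, 1, 1, 1000, 1, 1, 1, 1};
```

With i <= 4 the loop covers five elements, including a[4] = 1000, for 1004 units of work; with i < 4 it does only 4, roughly the 250x difference claimed.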

A real-world sample would be computing the Mandelbrot set. If you include a pixel that iterates 1,000,000 times, it will cause a lag, but the probability of that coinciding with <= usage is too low.
