
Nested function call faster or not?

I have this silly argument with a friend and need an authoritative word on it.

I have these two snippets and want to know which one is faster? [A or B]

(assuming the compiler does not optimize anything)

[A]

if ( foo () ); 

[B]

int t = foo ();
if ( t )

EDIT: Guys, this might look like a silly question to you, but I have a hardware engineer friend who was arguing that even WITHOUT optimization (take any processor, any compiler pair), CASE B is always faster, because it DOES NOT fetch the result of the previous instruction from memory but accesses the result directly from the Common Data Bus by bypassing (remember the 5-stage pipeline).

My argument was that, without the compiler indicating how much data to copy or check, it is not possible to do that (you have to go to memory to get the data if the compiler is not optimizing it).

The "optimisation" required to convert [B] into [A] is so trivial (especially if t is not used anywhere else) that the compiler probably won't even call it an optimisation. 将[B]转换为[A]所需的“优化”是如此微不足道(特别是如果t不在其他任何地方使用),编译器可能甚至不会其称为优化。 It might be something that it just does as a matter of course, whether or not optimisations are explicitly enabled. 当然,无论是否明确启用优化,它都可能是它所做的事情。

The only way to tell is to ask your compiler to generate an assembly source listing for both bits of code, then compare them.
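For example, a minimal sketch of that comparison (the file name and function bodies are made up for illustration; gcc/g++'s -S switch stops after generating assembly, so nothing has to link):

    // compare.cpp -- hypothetical test file
    int foo();                 // defined elsewhere, so the call cannot be folded away

    int case_a() {
        if (foo())             // [A]: test the return value directly
            return 1;
        return 0;
    }

    int case_b() {
        int t = foo();         // [B]: store the result in a temporary first
        if (t)
            return 1;
        return 0;
    }

    // Generate the listing:   g++ -O0 -S compare.cpp -o compare.s
    // then compare the code emitted for case_a and case_b (repeat with -O1, -O2, ...).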

Executive Summary
1. We are talking about nanoseconds. Light moves a whopping 30 cm in that time.
2. Sometimes, if you are really lucky, [A] is faster.


Side note: [B] may have a different meaning
If the return type of foo is not int but an object that has implicit conversions to both int and bool, different code paths are executed. One might contain a Sleep.
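A minimal sketch of that corner case (the class name and the printed messages are invented for illustration; a real offender might hide the Sleep inside one of the conversion operators):

    #include <cstdio>

    struct Result {
        operator bool() const {        // chosen by "if (foo())": bool is an exact match
            std::puts("bool conversion path");
            return true;
        }
        operator int() const {         // chosen by "int t = foo();": int is an exact match
            std::puts("int conversion path");   // this is where a Sleep could hide
            return 1;
        }
    };

    Result foo() { return Result{}; }

    int main() {
        if (foo()) { }                 // [A]: calls operator bool()
        int t = foo();                 // [B]: calls operator int()
        if (t) { }                     // plain int test, no user-defined conversion
    }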

Assuming a function returning int:

Depends on the compiler
Even with the restriction of "no optimization", there is no guarantee what the generated code will look like. B could be 10 times faster and the compiler would still be compliant (and you most likely wouldn't notice).

Depends on the hardware
Depending on your architecture, there might not even be a difference in the generated code, no matter how hard your compiler tries.

Assuming a modern compiler on a modern x86 / x64 architecture:

On typical compilers, the difference is at most minuscule
For an unoptimized build that stores t in a stack variable, the two extra stack accesses typically cost about 2 clock cycles (less than a nanosecond on my CPU). That is negligible compared to the "surrounding cost": the call to foo, the cost of foo itself, and a branch. An unoptimized call with a full stack frame can easily cost you 20..200 cycles depending on the platform.
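As a rough sanity check (assuming, say, a 3 GHz clock, which is my assumption rather than the answerer's): 2 cycles take about 2 / (3 × 10⁹) s ≈ 0.67 ns, consistent with "less than a nanosecond", and even 200 cycles are only around 67 ns.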

For comparison: the cycle cost of a single memory access that is not in the 1st-level cache (roughly: 100 cycles from the 2nd level, 1000 from main memory, hundreds of thousands from disk).

...or even nonexistent
Even if your compiler isn't optimizing, your CPU might. Due to pairing / microcode generation, the cycle cost may actually be identical.

For the record, gcc, when compiling with optimization specifically disabled (-O0), produces different code for the two inputs (in my case, the body of foo was return rand(); so that the result would not be determined at compile time).
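A sketch of a harness that should reproduce listings like the ones below (the exact source file is guessed; extern "C" merely keeps the symbol name foo unmangled so the listing shows a plain "call foo"; compile with g++ -O0 -S):

    #include <cstdlib>

    extern "C" int foo() {     // result only known at run time, as in the answer
        return std::rand();
    }

    int main() {
        if (foo()) {           // variant without the temporary
            /* inside of if block */
        }

        int t = foo();         // variant with the temporary
        if (t) {
            /* inside of if block */
        }
        return 0;
    }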

Without temporary variable t:

        movl    $0, %eax
        call    foo
        testl   %eax, %eax
        je      .L4
        /* inside of if block */
.L4:
        /* rest of main() */

Here, the return value of foo is stored in the EAX register, and the register is tested against itself to see if it is 0; if so, it jumps over the body of the if block.

With temporary variable t:

        movl    $0, %eax
        call    foo
        movl    %eax, -4(%rbp)
        cmpl    $0, -4(%rbp)
        je      .L4
        /* inside of if block */
.L4:
        /* rest of main() */

Here, the return value of foo is stored in the EAX register and then written to a slot on the stack. Then the contents of that stack location are compared to the literal 0, and if they are equal, it jumps over the body of the if block.

And so if we assume further that the processor is not doing any "optimizations" when it generates the microcode for this, then the version without the temporary should be a few clock cycles faster. It's not going to be substantially faster, because even though the version with a temporary involves an extra store to the stack, that stack value is almost certainly still going to be in the processor's L1 cache when the comparison instruction executes immediately afterwards, so there is no round trip to RAM.

Of course the code becomes identical as soon as you turn on any optimization level, even -O1, and who compiles anything that is so critical that they care about a handful of clock cycles with all optimizations off?

Edit: With regard to your further information about your hardware engineer friend, I can't see how accessing a value in the L1 cache would ever be faster than accessing a register directly. I could see it being just about as fast if the value never even leaves the pipeline, but I can't see it being faster, especially since it still has to execute the movl instruction in addition to the comparison. But show him the assembly code above and ask what he thinks; it will be more productive than trying to discuss the problem in terms of C.

They are likely both going to be the same. That int will be stored into a register in either case.

It really depends on how the compiler is built. But I think in most cases, A will be faster. Here's why:

In B, the compiler might not bother finding out whether t is ever used again, so it will be forced to preserve the value after the if statement. And that could mean pushing it onto the stack.

A will likely be just a tiny bit faster because it does not do a variable assignment. The difference we're talking about is way too small to measure.
