
Why is inlining considered faster than a function call?

Now, I know it's because there's no function-call overhead, but is that overhead really so heavy (and worth the bloat of having the code inlined)?

From what I can remember, when a function is called, say f(x,y), x and y are pushed onto the stack, and the stack pointer jumps to an empty block, and begins execution. I know this is a bit of an oversimplification, but am I missing something? A few pushes and a jump to call a function, is there really that much overhead?

Let me know if I'm forgetting something, thanks!

Aside from the fact that there's no call (and therefore no associated expenses, like parameter preparation before the call and cleanup after the call), there's another significant advantage of inlining. When the function body is inlined, its body can be re-interpreted in the specific context of the caller. This might immediately allow the compiler to further reduce and optimize the code.

For one simple example, this function

void foo(bool b) {
  if (b) {
    // something
  }
  else {
    // something else
  }
}

will require actual branching if called as a non-inlined function

foo(true);
...
foo(false);

However, if the above calls are inlined, the compiler will immediately be able to eliminate the branching. Essentially, in the above case inlining allows the compiler to interpret the function parameter as a compile-time constant (if the argument passed is a compile-time constant) - something that is generally not possible with non-inlined functions.
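To make the branch elimination concrete, here is a minimal sketch. The names scale, caller and caller_after_inlining are made up for illustration, and the "after" version is only what an optimizer could plausibly reduce the code to, not the guaranteed output of any particular compiler.

```cpp
// Hypothetical stand-in for foo(bool) above; it returns a value so that
// both paths are observable.
inline int scale(bool doubled, int x) {
    if (doubled) {
        return x * 2;  // "something"
    } else {
        return x + 1;  // "something else"
    }
}

// The calls as written in the source: the condition is a literal constant.
int caller(int x) {
    return scale(true, x) + scale(false, x);
}

// What the optimizer can effectively reduce caller() to once both calls
// are inlined and the constant conditions are folded: no branch is left.
int caller_after_inlining(int x) {
    return (x * 2) + (x + 1);
}
```

Both functions return the same value for every x; the difference is only in how much work is needed to get there when scale is not inlined.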

However, it is not even remotely limited to that. In general, the optimization opportunities enabled by inlining are significantly more far-reaching. For another example, when the function body is inlined into the specific caller's context, the compiler will generally be able to propagate the known aliasing-related relationships present in the calling code into the inlined function code, thus making it possible to optimize the function's code better.
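A small sketch of the aliasing point, with hypothetical names (add_twice, no_alias, with_alias). Compiled on its own, add_twice has to assume that dst and src might point to the same int; inlined into a caller that passes two distinct locals, the compiler can see that they don't.

```cpp
// Hypothetical helper: adds *src to *dst twice.
inline void add_twice(int* dst, const int* src) {
    *dst += *src;
    // Out of line, *src has to be reloaded here, because the first store
    // through dst may have changed it (dst and src may alias).
    *dst += *src;
}

int no_alias(int base, int step) {
    int total = base;
    const int inc = step;
    // After inlining, the compiler sees two distinct locals, so it is free
    // to treat this as base + 2 * step without reloading anything.
    add_twice(&total, &inc);
    return total;
}

int with_alias(int v) {
    // Here the pointers really do alias: v becomes 2v, then 4v.
    add_twice(&v, &v);
    return v;
}
```

The with_alias case is the reason the out-of-line version cannot simply assume the two pointers are unrelated.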

Again, the possible examples are numerous, all of them stemming from the basic fact that inlined calls are immersed into the specific caller's context, thus enabling various inter-context optimizations, which would not be possible with non-inlined calls. With inlining you basically get many individual versions of your original function, each tailored and optimized individually for its specific caller context. The price of that is, obviously, the potential danger of code bloat, but if used correctly, it can provide noticeable performance benefits.

"A few pushes and a jump to call a function, is there really that much overhead?"

It depends on the function.

If the body of the function is just one machine code instruction, the call and return overhead can be many hundreds of percent. Say the call takes 6 times as long as the body alone: that is 500% overhead. Then if your program consists of nothing but a gazillion calls to that function, not inlining it has increased the running time by 500%.

However, in the other direction inlining can have a detrimental effect, e.g. because code that would fit in one page of memory without inlining no longer does.

So the answer, as always when it comes to optimization, is first of all: MEASURE.
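One rough way to measure is sketched below. The NOINLINE macro, the Timing struct and the function names are all made up for this example; __declspec(noinline) and __attribute__((noinline)) are compiler-specific extensions (MSVC and GCC/Clang respectively), not standard C++. Treat the timings as a hint only: with optimization enabled, the inlined loop may be simplified far beyond just removing the call.

```cpp
#include <chrono>
#include <cstdint>

// Compiler-specific way to forbid inlining (an extension, not standard C++).
#if defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE __attribute__((noinline))
#endif

inline std::uint64_t add_inline(std::uint64_t a, std::uint64_t b) { return a + b; }
NOINLINE std::uint64_t add_call(std::uint64_t a, std::uint64_t b) { return a + b; }

struct Timing {
    std::uint64_t result;  // returned so the loop cannot be discarded entirely
    double milliseconds;
};

// Runs acc = f(acc, i) for i in [0, n) and reports the elapsed wall time.
template <typename F>
Timing time_loop(F f, std::uint64_t n) {
    const auto start = std::chrono::steady_clock::now();
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        acc = f(acc, i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return {acc, std::chrono::duration<double, std::milli>(stop - start).count()};
}

Timing run_inline(std::uint64_t n) {
    return time_loop([](std::uint64_t a, std::uint64_t b) { return add_inline(a, b); }, n);
}

Timing run_call(std::uint64_t n) {
    return time_loop([](std::uint64_t a, std::uint64_t b) { return add_call(a, b); }, n);
}
```

Call run_inline and run_call with a large n (hundreds of millions, say), compare the milliseconds fields, and repeat the runs a few times before drawing conclusions.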

There is no calling and stack activity, which certainly saves a few CPU cycles. In modern CPUs, code locality also matters: doing a call can flush the instruction pipeline and force the CPU to wait while the target code is fetched from memory. This matters a lot in tight loops, since main memory is a lot slower than modern CPUs.

However, don't worry about inlining if your code is only being called a few times in your application. Worry, a lot, if it's being called millions of times while the user waits for answers!

The classic candidate for inlining is an accessor, like std::vector<T>::size().

With inlining enabled this is just the fetching of a variable from memory, likely a single instruction on any architecture. The "few pushes and a jump" (plus the return) is easily multiple times as much.

Add to that the fact that the more code is visible at once to an optimizer, the better it can do its work. With lots of inlining, it sees lots of code at once. That means that it might be able to keep the value in a CPU register, and completely spare the costly trip to memory. Now we might be talking about a difference of several orders of magnitude.
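As an illustration of the accessor case, here is a sketch with a made-up IntBuffer class (it is not how std::vector is actually implemented). In the source, size() and at() are called on every iteration; once they are inlined, the optimizer can see that the count never changes inside the loop and may keep it in a register.

```cpp
#include <cstddef>

// Hypothetical minimal read-only view over an int array.
class IntBuffer {
public:
    IntBuffer(const int* data, std::size_t count) : data_(data), count_(count) {}

    // Defined inside the class, so these are implicitly inline.
    std::size_t size() const { return count_; }
    int at(std::size_t i) const { return data_[i]; }

private:
    const int* data_;
    std::size_t count_;
};

long sum(const IntBuffer& buf) {
    long total = 0;
    // Without inlining: two calls per iteration.
    // With inlining: a plain loop over data_ with count_ as the bound.
    for (std::size_t i = 0; i < buf.size(); ++i) {
        total += buf.at(i);
    }
    return total;
}
```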

And then there's template meta-programming. Sometimes this results in calling many small functions recursively, just to fetch a single value at the end of the recursion. (Think of fetching the value of the first entry of a specific type in a tuple with dozens of objects.) With inlining enabled, the optimizer can directly access that value (which, remember, might be in a register), collapsing dozens of function calls into accessing a single value in a CPU register. This can turn a terrible performance hog into a nice and speedy program.
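The tuple case could look roughly like the sketch below. FirstOfType and first_of_type are hypothetical helpers written for this example (the standard library has its own std::get<T> since C++14); the point is only that each step of the recursion is a tiny function that simply forwards to the next one.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Walks the tuple's element types from index I until one matches T.
template <typename T, std::size_t I, typename Tuple,
          bool Match = std::is_same<T, typename std::tuple_element<I, Tuple>::type>::value>
struct FirstOfType;

// Element I has type T: the recursion ends here.
template <typename T, std::size_t I, typename Tuple>
struct FirstOfType<T, I, Tuple, true> {
    static const T& get(const Tuple& t) { return std::get<I>(t); }
};

// Element I has some other type: forward to the next index.
template <typename T, std::size_t I, typename Tuple>
struct FirstOfType<T, I, Tuple, false> {
    static const T& get(const Tuple& t) { return FirstOfType<T, I + 1, Tuple>::get(t); }
};

// Returns the first element of type T (compile error if there is none).
template <typename T, typename... Ts>
const T& first_of_type(const std::tuple<Ts...>& t) {
    return FirstOfType<T, 0, std::tuple<Ts...>>::get(t);
}
```

For a std::tuple<int, double, char, double> t, first_of_type<char>(t) goes through three nested get calls in the source; with inlining they can all collapse into a single access to the element.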


Hiding state as private data in objects (encapsulation) has its costs. Inlining was part of C++ from the very beginning in order to minimize these costs of abstraction. Back then, compilers were significantly worse at detecting good candidates for inlining (and rejecting bad ones) than they are today, so manual inlining resulted in considerable speed gains.
Nowadays compilers are reputed to be much more clever than we are about inlining. Compilers are able to inline functions automatically, or to decline to inline functions that users marked as inline, even though they could. Some say that inlining should be left to the compiler completely and we shouldn't even bother marking functions as inline. However, I have yet to see a comprehensive study showing whether manually doing so is still worth it or not. So for the time being, I'll keep doing it myself, and let the compiler override that if it thinks it can do better.

With inlining,

int sum(const int &a,const int &b)
{
     return a + b;
}
int a = sum(b,c);

is equivalent to

int a = b + c;

No jump, no overhead.

Consider a simple function like:

int SimpleFunc (const int X, const int Y)
{
    return (X + 3 * Y); 
}    

int main(int argc, char* argv[])
{
    int Test = SimpleFunc(11, 12);
    return 0;
}

This is converted to the following code (MSVC++ v6, debug):

10:   int SimpleFunc (const int X, const int Y)
11:   {
00401020   push        ebp
00401021   mov         ebp,esp
00401023   sub         esp,40h
00401026   push        ebx
00401027   push        esi
00401028   push        edi
00401029   lea         edi,[ebp-40h]
0040102C   mov         ecx,10h
00401031   mov         eax,0CCCCCCCCh
00401036   rep stos    dword ptr [edi]

12:       return (X + 3 * Y);
00401038   mov         eax,dword ptr [ebp+0Ch]
0040103B   imul        eax,eax,3
0040103E   mov         ecx,dword ptr [ebp+8]
00401041   add         eax,ecx

13:   }
00401043   pop         edi
00401044   pop         esi
00401045   pop         ebx
00401046   mov         esp,ebp
00401048   pop         ebp
00401049   ret

You can see that there are just 4 instructions for the function body but 16 instructions of pure function overhead (the prologue and epilogue shown above), not including another 3 for calling the function itself. If all instructions took the same time (they don't), then 80% of this code is function overhead.

For a trivial function like this there is a good chance that the function overhead code will take at least as long to run as the main function body itself. When you have trivial functions that are called from a deeply nested loop body millions or billions of times, the function call overhead becomes significant.

As always, the key is profiling/measuring to determine whether or not inlining a specific function yields any net performance gains. For more "complex" functions that are not called "often" the gain from inlining may be immeasurably small.

There are multiple reasons for inlining to be faster, only one of which is obvious:

  • No jump instructions.
  • Better locality, resulting in better cache utilization.
  • More chances for the compiler's optimizer to make optimizations, leaving values in registers for example.

The cache utilization can also work against you - if inlining makes the code larger, there's more possibility of cache misses. That's a much less likely case though.

A typical example of where it makes a big difference is std::sort, which calls its comparison function O(N log N) times.

Try creating a vector of a large size and calling std::sort first with an inline function and second with a non-inlined function, and measure the performance.

This, by the way, is where std::sort in C++ is faster than qsort in C, which requires a function pointer.
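A sketch of the two variants side by side (the helper names compare_ints, Less, sorted_with_qsort and sorted_with_std_sort are made up here). qsort only ever sees a function pointer, so the comparison is normally an indirect call per comparison; std::sort is instantiated for the comparator's type, so a small operator() can be inlined into the sorting loop.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort-style comparator: reached through a function pointer.
int compare_ints(const void* a, const void* b) {
    const int lhs = *static_cast<const int*>(a);
    const int rhs = *static_cast<const int*>(b);
    return (lhs > rhs) - (lhs < rhs);  // negative, zero or positive
}

std::vector<int> sorted_with_qsort(std::vector<int> v) {
    if (!v.empty()) {
        std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
    }
    return v;
}

// Function object: its type is part of the std::sort instantiation,
// which is what makes inlining the comparison straightforward.
struct Less {
    bool operator()(int a, int b) const { return a < b; }
};

std::vector<int> sorted_with_std_sort(std::vector<int> v) {
    std::sort(v.begin(), v.end(), Less());
    return v;
}
```

Both functions return the same sorted vector; to see the performance difference, time them on a vector with a few million elements.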

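Another potential side effect of the jump is that you might trigger a page fault, either because the code is being loaded into memory for the first time, or because it is used so infrequently that it has since been paged out of memory.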

(and worth the bloat of having it inlined)

It is not always the case that in-lining results in larger code. For example, a simple data access function such as:

int getData()
{
   return data ;
}

will result in significantly more instruction cycles as a function call than as an in-line, and such functions are best suited to in-lining.

If the function body contains a significant amount of code, the function call overhead will indeed be insignificant, and if it is called from a number of locations, in-lining may indeed result in code bloat - although your compiler is just as likely to simply ignore the inline directive in such cases.

You should also consider the frequency of calling; even for a large-ish code body, if the function is called frequently from one location, the saving may in some cases be worthwhile. It comes down to the ratio of call-overhead to code body size, and the frequency of use.

Of course you could just leave it up to your compiler to decide. I only ever explicitly in-line functions that consist of a single statement not involving a further function call, and that is more for speed of development of class methods than for performance.

Andrey's answer already gives you a very comprehensive explanation. But just to add one point that he missed, inlining can also be extremely valuable on very short functions.

If a function body consists of just a few instructions, then the prologue/epilogue code (the push/pop/call instructions, basically) might actually be more expensive than the function body itself. If you call such a function often (say, from a tight loop), then unless the function is inlined, you can end up spending the majority of your CPU time on the function call, rather than the actual contents of the function.

What matters isn't really the cost of a function call in absolute terms (where it might take just 5 clock cycles or something like that), but how long it takes relative to how often the function is called. If the function is so short that it can be called every 10 clock cycles, then spending 5 cycles for every call on "unnecessary" push/pop instructions is pretty bad.

Because there's no call. The function code is just copied into the caller.

Inlining a function is a suggestion to the compiler to replace the function call with the function's definition. If it's replaced, then there will be no function-call stack operations [push, pop]. But it's not always guaranteed. :)

--Cheers

Optimizing compilers apply a set of heuristics to determine whether or not inlining will be beneficial.

Sometimes the gain from the lack of a function call will outweigh the potential cost of the extra code, sometimes not.

Inlining makes a big difference when a function is called many times.

Because no jump is performed.
