
Performance comparison between variable declaration within or outside nested loop

Case I

while (!channels_.empty())
{
    for (auto it = channels_.begin(); it != channels_.end(); ++it)
    {
        time_t stop_time;
        if (it->second->active_playlist()->status_stop_time(it->second->on_air_row(), stop_time))
        {

        }
    }
}

Case II

while (!channels_.empty())
{
    time_t stop_time;
    for (auto it = channels_.begin(); it != channels_.end(); ++it)
    {
        if (it->second->active_playlist()->status_stop_time(it->second->on_air_row(), stop_time))
        {

        }
    }
}

Variable stop_time is declared within the nested loop in Case I and outside it in Case II. Which one is better in terms of performance? Why?

The standard has very little to say about performance, but there is one clear rule: optimizations must not change observable side effects.
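To illustrate that rule, here is a minimal sketch in which the declaration's position does change what the program does. The names `Tracer`, `g_constructed`, `count_inner` and `count_outer` are made up for this example and are not from the question. The constructor bumps a counter that the rest of the program reads, so the compiler cannot quietly merge the constructions the way it can with a plain `time_t`.

```cpp
// Hypothetical type whose constructor has an effect the rest of the
// program can see (it bumps a counter), unlike the question's time_t.
static int g_constructed = 0;

struct Tracer
{
  Tracer() { ++g_constructed; }
};

// Declared in the inner loop: one construction per inner iteration.
int count_inner(int outer, int inner)
{
  g_constructed = 0;
  for (int i = 0; i < outer; ++i)
  {
    for (int j = 0; j < inner; ++j)
    {
      Tracer t;
      (void)t;  // silence unused-variable warnings
    }
  }
  return g_constructed;
}

// Declared in the outer loop: one construction per outer iteration.
int count_outer(int outer, int inner)
{
  g_constructed = 0;
  for (int i = 0; i < outer; ++i)
  {
    Tracer t;
    (void)t;
    for (int j = 0; j < inner; ++j)
    {
    }
  }
  return g_constructed;
}
```

With 3 outer and 4 inner iterations, `count_inner` reports 12 constructions and `count_outer` reports 3. For a type like `time_t` there is no such effect to preserve, which is why the question's two cases can compile to the same code.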

The standard does not describe stack or heap usage, so it would be perfectly legitimate for a compiler to allocate space for a variable on the stack at any point prior to its use.

But what is optimal depends on various factors. On the most common architectures, it makes most sense to do all your stack pointer adjustments in just two places: entry and exit. On x86 there is no cost difference between changing the stack pointer by 640 and changing it by 8.

Further, if the compiler can be sure the value doesn't change, then the optimizer may be able to hoist the assignment out of the loop too.
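As a rough sketch of what hoisting means here (the function names and the `a * b` computation are invented for illustration, not taken from the question): the first function computes a value inside the loop that does not depend on the loop variable, and the second is the form an optimizer is allowed to turn it into, because both return the same result.

```cpp
#include <vector>

// Hypothetical example: `a * b` does not depend on the loop variable,
// so it is a loop invariant.
int sum_scaled_naive(const std::vector<int>& values, int a, int b)
{
  int sum = 0;
  for (int v : values)
  {
    int factor = a * b;  // as written, recomputed on every iteration
    sum += v * factor;
  }
  return sum;
}

// The same computation with the invariant hoisted out by hand.
int sum_scaled_hoisted(const std::vector<int>& values, int a, int b)
{
  const int factor = a * b;  // computed once, before the loop
  int sum = 0;
  for (int v : values)
  {
    sum += v * factor;
  }
  return sum;
}
```

Whether a given compiler actually performs this transformation depends on the optimization level and on whether it can prove the value is invariant, so treat the second function as the likely outcome and not a guaranteed one.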

In practice, mainstream compilers (GCC, Clang, MSVC) on x86 and ARM-based platforms will aggregate stack allocations into a single adjustment up and a single adjustment down, and will hoist loop invariants given sufficient optimizer settings/arguments.

If in doubt, inspect the assembly or benchmark.

We can very quickly demonstrate this with godbolt:

#include <vector>

struct Channel
{
  void test(int&);
};

std::vector<Channel> channels;

void test1()
{
  while (!channels.empty())
  {
    for (auto&& channel : channels)
    {
      int stop_time;
      channel.test(stop_time);
    }
  }
}

void test2()
{
  while (!channels.empty())
  {
    int stop_time;
    for (auto&& channel : channels)
    {
      channel.test(stop_time);
    }
  }
}

void test3()
{
  int stop_time;
  while (!channels.empty())
  {
    for (auto&& channel : channels)
    {
      channel.test(stop_time);
    }
  }
}

With GCC 5.1 and -O3 this generates three identical pieces of assembly:

    test1():
            pushq   %rbp
            pushq   %rbx
            subq    $24, %rsp
    .L8:
            movq    channels+8(%rip), %rbp
            movq    channels(%rip), %rbx
            cmpq    %rbp, %rbx
            je      .L10
    .L7:
            leaq    12(%rsp), %rsi
            movq    %rbx, %rdi
            addq    $1, %rbx
            call    Channel::test(int&)
            cmpq    %rbx, %rbp
            jne     .L7
            jmp     .L8
    .L10:
            addq    $24, %rsp
            popq    %rbx
            popq    %rbp
            ret

    test2():
            pushq   %rbp
            pushq   %rbx
            subq    $24, %rsp
    .L22:
            movq    channels+8(%rip), %rbp
            movq    channels(%rip), %rbx
            cmpq    %rbp, %rbx
            je      .L20
    .L14:
            leaq    12(%rsp), %rsi
            movq    %rbx, %rdi
            addq    $1, %rbx
            call    Channel::test(int&)
            cmpq    %rbx, %rbp
            jne     .L14
            jmp     .L22
    .L20:
            addq    $24, %rsp
            popq    %rbx
            popq    %rbp
            ret

    test3():
            pushq   %rbp
            pushq   %rbx
            subq    $24, %rsp
    .L26:
            movq    channels+8(%rip), %rbp
            movq    channels(%rip), %rbx
            cmpq    %rbp, %rbx
            je      .L28
    .L25:
            leaq    12(%rsp), %rsi
            movq    %rbx, %rdi
            addq    $1, %rbx
            call    Channel::test(int&)
            cmpq    %rbx, %rbp
            jne     .L25
            jmp     .L26
    .L28:
            addq    $24, %rsp
            popq    %rbx
            popq    %rbp
            ret

A general answer for many performance-related questions is "measure it and see for yourself". If it's easy, just do it. In this case, it is easy.

Sometimes it's good to look at the assembly code: if the code is identical in your two cases (I guess it is), you don't even need to measure its performance.
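For the "measure it" route, a minimal timing sketch could look like the following. Everything in it is an assumption made for illustration: `Channel` is a stand-in with a trivial `test` body, `run_inner` and `run_outer` mirror Case I and Case II with a fixed number of rounds in place of the `while` loop, and `time_us` is a small helper built on `std::chrono::steady_clock`.

```cpp
#include <chrono>
#include <vector>

// Hypothetical stand-in for the question's channel type.
struct Channel
{
  int id;
  // Writes through the out-parameter, like status_stop_time does.
  void test(int& stop_time) const { stop_time = id * 2; }
};

// Case I: variable declared inside the inner loop.
long long run_inner(const std::vector<Channel>& channels, int rounds)
{
  long long total = 0;
  for (int r = 0; r < rounds; ++r)
  {
    for (const auto& channel : channels)
    {
      int stop_time;
      channel.test(stop_time);
      total += stop_time;
    }
  }
  return total;
}

// Case II: variable declared outside the inner loop.
long long run_outer(const std::vector<Channel>& channels, int rounds)
{
  long long total = 0;
  for (int r = 0; r < rounds; ++r)
  {
    int stop_time;
    for (const auto& channel : channels)
    {
      channel.test(stop_time);
      total += stop_time;
    }
  }
  return total;
}

// Runs f once and returns the elapsed wall-clock time in microseconds.
template <typename F>
long long time_us(F&& f)
{
  const auto start = std::chrono::steady_clock::now();
  f();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

To use it, call something like `time_us([&] { sink = run_inner(channels, rounds); })` and the same for `run_outer`, storing the result in a `volatile long long sink` so the work is not optimized away. Build with optimizations on and repeat the runs several times; small differences between the two are most likely noise.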

