
Loop unrolling - G++ vs. Clang++

I was wondering whether it is worth helping the compiler unroll a simple loop by using templates. I prepared the following test:

#include <cstdlib>
#include <utility>
#include <array>

class TNode
{
public:
  void Assemble();
  void Assemble(TNode const *);
};

class T
{
private:
  std::array<TNode *,3u> NodePtr;

private:
  template <std::size_t,std::size_t>
  void foo() const;

  template <std::size_t... ij>
  void foo(std::index_sequence<ij...>) const
    { (foo<ij%3u,ij/3u>(),...); }

public:
  void foo() const
    { return foo(std::make_index_sequence<3u*3u>{}); }

  void bar() const;
};

template <std::size_t i,std::size_t j>
inline void T::foo() const
{
if constexpr (i==j)
  NodePtr[i]->Assemble();
else
  NodePtr[i]->Assemble(NodePtr[j]);
}

inline void T::bar() const
{
for (std::size_t i= 0u; i<3u; ++i)
  for (std::size_t j= 0u; j<3u; ++j)
    if (i==j)
      NodePtr[i]->Assemble();
    else
      NodePtr[i]->Assemble(NodePtr[j]);
}

void foo()
{
T x;
x.foo();
}

void bar()
{
T x;
x.bar();
}

I first tried G++ with -O3 -funroll-loops enabled and got ( https://godbolt.org/z/_Wyvl8 ):

foo():
        push    r12
        push    rbp
        push    rbx
        sub     rsp, 32
        mov     r12, QWORD PTR [rsp]
        mov     rdi, r12
        call    TNode::Assemble()
        mov     rbp, QWORD PTR [rsp+8]
        mov     rsi, r12
        mov     rdi, rbp
        call    TNode::Assemble(TNode const*)
        mov     rbx, QWORD PTR [rsp+16]
        mov     rsi, r12
        mov     rdi, rbx
        call    TNode::Assemble(TNode const*)
        mov     rsi, rbp
        mov     rdi, r12
        call    TNode::Assemble(TNode const*)
        mov     rdi, rbp
        call    TNode::Assemble()
        mov     rsi, rbp
        mov     rdi, rbx
        call    TNode::Assemble(TNode const*)
        mov     rsi, rbx
        mov     rdi, r12
        call    TNode::Assemble(TNode const*)
        mov     rdi, rbp
        mov     rsi, rbx
        call    TNode::Assemble(TNode const*)
        add     rsp, 32
        mov     rdi, rbx
        pop     rbx
        pop     rbp
        pop     r12
        jmp     TNode::Assemble()
bar():
        push    r13
        push    r12
        push    rbp
        xor     ebp, ebp
        push    rbx
        sub     rsp, 40
.L9:
        mov     r13, QWORD PTR [rsp+rbp*8]
        xor     ebx, ebx
        lea     r12, [rbp+1]
.L5:
        cmp     rbp, rbx
        je      .L15
        mov     rsi, QWORD PTR [rsp+rbx*8]
        mov     rdi, r13
        add     rbx, 1
        call    TNode::Assemble(TNode const*)
        cmp     rbx, 3
        jne     .L5
        mov     rbp, r12
        cmp     r12, 3
        jne     .L9
.L16:
        add     rsp, 40
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        ret
.L15:
        mov     rdi, r13
        mov     rbx, r12
        call    TNode::Assemble()
        cmp     r12, 3
        jne     .L5
        mov     rbp, r12
        cmp     r12, 3
        jne     .L9
        jmp     .L16

I can't read assembly, but as far as I can tell the templated version (foo) is fully unrolled, while bar still has loops and branches.
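When comparing the two listings it helps to know which (i, j) pairs each version visits and in what order. The following is a minimal sketch, not part of the original test: it reuses the index arithmetic of T::foo (ij%3u, ij/3u) and the loop nesting of T::bar, but records the pairs instead of calling Assemble. The names Pair, TraceUnrolled and TraceLoops are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical alias for one visited (i, j) index pair.
using Pair = std::pair<std::size_t,std::size_t>;

// Template version: same fold expression and index arithmetic as T::foo,
// i.e. i = ij%3u and j = ij/3u.
template <std::size_t... ij>
std::vector<Pair> TraceUnrolled(std::index_sequence<ij...>)
{
  std::vector<Pair> out;
  (out.emplace_back(ij%3u, ij/3u), ...);
  return out;
}

inline std::vector<Pair> TraceUnrolled()
  { return TraceUnrolled(std::make_index_sequence<3u*3u>{}); }

// Loop version: same nesting as T::bar (i outer, j inner).
inline std::vector<Pair> TraceLoops()
{
  std::vector<Pair> out;
  for (std::size_t i = 0u; i < 3u; ++i)
    for (std::size_t j = 0u; j < 3u; ++j)
      out.emplace_back(i, j);
  return out;
}
```

Both functions visit the same nine pairs, but not in the same order: the second pair is (1, 0) in the template version and (0, 1) in the loop version. This matches the G++ listing for foo, where the second call is made on NodePtr[1] with NodePtr[0] as the argument. If the order of the Assemble calls matters, swapping the two template arguments (ij/3u, ij%3u) would presumably align the two versions.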

Then I tried Clang++ ( https://godbolt.org/z/VCNb65 ) and got a very different picture:

foo():                                # @foo()
        push    rax
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        pop     rax
        jmp     TNode::Assemble()    # TAILCALL
bar():                                # @bar()
        push    rax
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        pop     rax
        jmp     TNode::Assemble()    # TAILCALL

What happened here? How can the resulting assembly be so terse?

  1. NodePtr is not initialized, so reading it is undefined behavior (UB). The optimizer can therefore do whatever it wants: here it omits the assignments to the registers esi/rsi, which pass the argument to TNode::Assemble(TNode const*), and to edi/rdi, which hold the object pointer (this). As a result, you see only a bunch of call instructions. Try to value-initialize x (this will zero-initialize NodePtr),

     T x{}; 

    and you'll get much more meaningful assembly.

  2. Clang seems to be better at loop unrolling. See, e.g., this answer. It is up to you to decide whether loops are worth unrolling. For small loops, they probably are. But you should measure.
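To illustrate point 1, here is a minimal sketch of what value-initialization does for a class shaped like T. Holder, CountNull and NullsAfterValueInit are hypothetical names used only here; Holder mirrors T in having a private std::array of pointers and no user-provided constructor.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the question's class T.
class Holder
{
private:
  std::array<int *,3u> Ptr;

public:
  // Counts the null pointers. Only meaningful once Ptr has been
  // initialized; reading it after `Holder h;` would be UB.
  std::size_t CountNull() const
  {
    std::size_t n = 0u;
    for (int *p : Ptr)
      if (p == nullptr)
        ++n;
    return n;
  }
};

// `Holder h{};` value-initializes h. Because Holder has no user-provided
// constructor, the object is zero-initialized first, so every pointer
// in Ptr is null.
inline std::size_t NullsAfterValueInit()
{
  Holder h{};
  return h.CountNull();
}
```

With `Holder h{};` all three pointers are null, so the compiler can no longer treat their values as indeterminate. In the question's test this only makes the generated assembly easier to compare; a real program must still point NodePtr at valid TNode objects before calling Assemble.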

