Loop unrolling - G++ vs. Clang++

Question

I was wondering whether it is worth helping the compiler unroll a simple loop by using templates. I prepared the following test:

#include <cstdlib>
#include <utility>
#include <array>

class TNode
{
public:
  void Assemble();
  void Assemble(TNode const *);
};

class T
{
private:
  std::array<TNode *,3u> NodePtr;

private:
  template <std::size_t,std::size_t>
  void foo() const;

  template <std::size_t... ij>
  void foo(std::index_sequence<ij...>) const
    { (foo<ij%3u,ij/3u>(),...); }

public:
  void foo() const
    { return foo(std::make_index_sequence<3u*3u>{}); }

  void bar() const;
};

template <std::size_t i,std::size_t j>
inline void T::foo() const
{
if constexpr (i==j)
  NodePtr[i]->Assemble();
else
  NodePtr[i]->Assemble(NodePtr[j]);
}

inline void T::bar() const
{
for (std::size_t i= 0u; i<3u; ++i)
  for (std::size_t j= 0u; j<3u; ++j)
    if (i==j)
      NodePtr[i]->Assemble();
    else
      NodePtr[i]->Assemble(NodePtr[j]);
}

void foo()
{
T x;
x.foo();
}

void bar()
{
T x;
x.bar();
}

I first tried G++ with -O3 -funroll-loops enabled and got ( https://godbolt.org/z/_Wyvl8 ):

foo():
        push    r12
        push    rbp
        push    rbx
        sub     rsp, 32
        mov     r12, QWORD PTR [rsp]
        mov     rdi, r12
        call    TNode::Assemble()
        mov     rbp, QWORD PTR [rsp+8]
        mov     rsi, r12
        mov     rdi, rbp
        call    TNode::Assemble(TNode const*)
        mov     rbx, QWORD PTR [rsp+16]
        mov     rsi, r12
        mov     rdi, rbx
        call    TNode::Assemble(TNode const*)
        mov     rsi, rbp
        mov     rdi, r12
        call    TNode::Assemble(TNode const*)
        mov     rdi, rbp
        call    TNode::Assemble()
        mov     rsi, rbp
        mov     rdi, rbx
        call    TNode::Assemble(TNode const*)
        mov     rsi, rbx
        mov     rdi, r12
        call    TNode::Assemble(TNode const*)
        mov     rdi, rbp
        mov     rsi, rbx
        call    TNode::Assemble(TNode const*)
        add     rsp, 32
        mov     rdi, rbx
        pop     rbx
        pop     rbp
        pop     r12
        jmp     TNode::Assemble()
bar():
        push    r13
        push    r12
        push    rbp
        xor     ebp, ebp
        push    rbx
        sub     rsp, 40
.L9:
        mov     r13, QWORD PTR [rsp+rbp*8]
        xor     ebx, ebx
        lea     r12, [rbp+1]
.L5:
        cmp     rbp, rbx
        je      .L15
        mov     rsi, QWORD PTR [rsp+rbx*8]
        mov     rdi, r13
        add     rbx, 1
        call    TNode::Assemble(TNode const*)
        cmp     rbx, 3
        jne     .L5
        mov     rbp, r12
        cmp     r12, 3
        jne     .L9
.L16:
        add     rsp, 40
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        ret
.L15:
        mov     rdi, r13
        mov     rbx, r12
        call    TNode::Assemble()
        cmp     r12, 3
        jne     .L5
        mov     rbp, r12
        cmp     r12, 3
        jne     .L9
        jmp     .L16

I can't really read assembly, but as far as I can tell the templated version is fully unrolled, while bar still has loops and branches.

Then I tried Clang++ ( https://godbolt.org/z/VCNb65 ) and got a very different picture:

foo():                                # @foo()
        push    rax
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        pop     rax
        jmp     TNode::Assemble()    # TAILCALL
bar():                                # @bar()
        push    rax
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble()
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        call    TNode::Assemble(TNode const*)
        pop     rax
        jmp     TNode::Assemble()    # TAILCALL

What happened here? How can the resulting assembly be so terse?

Answer 1

NodePtr is not initialized, so reading it is undefined behavior (UB). The optimizer can therefore do whatever it wants: here it decides to omit the assignments to the register esi/rsi, which is used to pass the argument to TNode::Assemble(TNode const*), and to edi/rdi, which holds the object pointer (this). As a result, you see only a bunch of call instructions. Try to value-initialize x (this will zero-initialize NodePtr),
```
T x{};
```
and you'll get much more meaningful assembly.
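To make the effect of the empty braces concrete, here is a minimal sketch. `Node`, `Holder`, `null_count` and `null_count_after_value_init` are hypothetical stand-ins invented for this illustration (they are not part of the question's code); the only assumption carried over is a class with a private `std::array` of pointers and no user-provided constructor.

```cpp
#include <array>
#include <cstddef>

struct Node {};  // hypothetical stand-in for TNode

class Holder     // hypothetical stand-in for T
{
private:
  std::array<Node *, 3u> ptrs;  // no initializer, as in the question

public:
  // Counts null entries; only meaningful once ptrs has been initialized.
  std::size_t null_count() const
  {
    std::size_t n = 0u;
    for (Node *p : ptrs)
      if (p == nullptr) ++n;
    return n;
  }
};

// Value-initialization: Holder has no user-provided constructor, so the
// empty braces zero-initialize ptrs before the object is used.
inline std::size_t null_count_after_value_init()
{
  Holder h{};
  return h.null_count();  // all three pointers are null here
}

// By contrast, `Holder h;` would leave ptrs indeterminate, and reading it
// (for example via h.null_count()) would be undefined behavior.
```

Note that the pointers are null after value-initialization. That is enough to make the generated code readable for inspection, but actually executing the Assemble calls would still require NodePtr to point at real objects.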
Clang seems to be better at loop unrolling. See, e.g., this answer. It is up to you to decide whether loops are worth unrolling. For small loops, they probably are. But you should measure.
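As a starting point for measuring, the following is a rough sketch of a fully defined variant of the test, in which the pointers refer to real objects. Everything named here (`Node`, `Element`, `step`, `unrolled`, `looped`, `run`, `time_ns`) is hypothetical and chosen for illustration; `Node` merely logs each call as `10*i + j` instead of doing real work. One deliberate difference from the question: the fold expression uses `ij / 3u` for `i` and `ij % 3u` for `j`, so that the unrolled version makes its calls in the same order as the nested loops (the question's `foo` uses `ij % 3u` for `i`, which visits the pairs in a different order than `bar`).

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for TNode: logs each call as 10*i + j.
struct Node
{
  int id;
  std::vector<int> *log;  // shared call log, owned by the caller

  void Assemble() { log->push_back(10 * id + id); }
  void Assemble(Node const *other) { log->push_back(10 * id + other->id); }
};

class Element  // hypothetical stand-in for T, with NodePtr initialized
{
  std::array<Node *, 3u> NodePtr;

  template <std::size_t i, std::size_t j>
  void step() const
  {
    if constexpr (i == j)
      NodePtr[i]->Assemble();
    else
      NodePtr[i]->Assemble(NodePtr[j]);
  }

  // i = ij / 3, j = ij % 3: same visiting order as the loops in looped().
  template <std::size_t... ij>
  void unrolled(std::index_sequence<ij...>) const
  { (step<ij / 3u, ij % 3u>(), ...); }

public:
  explicit Element(std::array<Node *, 3u> const &nodes) : NodePtr(nodes) {}

  void unrolled() const { unrolled(std::make_index_sequence<3u * 3u>{}); }

  void looped() const
  {
    for (std::size_t i = 0u; i < 3u; ++i)
      for (std::size_t j = 0u; j < 3u; ++j)
        if (i == j)
          NodePtr[i]->Assemble();
        else
          NodePtr[i]->Assemble(NodePtr[j]);
  }
};

// Runs one variant on three fresh nodes and returns the recorded call log.
inline std::vector<int> run(bool use_unrolled)
{
  std::vector<int> log;
  Node a{0, &log}, b{1, &log}, c{2, &log};
  std::array<Node *, 3u> nodes{&a, &b, &c};
  Element e(nodes);
  if (use_unrolled) e.unrolled(); else e.looped();
  return log;
}

// Wall-clock time in nanoseconds for `reps` calls of f. A rough sketch only:
// a real benchmark needs warm-up, repeated runs and a look at the assembly.
template <class F>
long long time_ns(F &&f, int reps)
{
  auto const t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) f();
  auto const t1 = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

Comparing `run(true)` with `run(false)` checks that both variants make the same nine calls in the same order; `time_ns` can then be handed a lambda that calls `unrolled()` or `looped()` on an `Element` whose nodes do the real work. Because the logging in `Node` would dominate any timing, replace it with the actual `Assemble` bodies before drawing conclusions.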
