Loop unrolling - G++ vs. Clang++
I was wondering whether it is worth helping the compiler unroll a simple loop by using templates. I prepared the following test:
#include <cstdlib>
#include <utility>
#include <array>

class TNode
{
public:
    void Assemble();
    void Assemble(TNode const *);
};

class T
{
private:
    std::array<TNode *,3u> NodePtr;
private:
    template <std::size_t,std::size_t>
    void foo() const;
    template <std::size_t... ij>
    void foo(std::index_sequence<ij...>) const
    { (foo<ij%3u,ij/3u>(),...); }
public:
    void foo() const
    { return foo(std::make_index_sequence<3u*3u>{}); }
    void bar() const;
};

template <std::size_t i,std::size_t j>
inline void T::foo() const
{
    if constexpr (i==j)
        NodePtr[i]->Assemble();
    else
        NodePtr[i]->Assemble(NodePtr[j]);
}

inline void T::bar() const
{
    for (std::size_t i= 0u; i<3u; ++i)
        for (std::size_t j= 0u; j<3u; ++j)
            if (i==j)
                NodePtr[i]->Assemble();
            else
                NodePtr[i]->Assemble(NodePtr[j]);
}

void foo()
{
    T x;
    x.foo();
}

void bar()
{
    T x;
    x.bar();
}
I first tried G++ with -O3 -funroll-loops enabled and got ( https://godbolt.org/z/_Wyvl8 ):
foo():
push r12
push rbp
push rbx
sub rsp, 32
mov r12, QWORD PTR [rsp]
mov rdi, r12
call TNode::Assemble()
mov rbp, QWORD PTR [rsp+8]
mov rsi, r12
mov rdi, rbp
call TNode::Assemble(TNode const*)
mov rbx, QWORD PTR [rsp+16]
mov rsi, r12
mov rdi, rbx
call TNode::Assemble(TNode const*)
mov rsi, rbp
mov rdi, r12
call TNode::Assemble(TNode const*)
mov rdi, rbp
call TNode::Assemble()
mov rsi, rbp
mov rdi, rbx
call TNode::Assemble(TNode const*)
mov rsi, rbx
mov rdi, r12
call TNode::Assemble(TNode const*)
mov rdi, rbp
mov rsi, rbx
call TNode::Assemble(TNode const*)
add rsp, 32
mov rdi, rbx
pop rbx
pop rbp
pop r12
jmp TNode::Assemble()
bar():
push r13
push r12
push rbp
xor ebp, ebp
push rbx
sub rsp, 40
.L9:
mov r13, QWORD PTR [rsp+rbp*8]
xor ebx, ebx
lea r12, [rbp+1]
.L5:
cmp rbp, rbx
je .L15
mov rsi, QWORD PTR [rsp+rbx*8]
mov rdi, r13
add rbx, 1
call TNode::Assemble(TNode const*)
cmp rbx, 3
jne .L5
mov rbp, r12
cmp r12, 3
jne .L9
.L16:
add rsp, 40
pop rbx
pop rbp
pop r12
pop r13
ret
.L15:
mov rdi, r13
mov rbx, r12
call TNode::Assemble()
cmp r12, 3
jne .L5
mov rbp, r12
cmp r12, 3
jne .L9
jmp .L16
I can't read assembly, but as far as I can tell the templated version does unroll the loop, while bar still has loops and branches.
Then I tried Clang++ ( https://godbolt.org/z/VCNb65 ) and got a very different picture:
foo(): # @foo()
push rax
call TNode::Assemble()
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
call TNode::Assemble()
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
pop rax
jmp TNode::Assemble() # TAILCALL
bar(): # @bar()
push rax
call TNode::Assemble()
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
call TNode::Assemble()
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
call TNode::Assemble(TNode const*)
pop rax
jmp TNode::Assemble() # TAILCALL
What happened here? How can the resulting assembly be so terse?
NodePtr is not initialized, and using it is UB. So the optimizer can do whatever it wants: here it decides to omit the assignments to the register esi/rsi, which is used to pass the argument to TNode::Assemble(TNode const*), and to edi/rdi, which holds the object pointer (this). As a result, you see only a bunch of call instructions. Try to value-initialize x (this will zero-initialize NodePtr),
T x{};
and you'll get much more meaningful assembly. (The pointers are then null, so this is only useful for inspecting the generated code, not for running it.)
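If you want a test that can actually run, one option is to give T real nodes to point at. What follows is only a sketch under assumptions: the logging TNode, the Call alias, the T constructor, same_calls and run_both are all made up for illustration and are not part of the original test. It also makes visible that foo and bar issue the same nine calls but in a different order, because foo maps the flat index as (ij%3u, ij/3u) while bar iterates i in the outer loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One recorded call: {receiver id, argument id}; argument id is -1 for Assemble().
using Call = std::pair<int, int>;

// Hypothetical stand-in for TNode: it logs calls instead of assembling anything.
struct TNode {
    int id;
    std::vector<Call>* log;
    void Assemble() { log->push_back({id, -1}); }
    void Assemble(TNode const* other) { log->push_back({id, other->id}); }
};

class T {
    std::array<TNode*, 3u> NodePtr;

    template <std::size_t i, std::size_t j>
    void foo() const {
        if constexpr (i == j)
            NodePtr[i]->Assemble();
        else
            NodePtr[i]->Assemble(NodePtr[j]);
    }

    template <std::size_t... ij>
    void foo(std::index_sequence<ij...>) const {
        (foo<ij % 3u, ij / 3u>(), ...);
    }

public:
    // Assumed constructor: the original T has none, which is why x was left uninitialized.
    explicit T(std::array<TNode*, 3u> const& nodes) : NodePtr(nodes) {
        for (TNode* n : NodePtr) assert(n != nullptr);
    }

    void foo() const { foo(std::make_index_sequence<3u * 3u>{}); }

    void bar() const {
        for (std::size_t i = 0u; i < 3u; ++i)
            for (std::size_t j = 0u; j < 3u; ++j)
                if (i == j)
                    NodePtr[i]->Assemble();
                else
                    NodePtr[i]->Assemble(NodePtr[j]);
    }
};

// True if both logs contain the same calls, ignoring order.
inline bool same_calls(std::vector<Call> a, std::vector<Call> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}

// Runs foo() and bar() on the same three nodes and returns {foo log, bar log}.
inline std::pair<std::vector<Call>, std::vector<Call>> run_both() {
    std::vector<Call> foo_log, bar_log;
    TNode n0{0, &foo_log}, n1{1, &foo_log}, n2{2, &foo_log};
    std::array<TNode*, 3u> ptrs{&n0, &n1, &n2};
    T x(ptrs);
    x.foo();
    n0.log = n1.log = n2.log = &bar_log;  // redirect logging for the second run
    x.bar();
    return {foo_log, bar_log};
}
```

For looking at the generated code rather than running it, the same idea works with the original declarations only: have the free functions take a T const& parameter instead of creating a local T, so the compiler has to load the pointers from an object it knows nothing about.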
Clang seems to be better at loop unrolling. See, e.g., this answer. It is up to you to decide whether loops are worth unrolling. For small loops, they probably are. But you should measure.
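As a very rough idea of what "measure" could look like, here is a minimal timing sketch with std::chrono::steady_clock. Everything in it is an assumption for illustration: the workload (summing nine values), the names sum_loop, sum_unrolled and time_ns, and the sink parameter that keeps the results in use. An optimizer can still transform such a loop heavily, so treat the numbers with suspicion and prefer a proper benchmarking library for anything serious.

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical workload: a 3x3 table stored row by row.
using Mat = std::array<std::uint64_t, 9>;

// Sums the table with plain nested loops.
inline std::uint64_t sum_loop(Mat const& m) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            s += m[i * 3 + j];
    return s;
}

// Same sum, unrolled by hand.
inline std::uint64_t sum_unrolled(Mat const& m) {
    return m[0] + m[1] + m[2] + m[3] + m[4] + m[5] + m[6] + m[7] + m[8];
}

// Calls f(m) `reps` times and returns the elapsed time in nanoseconds.
// Each result is added to `sink` so the calls are not trivially unused.
template <class F>
long long time_ns(F f, Mat const& m, int reps, std::uint64_t& sink) {
    auto const t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        sink += f(m);
    auto const t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

Call time_ns once for each variant with the same reps and compare the results over several runs; a single run tells you very little.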