Why does the VS C++ 2017 compiler use SSE optimization only if the iterated pointers are not stored in a structure?
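The motivation is simple: I iterate over two uint64 arrays, compute the bitwise AND of their values, and store the result in a third array. However, after I generalized the solution to accept its input as a struct, the code slowed down considerably, and I found that the generated assembly was completely different. My question is why this happens and how I can prevent it. (See the notes at the end of the question for more details on the motivation.)
The minimized code looks like this: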
#include <cstdint>

using uint64 = std::uint64_t; // assumed 64-bit unsigned typedef; the original snippet does not show it

struct FAndExpressionIter
{
public:
    const uint64* LhsIter = nullptr;
    const uint64* RhsIter = nullptr;

    // Advance both source pointers together.
    inline void operator++()
    {
        ++LhsIter;
        ++RhsIter;
    }

    // Bitwise AND of the two current source values.
    inline uint64 operator*() const
    {
        return (*LhsIter & *RhsIter);
    }
};

void FooBitAndExpression(FAndExpressionIter SrcIter, uint64* ResultIter, const uint64* ResultEnd)
{
    while (ResultIter != ResultEnd)
    {
        *ResultIter = *SrcIter;
        ++SrcIter;
        ++ResultIter;
    }
}

void FooBitAndRaw(const uint64* LhsIter, const uint64* RhsIter, uint64* ResultIter, const uint64* ResultEnd)
{
    while (ResultIter != ResultEnd)
    {
        *ResultIter = (*LhsIter & *RhsIter);
        ++LhsIter;
        ++RhsIter;
        ++ResultIter;
    }
}
FooBitAndExpression is compiled to:
00007FFA37075910 mov qword ptr [rsp+8],rbx
7: while (ResultIter != ResultEnd)
00007FFA37075915 xor r10d,r10d
00007FFA37075918 mov rbx,r8
00007FFA3707591B sub rbx,rdx
00007FFA3707591E mov r9,rdx
00007FFA37075921 add rbx,7
00007FFA37075925 mov r11,rcx
00007FFA37075928 shr rbx,3
00007FFA3707592C cmp rdx,r8
00007FFA3707592F cmova rbx,r10
00007FFA37075933 test rbx,rbx
00007FFA37075936 je FooBitAndExpression+55h (07FFA37075965h)
00007FFA37075938 mov rax,qword ptr [rcx+8]
00007FFA3707593C nop dword ptr [rax]
8: {
9: *ResultIter = *SrcIter;
00007FFA37075940 mov rdx,qword ptr [r11]
10: ++SrcIter;
00007FFA37075943 lea rax,[rax+8]
11: ++ResultIter;
00007FFA37075947 inc r10
00007FFA3707594A lea r9,[r9+8]
8: {
9: *ResultIter = *SrcIter;
00007FFA3707594E mov rcx,qword ptr [rdx]
00007FFA37075951 and rcx,qword ptr [rax-8]
00007FFA37075955 mov qword ptr [r9-8],rcx
10: ++SrcIter;
00007FFA37075959 lea rcx,[rdx+8]
00007FFA3707595D mov qword ptr [r11],rcx
7: while (ResultIter != ResultEnd)
00007FFA37075960 cmp r10,rbx
00007FFA37075963 jne FooBitAndExpression+30h (07FFA37075940h)
12: }
13: }
00007FFA37075965 mov rbx,qword ptr [rsp+8]
12: }
13: }
00007FFA3707596A ret
whereas FooBitAndRaw is compiled to:
17: while (ResultIter != ResultEnd)
00007FFA37075980 xor r10d,r10d
00007FFA37075983 mov r11,r9
00007FFA37075986 sub r11,r8
00007FFA37075989 add r11,7
00007FFA3707598D shr r11,3
00007FFA37075991 cmp r8,r9
00007FFA37075994 cmova r11,r10
00007FFA37075998 test r11,r11
00007FFA3707599B je FooBitAndRaw+0F7h (07FFA37075A77h)
00007FFA370759A1 cmp r11,8
00007FFA370759A5 jb FooBitAndRaw+0D2h (07FFA37075A52h)
00007FFA370759AB lea rax,[rdx-8]
00007FFA370759AF lea rax,[rax+r11*8]
00007FFA370759B3 lea r9,[r8-8]
00007FFA370759B7 lea r9,[r9+r11*8]
00007FFA370759BB cmp r8,rax
00007FFA370759BE ja FooBitAndRaw+49h (07FFA370759C9h)
00007FFA370759C0 cmp r9,rdx
00007FFA370759C3 jae FooBitAndRaw+0D2h (07FFA37075A52h)
00007FFA370759C9 lea rax,[rcx-8]
00007FFA370759CD lea rax,[rax+r11*8]
00007FFA370759D1 cmp r8,rax
00007FFA370759D4 ja FooBitAndRaw+5Bh (07FFA370759DBh)
00007FFA370759D6 cmp r9,rcx
00007FFA370759D9 jae FooBitAndRaw+0D2h (07FFA37075A52h)
00007FFA370759DB mov rax,r11
00007FFA370759DE and rax,0FFFFFFFFFFFFFFF8h
00007FFA370759E2 nop dword ptr [rax]
00007FFA370759E6 nop word ptr [rax+rax]
18: {
19: *ResultIter = (*LhsIter & *RhsIter);
00007FFA370759F0 movdqu xmm0,xmmword ptr [rdx]
20: ++LhsIter;
21: ++RhsIter;
22: ++ResultIter;
00007FFA370759F4 add r10,8
00007FFA370759F8 movdqu xmm1,xmmword ptr [rcx]
00007FFA370759FC pand xmm1,xmm0
00007FFA37075A00 movdqu xmm0,xmmword ptr [rdx+10h]
00007FFA37075A05 movdqu xmmword ptr [r8],xmm1
00007FFA37075A0A movdqu xmm1,xmmword ptr [rcx+10h]
00007FFA37075A0F pand xmm1,xmm0
00007FFA37075A13 movdqu xmm0,xmmword ptr [rdx+20h]
00007FFA37075A18 movdqu xmmword ptr [r8+10h],xmm1
00007FFA37075A1E movdqu xmm1,xmmword ptr [rcx+20h]
00007FFA37075A23 pand xmm1,xmm0
00007FFA37075A27 movdqu xmm0,xmmword ptr [rdx+30h]
00007FFA37075A2C add rdx,40h
00007FFA37075A30 movdqu xmmword ptr [r8+20h],xmm1
00007FFA37075A36 movdqu xmm1,xmmword ptr [rcx+30h]
00007FFA37075A3B add rcx,40h
00007FFA37075A3F pand xmm1,xmm0
00007FFA37075A43 movdqu xmmword ptr [r8+30h],xmm1
00007FFA37075A49 add r8,40h
00007FFA37075A4D cmp r10,rax
00007FFA37075A50 jne FooBitAndRaw+70h (07FFA370759F0h)
17: while (ResultIter != ResultEnd)
00007FFA37075A52 cmp r10,r11
00007FFA37075A55 je FooBitAndRaw+0F7h (07FFA37075A77h)
00007FFA37075A57 sub rcx,rdx
17: while (ResultIter != ResultEnd)
00007FFA37075A5A sub r8,rdx
00007FFA37075A5D nop dword ptr [rax]
18: {
19: *ResultIter = (*LhsIter & *RhsIter);
00007FFA37075A60 mov rax,qword ptr [rcx+rdx]
20: ++LhsIter;
21: ++RhsIter;
22: ++ResultIter;
00007FFA37075A64 inc r10
00007FFA37075A67 and rax,qword ptr [rdx]
00007FFA37075A6A mov qword ptr [r8+rdx],rax
00007FFA37075A6E lea rdx,[rdx+8]
00007FFA37075A72 cmp r10,r11
00007FFA37075A75 jne FooBitAndRaw+0E0h (07FFA37075A60h)
23: }
24: }
00007FFA37075A77 ret
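Please note that my knowledge of assembly is very limited, and my conclusions are based on guesswork. If I understand the generated assembly correctly, the FooBitAndRaw function uses SSE instructions to loop over the data. When too little data remains, it falls back to iterating over single values.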
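To make that reading of the FooBitAndRaw listing concrete, here is a minimal hand-written sketch of the same idea using SSE2 intrinsics. The function name BitAndSse2Sketch and the Count parameter are hypothetical and not part of the question. The sketch processes two 64-bit values per 128-bit operation and then finishes the leftovers one value at a time. The compiler's version additionally unrolls the vector loop four times (eight values per iteration) and appears to check whether the result buffer overlaps either input before entering it; both details are left out here.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical hand-written counterpart of the vectorized loop.
void BitAndSse2Sketch(const std::uint64_t* Lhs, const std::uint64_t* Rhs,
                      std::uint64_t* Result, std::size_t Count)
{
    std::size_t i = 0;

    // Vector loop: two uint64 values (128 bits) per iteration.
    for (; i + 2 <= Count; i += 2)
    {
        // Unaligned 128-bit loads (movdqu in the listing).
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(Lhs + i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(Rhs + i));
        // 128-bit AND (pand), followed by an unaligned store (movdqu).
        _mm_storeu_si128(reinterpret_cast<__m128i*>(Result + i), _mm_and_si128(a, b));
    }

    // Scalar tail: handles the last value when Count is odd.
    for (; i < Count; ++i)
    {
        Result[i] = Lhs[i] & Rhs[i];
    }
}
```

This assumes, like the compiler's vector path, that Result does not overlap Lhs or Rhs.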
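The FooBitAndExpression function, however, does not do this and iterates over single values from the start. As far as I can tell, the two implementations are identical at the source level, apart from the two inlined method calls.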
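I have already found that the C# JIT does not use registers for the fields of a local struct variable used for iteration, which slows the application down considerably. Apparently, MS Visual C++ does use registers in a similar situation, but it leads me to a sub-question: is there a similar restriction that prevents SSE optimization when the variables are stored in a local struct?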
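One detail in the FooBitAndExpression listing bears on this sub-question: inside the loop, a pointer is loaded from [r11] and its incremented value is written back to [r11] on every iteration, so at least one of the struct's members seems to be kept in memory rather than in a register. A possible experiment, sketched below, is to read the struct's pointers into plain local variables once before the loop. The function name FooBitAndExpressionLocals is hypothetical, the uint64 typedef is assumed, and the struct is repeated without its operators to keep the block self-contained. Whether this makes VS 2017 vectorize the loop has not been verified here.

```cpp
#include <cstdint>

using uint64 = std::uint64_t; // assumed typedef; not shown in the question

struct FAndExpressionIter
{
    const uint64* LhsIter = nullptr;
    const uint64* RhsIter = nullptr;
};

// Hypothetical variant: copy the iterator members into locals up front, so
// the loop body only touches local pointers and the three data arrays.
void FooBitAndExpressionLocals(FAndExpressionIter SrcIter, uint64* ResultIter, const uint64* ResultEnd)
{
    const uint64* Lhs = SrcIter.LhsIter;
    const uint64* Rhs = SrcIter.RhsIter;

    while (ResultIter != ResultEnd)
    {
        *ResultIter = (*Lhs & *Rhs);
        ++Lhs;
        ++Rhs;
        ++ResultIter;
    }
}
```

The loop body is then the same as in FooBitAndRaw, while the call site still passes a single struct. Compiling with the MSVC option /Qvec-report:2 should report whether each loop was vectorized and, if not, a reason code.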
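More details and notes:
[...] FooBitAndExpression faster, while FooBitAndRaw slowed down considerably, which was unexpected, so I investigated and found this.
[...] FooBitAndRaw). However, the code became less readable, so now I want to regain some readability without hurting performance. I plan to encapsulate some of the source data and operations in structs. Currently, code that uses the optimized bit operations needs a comment on every line to describe the intent, because the intent is not clear from a function call that takes 6+ arguments.
[...] the reason for the FooBitAndExpression function. The next step is to change it into a template to support various input expressions.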