Passing an rvalue to a non-ref parameter: why can't the compiler elide the copy?

Question

struct Big {
    int a[8];
};
void foo(Big a);
Big getStuff();
void test1() {
    foo(getStuff());
}

compiles (using clang 6.0.0 for x86_64 on Linux, so System V ABI; flags: -O3 -march=broadwell) to

test1():                              # @test1()
        sub     rsp, 72
        lea     rdi, [rsp + 40]
        call    getStuff()
        vmovups ymm0, ymmword ptr [rsp + 40]
        vmovups ymmword ptr [rsp], ymm0
        vzeroupper
        call    foo(Big)
        add     rsp, 72
        ret

If I am reading this correctly, this is what is happening:

getStuff is passed a pointer into test1's stack frame (rsp + 40) to use for its return value, so after getStuff returns, rsp + 40 through rsp + 71 contains the result of getStuff.
This result is then immediately copied to a lower stack address, rsp through rsp + 31.
foo is then called, which will read its argument from rsp.

Why is the following code not totally equivalent (and why doesn't the compiler generate it instead)?

test1():                              # @test1()
        sub     rsp, 32
        mov     rdi, rsp
        call    getStuff()
        call    foo(Big)
        add     rsp, 32
        ret

The idea is: have getStuff write directly to the place in the stack that foo will read from.
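To make the two variants easier to compare, here is a minimal C++ sketch that models the lowered calls with explicit pointers. It is only a model, not what any compiler emits: the names getStuff_lowered, foo_lowered, test1_with_copy, test1_no_copy and g_copies are hypothetical, and foo is assumed to return an int so the result can be observed.

```cpp
#include <cassert>
#include <cstring>

struct Big {
    int a[8];
};

// Hypothetical counter: how many struct copies the model performs.
static int g_copies = 0;

// Model of "Big getStuff()" under a hidden-pointer return:
// the callee fills storage provided by the caller.
void getStuff_lowered(Big* ret) {
    for (int i = 0; i < 8; ++i) ret->a[i] = i + 1;
}

// Model of "foo(Big)": the callee reads its by-value argument
// from a slot the caller prepared (assumed to return a sum here).
int foo_lowered(const Big* arg_slot) {
    int sum = 0;
    for (int i = 0; i < 8; ++i) sum += arg_slot->a[i];
    return sum;
}

// Shape of the clang output: return into a temporary, then copy
// the temporary into the argument slot.
int test1_with_copy() {
    Big ret_tmp;   // plays the role of [rsp + 40]
    Big arg_slot;  // plays the role of [rsp]
    getStuff_lowered(&ret_tmp);
    std::memcpy(&arg_slot, &ret_tmp, sizeof(Big));
    ++g_copies;
    return foo_lowered(&arg_slot);
}

// Shape of the proposed code: getStuff writes straight into the
// slot that foo will read from, so no copy is needed.
int test1_no_copy() {
    Big arg_slot;  // plays the role of [rsp]
    getStuff_lowered(&arg_slot);
    return foo_lowered(&arg_slot);
}
```

Both variants hand foo_lowered the same bytes; the only difference in this model is the extra memcpy, which corresponds to the vmovups load/store pair in the listing above.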

Also: here is the result for the same code (with 12 ints instead of 8) compiled by VC++ on Windows for x64, which seems even worse because the Windows x64 ABI passes and returns by reference, so the copy is completely unused!

_TEXT   SEGMENT
$T3 = 32
$T1 = 32
?bar@@YAHXZ PROC                    ; bar, COMDAT

$LN4:
    sub rsp, 88                 ; 00000058H

    lea rcx, QWORD PTR $T1[rsp]
    call    ?getStuff@@YA?AUBig@@XZ         ; getStuff
    lea rcx, QWORD PTR $T3[rsp]
    movups  xmm0, XMMWORD PTR [rax]
    movaps  XMMWORD PTR $T3[rsp], xmm0
    movups  xmm1, XMMWORD PTR [rax+16]
    movaps  XMMWORD PTR $T3[rsp+16], xmm1
    movups  xmm0, XMMWORD PTR [rax+32]
    movaps  XMMWORD PTR $T3[rsp+32], xmm0
    call    ?foo@@YAHUBig@@@Z           ; foo

    add rsp, 88                 ; 00000058H
    ret 0

Answer 1

You're right; this looks like a missed optimization by the compiler. You can report this bug (https://bugs.llvm.org/) if there isn't already a duplicate.

Contrary to popular belief, compilers often don't make optimal code. It's often good enough, and modern CPUs are quite good at plowing through excess instructions when they don't lengthen dependency chains too much, especially the critical path dependency chain if there is one.

x86-64 SysV passes large structs by value on the stack if they don't fit packed into two 64-bit integer registers, and returns them via hidden pointer. The compiler can and should (but doesn't) plan ahead and reuse the return value temporary as the stack-args for the call to foo(Big).
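As a rough illustration of the size rule just described, the sketch below checks whether a plain struct of integers exceeds two 64-bit registers. It is a deliberate simplification, not the full System V classification algorithm (it ignores floating-point and vector members and types that are not trivially copyable); passed_in_memory_sysv and Small are hypothetical names, and int is assumed to be 4 bytes as on x86-64.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

struct Big {
    int a[8];   // 32 bytes with 4-byte int
};

struct Small {
    int a[4];   // 16 bytes with 4-byte int
};

// Simplified rule for trivially copyable structs of integers:
// anything larger than two 64-bit registers (16 bytes) is passed
// in memory on the stack. Hypothetical helper, not a real ABI API.
template <class T>
constexpr bool passed_in_memory_sysv() {
    static_assert(std::is_trivially_copyable<T>::value,
                  "this simplified rule only covers trivially copyable types");
    return sizeof(T) > 2 * sizeof(std::size_t) * 1 && sizeof(T) > 16;
}
```

Under these assumptions Big (32 bytes) goes on the stack, which matches the listing in the question, while a 16-byte struct such as Small would still fit in two integer registers.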

gcc7.3, ICC18, and MSVC CL19 also miss this optimization. :/ I put your code up on the Godbolt compiler explorer with gcc/clang/ICC/MSVC. gcc uses 4x push qword [rsp+24] to copy, while ICC uses extra instructions to align the stack by 32.

Using 1x 32-byte load/store instead of 2x 16-byte probably doesn't justify the cost of the vzeroupper for MSVC / ICC / clang, for a function this small. vzeroupper is cheap on mainstream Intel CPUs (only 4 uops), and I did use -march=haswell to tune for that, not for AMD or KNL where it's more expensive.

Related: x86-64 Windows passes large structs by hidden pointer, as well as returning them that way. The callee owns the pointed-to memory. (See: What happens at assembly level when you have functions with large inputs)

This optimization would still be available by simply reserving space for the temporary + shadow-space before the first call to getStuff(), and allowing the callee to destroy the temporary because we don't need it later.

That's not actually what MSVC does here or in related cases, though, unfortunately.

See also @BeeOnRope's answer, and my comments on it, on "Why isn't pass struct by reference a common optimization?". Making sure the copy-constructor can always run at a sane place for non-trivially-copyable objects is problematic if you're trying to design a calling convention that avoids copying by passing by hidden const-reference (caller owns the memory, callee can copy if needed).

But this is an example of a case where non-const reference (callee owns the memory) is best, because the caller wants to hand off the object to the callee.

There's a potential gotcha, though: if there are any pointers to this object, letting the callee use it directly could introduce bugs. Consider some other function that does global_pointer->a[4]=0;. If our callee calls that function, it will unexpectedly modify our callee's by-value arg.

So letting the callee destroy our copy of the object in the Windows x64 calling convention only works if escape analysis can prove that nothing else has a pointer to this object.
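The gotcha can be modelled in plain C++ by writing the hidden pointer out explicitly. This is a sketch under stated assumptions, not real compiler output: clobber, callee_lowered, caller_with_copy and caller_without_copy are hypothetical names, and the callee is assumed to return how much its argument changed so the effect is observable.

```cpp
#include <cassert>

struct Big {
    int a[8];
};

// Some other part of the program holds a pointer to the caller's object.
Big* global_pointer = nullptr;

// The "other function" from the text above.
void clobber() {
    global_pointer->a[4] = 0;
}

// Model of a callee that receives its by-value argument through a
// hidden pointer to memory it is supposed to own exclusively.
int callee_lowered(Big* arg) {
    int before = arg->a[4];
    clobber();                 // should not be able to touch *arg
    int after = arg->a[4];
    return before - after;     // nonzero: the argument changed underneath us
}

// Safe lowering: the caller makes a private copy for the callee.
int caller_with_copy() {
    Big obj{};
    obj.a[4] = 7;
    global_pointer = &obj;
    Big tmp = obj;             // the copy the calling convention asks for
    int changed = callee_lowered(&tmp);
    global_pointer = nullptr;  // do not leave a dangling pointer behind
    return changed;
}

// Unsafe lowering: the caller hands over its own object even though
// a pointer to it has escaped.
int caller_without_copy() {
    Big obj{};
    obj.a[4] = 7;
    global_pointer = &obj;
    int changed = callee_lowered(&obj);
    global_pointer = nullptr;
    return changed;
}
```

In this model caller_with_copy reports no change, while caller_without_copy reports that a[4] went from 7 to 0 inside the callee, which is the bug that escape analysis would have to rule out before skipping the copy.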

Answer 1
3, accepted, 2018-03-25 10:59:02
