Comparing pass-by-value with pass-by-reference performance by looking at the assembly

Question

I want to test a function in order to verify which is faster: pass-by-value or pass-by-reference.

Here is my test case: https://godbolt.org/g/cjaEx3

Code:

struct Vec4f
{
  float val[4];
};


Vec4f suma(const Vec4f& a, const Vec4f& b)
{
  return {a.val[0] + b.val[0], 
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}

Vec4f sumb(Vec4f a, Vec4f b)
{
  return {a.val[0] + b.val[0], 
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}

Assembly output on x86-64 clang using -O3 -std=c++14:

suma(Vec4f const&, Vec4f const&):                     # @suma(Vec4f const&, Vec4f const&)
        movq    xmm1, qword ptr [rdi]   # xmm1 = mem[0],zero
        movq    xmm0, qword ptr [rsi]   # xmm0 = mem[0],zero
        addps   xmm0, xmm1
        movq    xmm2, qword ptr [rdi + 8] # xmm2 = mem[0],zero
        movq    xmm1, qword ptr [rsi + 8] # xmm1 = mem[0],zero
        addps   xmm1, xmm2
        ret

sumb(Vec4f, Vec4f):                        # @sumb(Vec4f, Vec4f)
        addps   xmm0, xmm2
        addps   xmm1, xmm3
        ret

It turns out that on gcc, clang, and msvc, passing by value produces less assembly in this particular case.

My questions are:

Is comparing assembly line counts generally a good heuristic for comparing the performance of simple functions like these?

Also, since I don't really understand the assembly output:

Can you explain the assembly output of both the suma and sumb functions?

Interestingly, if I change Vec4f to have float val[40] instead, both functions produce the same assembly output. So:

What's the reason for the initial assembly difference?

Answer 1

1) No. Not all instructions execute in the same amount of time, and once memory needs to be accessed there can be a large latency.

2) and 3). suma needs to load the contents of a and b into appropriate registers. In sumb, those values are passed to the function already in registers. In some cases, the loads that suma performs will instead be done by sumb's caller, so the work is moved rather than removed. In other cases, the values may already be in registers, and the caller of suma will first need to store them in memory so that it can create references to them.
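To see the caller's side of this trade-off, a sketch like the following can be pasted into Compiler Explorer. The callers call_suma and call_sumb, the NOINLINE macro, and check_callers_agree are hypothetical names introduced here for illustration; they are not part of the original test case. The macro only wraps the compilers' existing noinline attributes so that the calls stay visible in the output.

```cpp
#include <cassert>

// Hypothetical helper macro: keeps the calls visible in the generated assembly.
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE
#endif

struct Vec4f
{
  float val[4];
};

// Same bodies as in the question.
NOINLINE Vec4f suma(const Vec4f& a, const Vec4f& b)
{
  return {a.val[0] + b.val[0], a.val[1] + b.val[1],
          a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

NOINLINE Vec4f sumb(Vec4f a, Vec4f b)
{
  return {a.val[0] + b.val[0], a.val[1] + b.val[1],
          a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

// Hypothetical caller: the operands are built from scalars, so they may start
// out in registers. To form references, the caller needs them to have addresses.
Vec4f call_suma(float x, float y)
{
  Vec4f a = {{x, x, x, x}};
  Vec4f b = {{y, y, y, y}};
  return suma(a, b);
}

// Hypothetical caller: here the operands can be handed over by value.
Vec4f call_sumb(float x, float y)
{
  Vec4f a = {{x, x, x, x}};
  Vec4f b = {{y, y, y, y}};
  return sumb(a, b);
}

// Sanity check: both calling styles compute the same sums.
void check_callers_agree()
{
  Vec4f r1 = call_suma(1.0f, 2.0f);
  Vec4f r2 = call_sumb(1.0f, 2.0f);
  for (int i = 0; i < 4; ++i)
    assert(r1.val[i] == r2.val[i]);
}
```

Comparing the assembly of call_suma and call_sumb, rather than only the callees, gives a fuller picture of where the loads and stores end up. What the compiler actually emits depends on the target and optimization level.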

When you use float val[40], the struct exceeds the capacity for passing values in registers, so both functions need to load the data from memory first (in suma, by dereferencing the reference; in sumb, by loading the values off the stack).
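The size difference can be made concrete with a few compile-time checks. This is only a rough, size-based sketch: it assumes the System V x86-64 ABI (used on Linux), where aggregates larger than 16 bytes are passed in memory, and the real classification also depends on member types and on whether the type is trivially copyable. Vec40f, kMaxRegisterAggregateBytes, and fits_in_registers are hypothetical names introduced for illustration.

```cpp
#include <cstddef>

struct Vec4f
{
  float val[4];
};

// Hypothetical name for the float val[40] variant from the question.
struct Vec40f
{
  float val[40];
};

// Assumption: float is 4 bytes, as on the x86-64 targets discussed here.
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(sizeof(Vec4f) == 16, "4 floats occupy 16 bytes");
static_assert(sizeof(Vec40f) == 160, "40 floats occupy 160 bytes");

// Assumption (System V x86-64 ABI): aggregates larger than two 8-byte
// chunks are passed in memory rather than in registers.
constexpr std::size_t kMaxRegisterAggregateBytes = 16;

// Rough size-only check; ignores the other classification rules.
constexpr bool fits_in_registers(std::size_t bytes)
{
  return bytes <= kMaxRegisterAggregateBytes;
}

static_assert(fits_in_registers(sizeof(Vec4f)), "Vec4f can travel in registers");
static_assert(!fits_in_registers(sizeof(Vec40f)), "Vec40f is passed in memory");
```

Other calling conventions draw the line elsewhere, so the threshold above should not be read as universal.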

Answer 2

1) It might serve as a rough heuristic, but it cannot be trusted. For example, a single div instruction can be slower than 20 simple instructions. So I wouldn't bother looking at instruction counts at all.
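Since instruction counts are unreliable, measuring is the safer route. Below is a minimal timing sketch using std::chrono, not a rigorous benchmark: it ignores warm-up, CPU frequency scaling, and statistical noise. The names time_ns, report, and the NOINLINE macro are hypothetical and were introduced for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Hypothetical helper macro: prevents the calls from being inlined away.
#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE
#endif

struct Vec4f
{
  float val[4];
};

NOINLINE Vec4f suma(const Vec4f& a, const Vec4f& b)
{
  return {a.val[0] + b.val[0], a.val[1] + b.val[1],
          a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

NOINLINE Vec4f sumb(Vec4f a, Vec4f b)
{
  return {a.val[0] + b.val[0], a.val[1] + b.val[1],
          a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

// Calls f `iterations` times, feeding each result into the next call so the
// work cannot be discarded. Returns elapsed nanoseconds; the final
// accumulated vector is written to `out`.
template <typename F>
double time_ns(F f, int iterations, Vec4f& out)
{
  Vec4f acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
  const Vec4f step = {{1.0f, 2.0f, 3.0f, 4.0f}};
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
    acc = f(acc, step);
  auto stop = std::chrono::steady_clock::now();
  out = acc;
  return std::chrono::duration<double, std::nano>(stop - start).count();
}

// Times both variants and prints the elapsed time for each.
void report(int iterations)
{
  Vec4f ra = {}, rb = {};
  double ta = time_ns(suma, iterations, ra);
  double tb = time_ns(sumb, iterations, rb);
  for (int i = 0; i < 4; ++i)
    assert(ra.val[i] == rb.val[i]);  // both variants must agree
  std::printf("suma: %.0f ns, sumb: %.0f ns\n", ta, tb);
}
```

Calling report(10000000) from main and repeating the run a few times gives a first impression. Forcing noinline also changes what is being measured, since in real code such small functions are usually inlined and the difference may disappear entirely.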

2), 3)

Here's a little explanation of the assembly you listed:

clang only uses half of each vector register (an xmm register can hold 4 float values, but clang only uses 2 per register). This is most likely because of the calling convention: on x86-64 Linux, the System V ABI passes a 16-byte struct of floats as two 8-byte halves, each in the low half of its own xmm register.

// this function has two reference parameters
// register rdi holds a pointer to the first parameter (its address, not its value)
// register rsi holds a pointer to the second parameter
// registers xmm0 and xmm1 contain the result
suma(Vec4f const&, Vec4f const&):
        movq    xmm1, qword ptr [rdi]   # xmm1 will contain the first 2 floats of the first parameter
        movq    xmm0, qword ptr [rsi]   # xmm0 will contain the first 2 floats of the second parameter
        addps   xmm0, xmm1              # let's add them together, xmm0 contains the result
        movq    xmm2, qword ptr [rdi + 8] # xmm2 will contain the second 2 floats of the first parameter
        movq    xmm1, qword ptr [rsi + 8] # xmm1 will contain the second 2 floats of the second parameter
        addps   xmm1, xmm2              # let's add them together, xmm1 contains the result
        ret

// this function has two parameters, both passed by value
// the first is passed in xmm0 and xmm1
// the second is passed in xmm2 and xmm3
// registers xmm0 and xmm1 contain the result
sumb(Vec4f, Vec4f):
        addps   xmm0, xmm2
        addps   xmm1, xmm3
        ret

"if I change Vec4f to have float val[40] instead, both functions produce the same assembly output."

This is false. They don't. They seem to be the same at first sight, but they are not.

Part of the code is the same in both functions: the returned struct holds a float[40] of which only the first four elements are given values, so the remaining 36 are zero, and both functions must contain code that zeroes them. That is the code you see, and it is identical. The other parts differ.
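The zero-filling comes from aggregate initialization: elements without an explicit initializer are zero-initialized. A small sketch, where Vec40f, sum40, and check_zero_fill are hypothetical names standing in for the float val[40] variant of the question:

```cpp
#include <cassert>

// Hypothetical name for the float val[40] variant.
struct Vec40f
{
  float val[40];
};

// Same shape as suma in the question: only four initializers are supplied,
// so elements 4..39 of the returned struct are zero-initialized.
Vec40f sum40(const Vec40f& a, const Vec40f& b)
{
  return {a.val[0] + b.val[0],
          a.val[1] + b.val[1],
          a.val[2] + b.val[2],
          a.val[3] + b.val[3]};
}

// Fills both inputs with non-zero values and checks what comes back.
void check_zero_fill()
{
  Vec40f a, b;
  for (int i = 0; i < 40; ++i)
  {
    a.val[i] = 1.0f;
    b.val[i] = 2.0f;
  }
  Vec40f r = sum40(a, b);
  for (int i = 0; i < 4; ++i)
    assert(r.val[i] == 3.0f);  // explicitly computed elements
  for (int i = 4; i < 40; ++i)
    assert(r.val[i] == 0.0f);  // remaining elements are zeroed
}
```

The stores that zero those 36 trailing floats are what make the two listings look alike at first glance.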

Answer 1
4, accepted, 2017-07-17 03:54:32

Answer 2
1, 2017-07-17 10:51:59