Comparing pass-by-value with pass-by-reference performance by looking at assembly
I want to test a function to find out which is faster: pass-by-value or pass-by-reference.
Here is my test case: https://godbolt.org/g/cjaEx3
Code:
struct Vec4f
{
    float val[4];
};

Vec4f suma(const Vec4f& a, const Vec4f& b)
{
    return {a.val[0] + b.val[0],
            a.val[1] + b.val[1],
            a.val[2] + b.val[2],
            a.val[3] + b.val[3]};
}

Vec4f sumb(Vec4f a, Vec4f b)
{
    return {a.val[0] + b.val[0],
            a.val[1] + b.val[1],
            a.val[2] + b.val[2],
            a.val[3] + b.val[3]};
}
Assembly output on x86-64 clang using -O3 -std=c++14:
suma(Vec4f const&, Vec4f const&): # @suma(Vec4f const&, Vec4f const&)
movq xmm1, qword ptr [rdi] # xmm1 = mem[0],zero
movq xmm0, qword ptr [rsi] # xmm0 = mem[0],zero
addps xmm0, xmm1
movq xmm2, qword ptr [rdi + 8] # xmm2 = mem[0],zero
movq xmm1, qword ptr [rsi + 8] # xmm1 = mem[0],zero
addps xmm1, xmm2
ret
sumb(Vec4f, Vec4f): # @sumb(Vec4f, Vec4f)
addps xmm0, xmm2
addps xmm1, xmm3
ret
It turns out that on gcc, clang, and msvc, passing by value produces fewer assembly instructions in this particular case.
My questions are:
1) Since passing by value produces fewer instructions, does that mean sumb is faster than suma?
2) Also, as I don't really understand the assembly output, what does it do in the suma and sumb functions?
3) Interestingly, if I change Vec4f to have float val[40] instead, both functions produce the same assembly output.
1) No. Not all instructions execute in the same amount of time, and once memory needs to be accessed there can be a large latency.
2) and 3). suma needs to load the contents of a and b into appropriate registers. In sumb, those values are passed to the function already in registers. In some cases, the register loading that suma does will instead be done by sumb's caller. In other cases, the values may already be in registers, and suma's caller will first need to store them in memory so that it can create references to them.
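To make that last case concrete, here is a minimal sketch of a caller whose operands are freshly computed temporaries. The names scale, caller_ref, caller_val and check_callers are hypothetical, and whether a given compiler really spills to the stack depends on inlining and optimization settings, so treat the comments as a description of the non-inlined case only.

```cpp
#include <cassert>

struct Vec4f
{
    float val[4];
};

Vec4f suma(const Vec4f& a, const Vec4f& b)
{
    return {a.val[0] + b.val[0], a.val[1] + b.val[1],
            a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

Vec4f sumb(Vec4f a, Vec4f b)
{
    return {a.val[0] + b.val[0], a.val[1] + b.val[1],
            a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

// Hypothetical helper: produces a fresh temporary, typically returned in registers.
Vec4f scale(Vec4f v, float k)
{
    return {v.val[0] * k, v.val[1] * k, v.val[2] * k, v.val[3] * k};
}

// A reference needs an address: if suma is not inlined, both temporaries
// have to be written to memory before the call so they can be pointed to.
Vec4f caller_ref(Vec4f v)
{
    return suma(scale(v, 2.0f), scale(v, 3.0f));
}

// By value, the temporaries can be handed over directly
// (on ABIs that pass small structs in registers).
Vec4f caller_val(Vec4f v)
{
    return sumb(scale(v, 2.0f), scale(v, 3.0f));
}

// Usage check: both call styles compute the same sums.
void check_callers()
{
    const Vec4f v{1.0f, 2.0f, 3.0f, 4.0f};
    const Vec4f r = caller_ref(v);
    const Vec4f s = caller_val(v);
    for (int i = 0; i < 4; ++i)
        assert(r.val[i] == s.val[i]);
}
```

Comparing the assembly of caller_ref and caller_val, rather than suma and sumb alone, shows where the loads and stores actually end up.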
When you use float val[40], that exceeds the capacity for passing values by register, so both functions need to load the data from memory first (in suma, by dereferencing the reference; in sumb, by loading the values off the stack).
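As a rough check of the register-capacity argument, the sketch below compares the sizes of the two layouts. Vec40f, kMaxRegisterBytes, fits_in_registers and check_sizes are hypothetical names, and the 16-byte limit is an assumption that matches the System V x86-64 calling convention for structs of floats like these; other ABIs use different rules.

```cpp
#include <cassert>
#include <cstddef>

struct Vec4f  { float val[4]; };
struct Vec40f { float val[40]; };  // hypothetical name for the larger variant

// Assumption: System V x86-64, where a struct of floats like these is passed
// in XMM registers only if it occupies at most 16 bytes.
constexpr std::size_t kMaxRegisterBytes = 16;

constexpr bool fits_in_registers(std::size_t bytes)
{
    return bytes <= kMaxRegisterBytes;
}

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
static_assert(fits_in_registers(sizeof(Vec4f)),
              "Vec4f: 16 bytes, split across two XMM registers");
static_assert(!fits_in_registers(sizeof(Vec40f)),
              "Vec40f: 160 bytes, copied to the stack instead");

// Usage check of the sizes the argument relies on.
void check_sizes()
{
    assert(sizeof(Vec4f) == 16);
    assert(sizeof(Vec40f) == 160);
}
```

This only models the size rule; the generated assembly remains the authority on what a particular compiler does.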
1) Maybe this can be used as a heuristic, but it cannot be trusted at all. For example, a single div instruction can be slower than 20 simple instructions. So I wouldn't bother looking at instruction counts at all.
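If instruction counts can't be trusted, the alternative is to measure. Below is a minimal timing sketch using std::chrono; time_ns and compare are hypothetical helpers, not part of the original question. At -O3 the compiler may inline both calls and produce identical loops, so check the generated code before drawing conclusions from the numbers.

```cpp
#include <cassert>
#include <chrono>

struct Vec4f
{
    float val[4];
};

Vec4f suma(const Vec4f& a, const Vec4f& b)
{
    return {a.val[0] + b.val[0], a.val[1] + b.val[1],
            a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

Vec4f sumb(Vec4f a, Vec4f b)
{
    return {a.val[0] + b.val[0], a.val[1] + b.val[1],
            a.val[2] + b.val[2], a.val[3] + b.val[3]};
}

// Hypothetical helper: applies f(acc, step) `iters` times and returns the
// elapsed time in nanoseconds. The result is accumulated into `acc` so the
// work cannot be discarded as unused.
template <typename F>
long long time_ns(F f, int iters, Vec4f& acc)
{
    const Vec4f step{1.0f, 1.0f, 1.0f, 1.0f};
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        acc = f(acc, step);  // each iteration depends on the previous one
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Usage: time both variants over the same number of iterations.
void compare(int iters, long long& ns_ref, long long& ns_val)
{
    Vec4f acc_ref{};
    Vec4f acc_val{};
    ns_ref = time_ns(suma, iters, acc_ref);
    ns_val = time_ns(sumb, iters, acc_val);
    assert(acc_ref.val[0] == acc_val.val[0]);  // same arithmetic either way
}
```

A dedicated benchmarking framework would handle warm-up and noise better; this is only meant to show the shape of such a measurement.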
2), 3)
Here's a little explanation of the assembly you listed:
clang only uses half of each vector register (an xmmX register can hold 4 float values, but clang only uses 2). Maybe it is because of calling conventions.
// this function has two reference parameters
// register rdi holds a pointer to the first parameter (its address, not its value)
// register rsi holds a pointer to the second parameter
// registers xmm0 and xmm1 contain the result
suma(Vec4f const&, Vec4f const&):
movq xmm1, qword ptr [rdi] # xmm1 will contain the first 2 floats of the first parameter
movq xmm0, qword ptr [rsi] # xmm0 will contain the first 2 floats of the second parameter
addps xmm0, xmm1 # let's add them together, xmm0 contains the result
movq xmm2, qword ptr [rdi + 8] # xmm2 will contain the second 2 floats of the first parameter
movq xmm1, qword ptr [rsi + 8] # xmm1 will contain the second 2 floats of the second parameter
addps xmm1, xmm2 # let's add them together, xmm1 contains the result
ret
// this function has two parameters
// the first is passed in xmm0 and xmm1
// the second is passed in xmm2 and xmm3
// registers xmm0 and xmm1 contain the result
sumb(Vec4f, Vec4f):
addps xmm0, xmm2
addps xmm1, xmm3
ret
"if I change Vec4f to have float val[40] instead, both functions produce the same assembly output."
This is false. They don't. They seem to be the same at first sight, but they are not.
Part of the code is the same in both functions: because you return a float[40] in which most members are zero, both functions need code that zeros those elements. You see that code, and it is the same. The other parts differ.
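The zeroing comes from aggregate initialization: array elements without an initializer are value-initialized, which for float means 0.0f. A minimal sketch, with Vec40f, sum40 and tail_is_zero as hypothetical names for the enlarged struct, its function, and a small checker:

```cpp
#include <cassert>

struct Vec40f
{
    float val[40];
};

// Only four initializers are given; the remaining 36 elements are
// zero-initialized. That is the zeroing code both functions share.
Vec40f sum40(const Vec40f& a, const Vec40f& b)
{
    return {a.val[0] + b.val[0],
            a.val[1] + b.val[1],
            a.val[2] + b.val[2],
            a.val[3] + b.val[3]};
}

// Returns true if elements 4..39 are all zero.
bool tail_is_zero(const Vec40f& v)
{
    for (int i = 4; i < 40; ++i)
        if (v.val[i] != 0.0f)
            return false;
    return true;
}
```

Note that the inputs' elements beyond index 3 never reach the result, so the only per-parameter work left is loading four floats from each argument, which is where the two versions can differ.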