C++ - Apply std::exp to an std::vector
Is there a faster way (from a performance perspective) than simply doing the following?
std::vector<double> y;
y.reserve(x.size());
for (size_t i = 0; i < x.size(); ++i)
    y.push_back(std::exp(x[i]));
If you require maximum precision to the nearest ULP, this is likely as fast as you're going to get.
If you can accept some approximation error, there are much faster methods that use SIMD.
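To make the trade-off concrete, here is a minimal sketch of the kind of branch-free approximation that SIMD-oriented methods build on. It is not taken from any particular library: the names approx_exp and approx_exp_all are hypothetical, the degree-7 Taylor polynomial is my own choice for illustration, and the function assumes finite inputs of moderate size (roughly |x| < 700) with no handling of NaN, infinity, overflow or underflow.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: approximates exp(x) to roughly 1e-8 relative error.
// Assumes a finite x with |x| < 700; it is not correctly rounded.
inline double approx_exp(double x) {
    constexpr double log2e = 1.4426950408889634;  // 1 / ln(2)
    constexpr double ln2   = 0.6931471805599453;

    // Range reduction: x = k*ln2 + r, with |r| <= ln2/2.
    double const k = std::nearbyint(x * log2e);
    double const r = x - k * ln2;

    // Degree-7 Taylor polynomial for exp(r), evaluated with Horner's rule.
    double const p =
        1.0 + r * (1.0 + r * (1.0 / 2 + r * (1.0 / 6 + r * (1.0 / 24 +
              r * (1.0 / 120 + r * (1.0 / 720 + r * (1.0 / 5040)))))));

    // Build 2^k directly from the IEEE-754 exponent bits.
    std::int64_t const bits = (static_cast<std::int64_t>(k) + 1023) << 52;
    double scale;
    std::memcpy(&scale, &bits, sizeof scale);
    return p * scale;
}

// Hypothetical wrapper: applies the approximation to every element in place.
auto approx_exp_all(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x = approx_exp(x); }
    return xs;
}
```

The loop body contains no calls and no branches, which is the shape that a compiler or hand-written intrinsics can process several elements at a time. Whether a given compiler actually vectorizes it is not guaranteed, and a production implementation would use a better polynomial and handle the edge cases.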
push_back, surprisingly, has a little bit of overhead, because it doesn't actually know that you've reserved enough space, so it always has to check. Because this check can change control flow between loop iterations, push_back precludes automatic vectorization by the compiler.
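One way to sidestep the capacity check, sketched here under the assumption that zero-initializing the output first is acceptable, is to construct the output at its final size and write into it with std::transform. The name exp_presized is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical alternative to reserve + push_back: the output already has
// its final size, so the loop never has to test the capacity.
auto exp_presized(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>(xs.size());  // value-initialized to 0.0
    std::transform(xs.begin(), xs.end(), ys.begin(),
                   [](double x) { return std::exp(x); });
    return ys;
}
```

The cost of this variant is one extra pass that zero-fills ys before it is overwritten, so it is a trade-off rather than a guaranteed win.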
Consider these two functions, where the first one uses push_back, while the second one modifies a copy (or moved-into value) in-place:
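#include <cmath>
#include <vector>

// Builds a new vector with reserve + push_back.
auto exp1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for (auto x : xs) { ys.push_back(std::exp(x)); }
    return ys;
}

// Takes the vector by value and overwrites each element in place.
auto exp2(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x = std::exp(x); }
    return xs;
}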
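The "copy (or moved-into value)" distinction depends on how the by-value function is called. A small sketch of both call styles follows; exp_in_place has the same body as exp2, and all three names are hypothetical.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical name; same body as exp2.
auto exp_in_place(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x = std::exp(x); }
    return xs;
}

// The caller still needs the input afterwards: the argument is copied,
// and the copy is overwritten.
auto keep_input(std::vector<double> const& input) -> std::vector<double> {
    return exp_in_place(input);
}

// The caller is done with the input: moving hands over the existing
// buffer, so no new allocation or element copy is needed.
auto consume_input(std::vector<double>&& input) -> std::vector<double> {
    return exp_in_place(std::move(input));
}
```

In the first case the by-value version still pays for one allocation and one copy, much like exp1 pays for its reserve. In the second case it reuses the caller's storage.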
Let's look at the assembly output when these are compiled with GCC 9.1 using:
gcc -std=c++17 -O3 -march=skylake-avx512
Here is exp1's inner loop (embedded in quite a bit of additional code which will never be executed, because the space has already been reserved):
.L45:
add rbx, 8
vmovsd QWORD PTR [r14], xmm0
add r14, 8
cmp r12, rbx
je .L44
.L18:
vmovsd xmm0, QWORD PTR [rbx]
call exp
vmovsd QWORD PTR [rsp], xmm0
cmp rbp, r14
jne .L45
And here's exp2's:
.L53:
vmovsd xmm0, QWORD PTR [rbx]
add rbx, 8
call exp
vmovsd QWORD PTR [rbx-8], xmm0
cmp rbp, rbx
jne .L53
In practice, they are basically the same, because exp is complicated and GCC does not automatically vectorize the call with these flags. However, consider the case where something much simpler happens in the inner loop:
// Builds a new vector with reserve + push_back.
auto sq1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for (auto x : xs) { ys.push_back(x * x); }
    return ys;
}

// Takes the vector by value and squares each element in place.
auto sq2(std::vector<double> xs) -> std::vector<double> {
    for (auto& x : xs) { x *= x; }
    return xs;
}
Here's sq1's inner loop:
.L89:
vmovsd QWORD PTR [rsi], xmm0
add rbx, 8
add rsi, 8
mov QWORD PTR [rsp+24], rsi
cmp rbp, rbx
je .L72
.L75:
vmovsd xmm0, QWORD PTR [rbx]
mov rsi, QWORD PTR [rsp+24]
vmulsd xmm0, xmm0, xmm0
vmovsd QWORD PTR [rsp+8], xmm0
cmp rsi, QWORD PTR [rsp+32]
jne .L89
Here's sq2's. Note that it uses vmulpd and ymm registers, and that it advances by 32 bytes at a time rather than 8.
.L11:
vmovupd ymm0, YMMWORD PTR [rdx]
add rdx, 32
vmulpd ymm0, ymm0, ymm0
vmovupd YMMWORD PTR [rdx-32], ymm0
cmp rdx, rcx
jne .L11
Of course, this inner-loop snippet is a little misleading: it hides a large amount of code that deals with the remainder of the std::vector when its size is not a multiple of 4. Still, my main point is that yes, you can actually do marginally better than reserve + push_back (this surprised me quite a bit when I first found out), and that the difference would be significant if we weren't dealing with exp in particular.
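For readers who want to picture the hidden remainder handling, here is a scalar sketch of the loop structure: a main loop over blocks of four doubles, then a short tail loop. It only illustrates the shape and is not the compiler's actual output; the name sq_blocked is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of a vectorized loop's structure, written
// as scalar code.
auto sq_blocked(std::vector<double> xs) -> std::vector<double> {
    constexpr std::size_t width = 4;  // four doubles fill one 256-bit ymm register
    std::size_t const n = xs.size();
    std::size_t const main_end = n - n % width;  // largest multiple of 4 <= n

    std::size_t i = 0;
    // Main loop: one block of four elements (32 bytes) per iteration.
    for (; i < main_end; i += width) {
        xs[i]     *= xs[i];
        xs[i + 1] *= xs[i + 1];
        xs[i + 2] *= xs[i + 2];
        xs[i + 3] *= xs[i + 3];
    }
    // Remainder loop: at most three leftover elements, one at a time.
    for (; i < n; ++i) { xs[i] *= xs[i]; }
    return xs;
}
```

With seven elements, for example, the main loop runs once for indices 0 to 3 and the remainder loop handles indices 4 to 6.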