C++ - Apply std::exp to an std::vector

Is there a faster way (from a performance perspective) than simply doing the following?

std::vector<double> y;
y.reserve(x.size());
for(size_t i = 0; i < x.size(); ++i)
    y.push_back(std::exp(x[i]));

If you require maximum precision, to the nearest ULP, this is likely as fast as you're going to get.

If you can accept some approximation error, there are much faster methods that use SIMD.
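As a rough illustration of the kind of approximation such methods rely on, here is a minimal sketch that is not taken from the answer above. The names approx_exp and approx_exp_all are hypothetical. It assumes finite inputs with |x| up to roughly 700 and does not handle NaN, infinities, overflow, or underflow. It uses range reduction plus a degree-6 Taylor polynomial, which should give a relative error on the order of 1e-7, far from ULP accuracy. The loop body is branch-free and calls no library routine other than floor, which is the shape compilers tend to vectorize, but whether that actually happens depends on the compiler, flags, and target.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the original answer): approximate exp(x).
// Assumes x is finite and |x| is at most about 700.
inline double approx_exp(double x) {
    constexpr double inv_ln2 = 1.4426950408889634;  // 1 / ln(2)
    constexpr double ln2     = 0.6931471805599453;  // ln(2)

    // Range reduction: x = k*ln2 + r with |r| <= ln2/2.
    const double kd = std::floor(x * inv_ln2 + 0.5);
    const double r  = x - kd * ln2;

    // Degree-6 Taylor polynomial for exp(r), evaluated in Horner form.
    double p = 1.0 / 720.0;
    p = p * r + 1.0 / 120.0;
    p = p * r + 1.0 / 24.0;
    p = p * r + 1.0 / 6.0;
    p = p * r + 0.5;
    p = p * r + 1.0;
    p = p * r + 1.0;

    // Build 2^k by writing the biased exponent into the bit pattern of a double.
    const std::uint64_t bits =
        static_cast<std::uint64_t>(static_cast<std::int64_t>(kd) + 1023) << 52;
    double scale;
    std::memcpy(&scale, &bits, sizeof scale);

    return p * scale;  // exp(x) is approximately 2^k * exp(r)
}

// Hypothetical wrapper: apply the approximation in place to a copy of the input.
inline std::vector<double> approx_exp_all(std::vector<double> xs) {
    for (auto& x : xs) { x = approx_exp(x); }
    return xs;
}
```

Calling approx_exp_all(x) in place of the original loop trades accuracy for speed; measure both on your own data before relying on it.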

Surprisingly, push_back has a little bit of overhead: it doesn't know that you've already reserved enough space, so it has to check the capacity on every call. Because this check can change control flow between loop iterations, push_back prevents the compiler from automatically vectorizing the loop.

Consider these two functions. The first one uses push_back, while the second one modifies a copy (or moved-into value) in place:

auto exp1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for(auto x : xs){ ys.push_back(std::exp(x)); }
    return ys;
}

auto exp2(std::vector<double> xs) -> std::vector<double> {
    for(auto & x : xs){ x = std::exp(x); }
    return xs;
}

Let's look at the assembly output when compiled with GCC 9.1 using:

gcc -std=c++17 -O3 -march=skylake-avx512

Here is exp1's inner loop (embedded in quite a bit of additional code that will never execute, because the space has already been reserved):

.L45:
        add     rbx, 8
        vmovsd  QWORD PTR [r14], xmm0
        add     r14, 8
        cmp     r12, rbx
        je      .L44
.L18:
        vmovsd  xmm0, QWORD PTR [rbx]
        call    exp
        vmovsd  QWORD PTR [rsp], xmm0
        cmp     rbp, r14
        jne     .L45

And here's exp2's:

.L53:
        vmovsd  xmm0, QWORD PTR [rbx]
        add     rbx, 8
        call    exp
        vmovsd  QWORD PTR [rbx-8], xmm0
        cmp     rbp, rbx
        jne     .L53

In practice, the two perform basically the same, because exp is complicated and GCC doesn't know how to vectorize it automatically. However, consider the case where the inner loop does something much simpler:

auto sq1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for(auto x : xs){ ys.push_back(x*x); }
    return ys;
}

auto sq2(std::vector<double> xs) -> std::vector<double> {
    for(auto & x : xs){ x *= x; }
    return xs;
}

Here's sq1's inner loop:

.L89:
        vmovsd  QWORD PTR [rsi], xmm0
        add     rbx, 8
        add     rsi, 8
        mov     QWORD PTR [rsp+24], rsi
        cmp     rbp, rbx
        je      .L72
.L75:
        vmovsd  xmm0, QWORD PTR [rbx]
        mov     rsi, QWORD PTR [rsp+24]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rsp+8], xmm0
        cmp     rsi, QWORD PTR [rsp+32]
        jne     .L89

Here's sq2's. Note that it uses vmulpd and ymm registers, and that it advances 32 bytes at a time rather than 8.

.L11:
        vmovupd ymm0, YMMWORD PTR [rdx]
        add     rdx, 32
        vmulpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdx-32], ymm0
        cmp     rdx, rcx
        jne     .L11

Of course, this inner-loop snippet is a little misleading: it hides an immense amount of code used to deal with the remainder of the std::vector when its size is not divisible by 4. Still, my main point is that yes, you actually can do marginally better than reserve + push_back (this surprised me quite a bit when I first found out), and that the improvement would be significantly larger if we weren't dealing with exp in particular.
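If copying the input just to overwrite it feels wasteful, the same idea (no push_back in the loop) can be sketched with a pre-sized output vector and std::transform. This is an illustration rather than part of the measurements above. The names exp3 and sq3 are hypothetical, and the assembly for them has not been inspected here. Sizing the vector up front value-initializes every element to zero, which is an extra pass over memory that reserve avoids.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical variant: write into a pre-sized output instead of using push_back.
auto exp3(std::vector<double> const& xs) -> std::vector<double> {
    // Constructing with a size zero-initializes the elements, so the loop
    // body below is a plain store with no capacity check.
    auto ys = std::vector<double>(xs.size());
    std::transform(xs.begin(), xs.end(), ys.begin(),
                   [](double x) { return std::exp(x); });
    return ys;
}

// Same pattern for the simpler squaring loop.
auto sq3(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>(xs.size());
    std::transform(xs.begin(), xs.end(), ys.begin(),
                   [](double x) { return x * x; });
    return ys;
}
```

Whether this beats the copy-and-modify versions depends on the cost of the zero-fill relative to the copy, so it is worth benchmarking both on the actual workload.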
