简体   繁体   中英

C++ - Apply std::exp to an std::vector

Is there a faster (from a performance perspective) way than simply do

std::vector<double> y;
y.reserve(x.size());
for(size_t i = 0; i < x.size(); ++i)
    y.push_back(std::exp(x[i]));

If you require maximum precision to the nearest ULP, this is likely as fast as you're going to get.

If you can accept some approximation errors, there are much faster methods that use SIMD .

push_back , surprisingly, has a little bit of overhead, because it doesn't actually know that you've reserved enough space, so it always has to check. Because this check can change control flow between loop iterations, push_back precludes automatic vectorization by the compiler.

Consider these two functions, where the first one uses push_back , while the second one modifies a copy (or moved-into value) in-place:

auto exp1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for(auto x : xs){ ys.push_back(std::exp(x)); }
}

auto exp2(std::vector<double> xs) -> std::vector<double> {
    for(auto & x : xs){ x = std::exp(x); }
    return xs;
}

We'll look at the assembly output , if compiled in GCC 9.1 with

gcc -std=c++17 -O3 -march=skylake-avx512

Here is exp1 's inner loop (embedded in quite a bit of additional code which will never be executed because you've already reserve d):

.L45:
        add     rbx, 8
        vmovsd  QWORD PTR [r14], xmm0
        add     r14, 8
        cmp     r12, rbx
        je      .L44
.L18:
        vmovsd  xmm0, QWORD PTR [rbx]
        call    exp
        vmovsd  QWORD PTR [rsp], xmm0
        cmp     rbp, r14
        jne     .L45

And here's exp2 's:

.L53:
        vmovsd  xmm0, QWORD PTR [rbx]
        add     rbx, 8
        call    exp
        vmovsd  QWORD PTR [rbx-8], xmm0
        cmp     rbp, rbx
        jne     .L53

In practice, they are basically the same, because exp is complicated and GCC doesn't know how to automatically vectorize it. However, consider the case where something much simpler happens in the inner loop:

auto sq1(std::vector<double> const& xs) -> std::vector<double> {
    auto ys = std::vector<double>{};
    ys.reserve(xs.size());
    for(auto x : xs){ ys.push_back(x*x); }
}

auto sq2(std::vector<double> xs) -> std::vector<double> {
    for(auto & x : xs){ x *= x; }
    return xs;
}

Here's sq1 's inner loop:

.L89:
        vmovsd  QWORD PTR [rsi], xmm0
        add     rbx, 8
        add     rsi, 8
        mov     QWORD PTR [rsp+24], rsi
        cmp     rbp, rbx
        je      .L72
.L75:
        vmovsd  xmm0, QWORD PTR [rbx]
        mov     rsi, QWORD PTR [rsp+24]
        vmulsd  xmm0, xmm0, xmm0
        vmovsd  QWORD PTR [rsp+8], xmm0
        cmp     rsi, QWORD PTR [rsp+32]
        jne     .L89

Here's sq2 's. Note that it uses vmulpd and ymm registers, and that it jumps by 32 bytes at a time rather than 8 at a time.

.L11:
        vmovupd ymm0, YMMWORD PTR [rdx]
        add     rdx, 32
        vmulpd  ymm0, ymm0, ymm0
        vmovupd YMMWORD PTR [rdx-32], ymm0
        cmp     rdx, rcx
        jne     .L11

Of course, this inner-loop snippet is a little misleading: it hides an immense amount of code used to deal with the remainder of the std::vector if its size does not divide evenly by 4. Still, my main point is that yes, you actually can do marginally better than reserve + push_back (this surprised me quite a bit when I first found out), and that it would be significantly better if we weren't dealing with exp in particular.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM