
Understanding volatile asm vs volatile variable

Consider the following program, which just runs a loop that we time:

#include <cstdlib>

std::size_t count(std::size_t n)
{
#ifdef VOLATILEVAR
    volatile std::size_t i = 0;
#else
    std::size_t i = 0;
#endif
    while (i < n) {
#ifdef VOLATILEASM
        asm volatile("": : :"memory");
#endif
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}

For readability, the version with both the volatile variable and the volatile asm reads as follows:

#include <cstdlib>

std::size_t count(std::size_t n)
{
    volatile std::size_t i = 0;
    while (i < n) {
        asm volatile("": : :"memory");
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}

Compilation under g++ 8 with g++ -Wall -Wextra -g -std=c++11 -O3 loop.cpp -o loop gives roughly the following timings:

  • default: 0m0.001s
  • -DVOLATILEASM: 0m1.171s
  • -DVOLATILEVAR: 0m5.954s
  • -DVOLATILEVAR -DVOLATILEASM: 0m5.965s

The question I have is: why is that? The default version makes sense, since the loop is optimized away by the compiler. But I have a harder time understanding why -DVOLATILEVAR takes so much longer than -DVOLATILEASM, since both should force the loop to run.

Compiler Explorer gives the following count function for -DVOLATILEASM:

count(unsigned long):
  mov rax, rdi
  test rdi, rdi
  je .L2
  xor edx, edx
.L3:
  add rdx, 1
  cmp rax, rdx
  jne .L3
.L2:
  ret

and for -DVOLATILEVAR (and the combined -DVOLATILEASM -DVOLATILEVAR):

count(unsigned long):
  mov QWORD PTR [rsp-8], 0
  mov rax, QWORD PTR [rsp-8]
  cmp rdi, rax
  jbe .L2
.L3:
  mov rax, QWORD PTR [rsp-8]
  add rax, 1
  mov QWORD PTR [rsp-8], rax
  mov rax, QWORD PTR [rsp-8]
  cmp rax, rdi
  jb .L3
.L2:
  mov rax, QWORD PTR [rsp-8]
  ret

What is the exact reason for that? Why does the volatile qualification of the variable prevent the compiler from generating the same loop as the one with asm volatile?

When you make i volatile, you tell the compiler that something it doesn't know about can change its value. That means it is forced to load its value every time you use it, and it has to store it every time you write to it. When i is not volatile, the compiler can optimize that synchronization away.
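
As a small illustrative sketch (my own example, separate from the question's loop, assuming a typical optimizing compiler), polling a flag shows the difference:

// With a plain bool, the compiler may load 'done_plain' once and keep the
// result in a register, turning the loop into a no-op or an infinite loop.
// With volatile, 'done_volatile' must be reloaded from memory on every iteration.
bool done_plain;
volatile bool done_volatile;

void wait_plain()    { while (!done_plain) { } }     // load may be hoisted out of the loop
void wait_volatile() { while (!done_volatile) { } }  // reloaded every iteration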

-DVOLATILEVAR forces the compiler to keep the loop counter in memory, so the loop bottlenecks on the latency of the store/reload (store forwarding), ~5 cycles, plus 1 cycle of latency for the add.

Every assignment to and read from the volatile i is considered an observable side effect of the program that the optimizer has to make happen in memory, not just in a register. This is what volatile means.

There's also a reload for the compare, but that's only a throughput issue, not a latency issue. The ~6-cycle loop-carried data dependency means your CPU doesn't bottleneck on any throughput limits.
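
As a rough sanity check (my numbers, assuming a clock around 4 GHz and a large n passed on the command line): ~6 cycles per iteration for the volatile loop vs ~1 cycle per iteration for the register-only loop predicts roughly a 6x difference, which is in the same ballpark as the measured 5.954 s / 1.171 s ≈ 5x.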

This is similar to what you'd get from -O0 compiler output, so have a look at my answer on Adding a redundant assignment speeds up code when compiled without optimization for more about loops like that, and x86 store-forwarding.


With only VOLATILEASM, the empty asm template ("") has to run the right number of times. Being empty, it doesn't add any instructions to the loop, so you're left with a 2-uop add / cmp+jne loop that can run at 1 iteration per clock on modern x86 CPUs.

Critically, the loop counter can stay in a register, despite the compiler memory barrier. A "memory" clobber is treated like a call to a non-inline function: it might read or modify any object that it could possibly have a reference to, but that does not include local variables that have never had their address escape the function (i.e. we never called sscanf("0", "%d", &i) or posix_memalign(&i, 64, 1234)). But if we had, the "memory" barrier would have to spill/reload i, because an external function could have saved a pointer to the object.

i.e. a "memory" clobber is only a full compiler barrier for objects that could possibly be visible outside the current function. This is really only an issue when messing around and looking at compiler output to see what barriers do what, because a barrier can only matter for multi-threading correctness for variables that other threads could possibly have a pointer to.
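
A minimal sketch of that escape rule (my own example, assuming gcc/clang on x86-64; sink() is a hypothetical external function the optimizer can't see into):

#include <cstddef>

void sink(std::size_t*);   // hypothetical external function, opaque to the optimizer

std::size_t no_escape(std::size_t n)
{
    std::size_t i = 0;                    // address never escapes: i can stay in a register
    while (i < n) {
        asm volatile("" : : : "memory");  // barrier, but i need not live in memory
        ++i;
    }
    return i;
}

std::size_t escapes(std::size_t n)
{
    std::size_t i = 0;
    sink(&i);                             // address escapes: sink() could have saved the pointer
    while (i < n) {
        asm volatile("" : : : "memory");  // now i has to be stored/reloaded around the barrier
        ++i;
    }
    return i;
}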

And BTW, your asm statement is already implicitly volatile because it has no output operands. (See Extended-Asm#Volatile in the gcc manual).
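
A quick sketch of that rule (my addition, assuming GCC-style extended asm):

void barrier_examples()
{
    asm("" : : : "memory");           // no output operands: implicitly volatile
    asm volatile("" : : : "memory");  // explicitly volatile: identical effect
}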

You can add a dummy output to make a non-volatile asm statement that the compiler can optimize away, but unfortunately gcc still keeps the empty loop after eliminating a non-volatile asm statement from it. If i's address has escaped the function, removing the asm statement entirely turns the loop into a single compare-and-jump over a store, right before the function returns. I think it would be legal to simply return without ever storing to that local, because there's no way a correct program can know that it managed to read i from another thread before i went out of scope.

But anyway, here's the source I used. As I said, note that there's always an asm statement here, and I'm controlling whether it's volatile or not.

#include <stdlib.h>
#include <stdio.h>

#ifndef VOLATILEVAR   // compile with -DVOLATILEVAR=volatile  to apply that
#define VOLATILEVAR
#endif

#ifndef VOLATILEASM  // Different from your def; yours drops the whole asm statement
#define VOLATILEASM
#endif

// note I ported this to also be valid C, but I didn't try -xc to compile as C.
size_t count(size_t n)
{
    int dummy;  // asm with no outputs is implicitly volatile
    VOLATILEVAR size_t i = 0;
    sscanf("0", "%zd", &i);
    while (i < n) {
        asm  VOLATILEASM ("nop # operand = %0": "=r"(dummy) : :"memory");
        ++i;
    }
    return i;
}

compiles (with gcc4.9 and newer at -O3, with neither VOLATILE macro enabled) to this weird asm (Godbolt compiler explorer with gcc and clang):

 # gcc8.1 -O3   with sscanf(.., &i) but non-volatile asm
 # the asm nop doesn't appear anywhere, but gcc is making clunky code.
.L8:
    mov     rdx, rax  # i, <retval>
.L3:                                        # first iter entry point
    lea     rax, [rdx+1]      # <retval>,
    cmp     rax, rbx  # <retval>, n
    jb      .L8 #,

Nice job, gcc.... gcc4.8 -O3 avoids pulling an extra mov inside the loop:

 # gcc4.8 -O3   with sscanf(.., &i) but non-volatile asm
.L3:
    add     rdx, 1    # i,
    cmp     rbx, rdx  # n, i
    ja      .L3 #,

    mov     rax, rdx  # i.0, i   # outside the loop

Anyway, without the dummy output operand, or with volatile, gcc8.1 gives us:

 # gcc8.1  with sscanf(&i) and asm volatile("nop" ::: "memory")
.L3:
    nop # operand = eax     # dummy
    mov     rax, QWORD PTR [rsp+8]    # tmp96, i
    add     rax, 1    # <retval>,
    mov     QWORD PTR [rsp+8], rax    # i, <retval>
    cmp     rax, rbx  # <retval>, n
    jb      .L3 #,

So we see the same store/reload of the loop counter; the only difference from volatile i is that the cmp doesn't need to reload it.

I used nop instead of just a comment because Godbolt hides comment-only lines by default, and I wanted to see it. For gcc, it's purely a text substitution: we're looking at the compiler's asm output with operands substituted into the template before it's sent to the assembler. For clang, there might be some effect because the asm has to be valid (ie actually assemble correctly).

If we comment out the scanf and remove the dummy output operand, we get a register-only loop with the nop in it. But if we keep the dummy output operand, the nop doesn't appear anywhere.
