Understanding volatile asm vs volatile variable

We consider the following program, which just times a loop:

#include <cstdlib>

std::size_t count(std::size_t n)
{
#ifdef VOLATILEVAR
    volatile std::size_t i = 0;
#else
    std::size_t i = 0;
#endif
    while (i < n) {
#ifdef VOLATILEASM
        asm volatile("": : :"memory");
#endif
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}

For readability, the version with both the volatile variable and the volatile asm reads as follows:

#include <cstdlib>

std::size_t count(std::size_t n)
{
    volatile std::size_t i = 0;
    while (i < n) {
        asm volatile("": : :"memory");
        ++i;
    }
    return i;
}

int main(int argc, char* argv[])
{
    return count(argc > 1 ? std::atoll(argv[1]) : 1);
}

Compiling with g++ 8 using g++ -Wall -Wextra -g -std=c++11 -O3 loop.cpp -o loop gives roughly the following timings:

  • default: 0m0.001s
  • -DVOLATILEASM: 0m1.171s
  • -DVOLATILEVAR: 0m5.954s
  • -DVOLATILEVAR -DVOLATILEASM: 0m5.965s

My question is: why is that? The default version is expected, since the compiler optimizes the loop away. But I have a harder time understanding why -DVOLATILEVAR takes so much longer than -DVOLATILEASM, since both should force the loop to run.

Compiler Explorer gives the following count function for -DVOLATILEASM:

count(unsigned long):
  mov rax, rdi
  test rdi, rdi
  je .L2
  xor edx, edx
.L3:
  add rdx, 1
  cmp rax, rdx
  jne .L3
.L2:
  ret

and for -DVOLATILEVAR (and the combined -DVOLATILEASM -DVOLATILEVAR):

count(unsigned long):
  mov QWORD PTR [rsp-8], 0
  mov rax, QWORD PTR [rsp-8]
  cmp rdi, rax
  jbe .L2
.L3:
  mov rax, QWORD PTR [rsp-8]
  add rax, 1
  mov QWORD PTR [rsp-8], rax
  mov rax, QWORD PTR [rsp-8]
  cmp rax, rdi
  jb .L3
.L2:
  mov rax, QWORD PTR [rsp-8]
  ret

What is the exact reason for that? Why does the volatile qualification of the variable prevent the compiler from generating the same loop as the one with asm volatile?

When you make i volatile, you tell the compiler that something it doesn't know about can change its value. That means it is forced to load its value every time you use it, and it has to store it every time you write to it. When i is not volatile, the compiler can optimize that synchronization with memory away.

-DVOLATILEVAR forces the compiler to keep the loop counter in memory, so the loop bottlenecks on the latency of store/reload (store forwarding), ~5 cycles, plus the 1-cycle latency of an add.

Every assignment to and read from volatile int i is considered an observable side-effect of the program that the optimizer has to make happen in memory, not just in a register. This is what volatile means.
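To make the "every access happens in memory" rule concrete, here is a minimal sketch. The function names are hypothetical and not from the question. Compiling it at -O3 and looking at the asm should show one load in the plain version and two separate loads in the volatile one, although the exact instructions depend on the compiler and target.

```cpp
// Non-volatile: the compiler is free to load x once and reuse the value.
int read_twice_plain(const int& x)
{
    int a = x;
    int b = x;
    return a + b;
}

// Volatile: each read of x is an observable access, so two loads are expected.
int read_twice_volatile(const volatile int& x)
{
    int a = x;   // first load
    int b = x;   // second load, not merged with the first
    return a + b;
}

// Volatile read-modify-write: one load and one store per call,
// which is what ++i does on every iteration of the volatile loop.
void bump(volatile int& x)
{
    x = x + 1;
}
```

Both read functions return the same value; the difference is only visible in the generated code and in how long it takes to run.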

There's also a reload for the compare, but that's only a throughput issue, not latency. The ~6 cycle loop-carried data dependency means your CPU doesn't bottleneck on any throughput limits.

This is similar to what you'd get from -O0 compiler output, so have a look at my answer on "Adding a redundant assignment speeds up code when compiled without optimization" for more about loops like that, and x86 store-forwarding.


With only VOLATILEASM, the empty asm template ("") has to run the right number of times. Being empty, it doesn't add any instructions to the loop, so you're left with a 2-uop add / cmp+jne loop that can run at 1 iteration per clock on modern x86 CPUs.
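As a rough sanity check, the two estimates above (about 6 cycles per iteration with the volatile counter, about 1 with the register-only loop) can be compared against the timings from the question. The sketch below only does that arithmetic; the helper names are made up for illustration. The measured ratio comes out at about 5.1, in the same ballpark as the predicted 6. The exact store-forwarding latency depends on the microarchitecture, so a perfect match is not expected.

```cpp
// Timings copied from the question, in seconds.
constexpr double t_volatile_var = 5.954;  // -DVOLATILEVAR
constexpr double t_asm_only     = 1.171;  // -DVOLATILEASM

// Ratio of the measured run times (same n, same machine).
constexpr double measured_ratio()
{
    return t_volatile_var / t_asm_only;
}

// Ratio predicted from cycles per iteration of each loop.
constexpr double predicted_ratio(double cycles_slow, double cycles_fast)
{
    return cycles_slow / cycles_fast;
}
```

For example, predicted_ratio(6.0, 1.0) is the figure implied by the cycle counts quoted in this answer.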

Critically, the loop counter can stay in a register, despite the compiler memory barrier. A "memory" clobber is treated like a call to a non-inline function: it might read or modify any object that it might possibly have a reference to, but that does not include local variables that have never had their address escape the function. (i.e. we never called sscanf("0", "%d", &i) or posix_memalign(&i, 64, 1234). But if we did, then the "memory" barrier would have to spill / reload it, because an external function could have saved a pointer to the object.)

i.e. a "memory" clobber is only a full compiler barrier for objects that could possibly be visible outside the current function. This is really only an issue when messing around and looking at compiler output to see what barriers do what, because a barrier can only matter for multi-threading correctness for variables that other threads could possibly have a pointer to.
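A minimal sketch of that escape rule, assuming gcc or clang (GNU extended asm is not standard C++). The global g_escape and both function names are hypothetical. In the first function the counter's address never leaves the function, so it would be expected to stay in a register. In the second, the address is published through a global, so the barrier would be expected to force a store and reload on every iteration.

```cpp
#include <cstddef>

std::size_t* g_escape = nullptr;  // hypothetical global used to leak &i

// Address of i never escapes: the "memory" clobber does not cover it.
std::size_t count_private(std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        asm volatile("" : : : "memory");
        ++i;
    }
    return i;
}

// Address of i escapes: the asm statement might read or write i via g_escape,
// so the compiler has to keep the in-memory copy up to date around it.
std::size_t count_escaped(std::size_t n)
{
    std::size_t i = 0;
    g_escape = &i;
    while (i < n) {
        asm volatile("" : : : "memory");
        ++i;
    }
    g_escape = nullptr;  // do not leave a dangling pointer behind
    return i;
}
```

Both functions return n; comparing their asm at -O3 is the interesting part.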

And BTW, your asm statement is already implicitly volatile because it has no output operands. (See Extended-Asm#Volatile in the gcc manual.)

You can add a dummy output to make a non-volatile asm statement the compiler can optimize away, but unfortunately gcc still keeps the empty loop after eliminating a non-volatile asm statement from it. If i's address has escaped the function, removing the asm statement entirely turns the loop into a single compare and jump over a store, right before the function returns. I think it would be legal to simply return without ever storing to that local, because there's no way a correct program can know that it managed to read i from another thread before i went out of scope.

But anyway, here's the source I used. As I said, note that there's always an asm statement here, and I'm controlling whether it's volatile or not.

#include <stdlib.h>
#include <stdio.h>

#ifndef VOLATILEVAR   // compile with -DVOLATILEVAR=volatile  to apply that
#define VOLATILEVAR
#endif

#ifndef VOLATILEASM  // Different from your def; yours drops the whole asm statement
#define VOLATILEASM
#endif

// note I ported this to also be valid C, but I didn't try -xc to compile as C.
size_t count(size_t n)
{
    int dummy;  // asm with no outputs is implicitly volatile
    VOLATILEVAR size_t i = 0;
    sscanf("0", "%zu", &i);
    while (i < n) {
        asm  VOLATILEASM ("nop # operand = %0": "=r"(dummy) : :"memory");
        ++i;
    }
    return i;
}

It compiles (with gcc4.9 and newer at -O3, with neither VOLATILE macro enabled) to this weird asm (Godbolt compiler explorer with gcc and clang):

 # gcc8.1 -O3   with sscanf(.., &i) but non-volatile asm
 # the asm nop doesn't appear anywhere, but gcc is making clunky code.
.L8:
    mov     rdx, rax  # i, <retval>
.L3:                                        # first iter entry point
    lea     rax, [rdx+1]      # <retval>,
    cmp     rax, rbx  # <retval>, n
    jb      .L8 #,

Nice job, gcc.... gcc4.8 -O3 avoids pulling an extra mov inside the loop:

 # gcc4.8 -O3   with sscanf(.., &i) but non-volatile asm
.L3:
    add     rdx, 1    # i,
    cmp     rbx, rdx  # n, i
    ja      .L3 #,

    mov     rax, rdx  # i.0, i   # outside the loop

Anyway, without the dummy output operand, or with volatile, gcc8.1 gives us:

 # gcc8.1  with sscanf(&i) and asm volatile("nop" ::: "memory")
.L3:
    nop # operand = eax     # dummy
    mov     rax, QWORD PTR [rsp+8]    # tmp96, i
    add     rax, 1    # <retval>,
    mov     QWORD PTR [rsp+8], rax    # i, <retval>
    cmp     rax, rbx  # <retval>, n
    jb      .L3 #,

So we see the same store/reload of the loop counter; the only difference from volatile i is that the cmp doesn't need to reload it.

I used nop instead of just a comment because Godbolt hides comment-only lines by default, and I wanted to see it. For gcc, it's purely a text substitution: we're looking at the compiler's asm output with operands substituted into the template before it's sent to the assembler. For clang, there might be some effect because the asm has to be valid (i.e. actually assemble correctly).
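The text-substitution behaviour can be seen with a comment-only template. This is a small sketch assuming gcc or clang targeting x86-64, where # starts an assembler comment; tag_register is a hypothetical name. Compiling with -S should show the comment with %0 replaced by whichever register the compiler picked for x.

```cpp
// The template is only an assembler comment, so no instruction is emitted.
// %0 is replaced textually by the register chosen for the input operand.
int tag_register(int x)
{
    asm("# x is held in %0" : : "r"(x));  // no outputs, so implicitly volatile
    return x;
}
```

The function returns its argument unchanged; the asm statement exists only to leave a marker in the compiler's output.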

If we comment out the sscanf and remove the dummy output operand, we get a register-only loop with the nop in it. But keep the dummy output operand and the nop doesn't appear anywhere.
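The contrast in that last paragraph can be reduced to a pair of functions, sketched here under the assumption of gcc or clang on a target where nop is a valid instruction. The function names are hypothetical. In the first, the asm has an unused output and is not volatile, so the compiler is allowed to drop it. In the second, the asm is volatile and has to appear once per iteration.

```cpp
#include <cstddef>

// Non-volatile asm with an unused dummy output: the statement may be deleted.
std::size_t count_removable_asm(std::size_t n)
{
    int dummy;  // never read; only there to give the asm an output operand
    std::size_t i = 0;
    while (i < n) {
        asm("nop" : "=r"(dummy) : : "memory");
        ++i;
    }
    return i;
}

// Volatile asm: the nop has to be executed n times.
std::size_t count_kept_asm(std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        asm volatile("nop" : : : "memory");
        ++i;
    }
    return i;
}
```

Both return n, so any difference shows up only in the generated loop body and in the run time.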
