What is the difference between 'asm', '__asm' and '__asm__'?

As far as I can tell, the only difference between __asm { ... }; and __asm__("..."); is that the first uses mov eax, var and the second uses movl %0, %%eax with : "=r" (var) at the end. What other differences are there? And what about just asm?

There's a massive difference between MSVC inline asm and GNU C inline asm. GCC syntax is designed for optimal output without wasted instructions, for example when wrapping a single instruction. MSVC syntax is designed to be fairly simple, but AFAICT it's impossible to use without the latency and extra instructions of a round trip through memory for your inputs and outputs.

If you're using inline asm for performance reasons, this makes MSVC inline asm only viable if you write a whole loop entirely in asm, not for wrapping short sequences in an inline function. The example below (wrapping idiv with a function) is the kind of thing MSVC is bad at: ~8 extra store/load instructions.

MSVC inline asm (used by MSVC and probably icc, maybe also available in some commercial compilers):

  • looks at your asm to figure out which registers your code steps on.
  • can only transfer data via memory. Data that was live in registers is stored by the compiler to prepare for your mov ecx, shift_count, for example. So using a single asm instruction that the compiler won't generate for you involves a round trip through memory on the way in and on the way out.
  • more beginner-friendly, but often impossible to avoid the overhead of getting data in and out. Even besides the syntax limitations, the optimizer in current versions of MSVC isn't good at optimizing around inline asm blocks, either.

GNU C inline asm is not a good way to learn asm. You have to understand asm very well so you can tell the compiler about your code. And you have to understand what compilers need to know. That answer also has links to other inline-asm guides and Q&As. The tag wiki has lots of good stuff for asm in general, but just links to that for GNU inline asm. (The stuff in that answer is applicable to GNU inline asm on non-x86 platforms, too.)

GNU C inline asm syntax is used by gcc, clang, icc, and maybe some commercial compilers which implement GNU C:

  • You have to tell the compiler what you clobber. Failure to do this will lead to breakage of surrounding code in non-obvious, hard-to-debug ways.
  • Powerful but hard to read, learn, and use syntax for telling the compiler how to supply inputs, and where to find outputs. e.g. "c" (shift_count) will get the compiler to put the shift_count variable into ecx before your inline asm runs.
  • extra clunky for large blocks of code, because the asm has to be inside a string constant. So you typically need

    "insn   %[inputvar], %%reg\n\t"       // comment
    "insn2  %%reg, %[outputvar]\n\t"
  • very unforgiving / harder, but allows lower overhead, esp. for wrapping single instructions. (Wrapping single instructions was the original design intent, which is why you have to specially tell the compiler about early clobbers to stop it from using the same register for an input and output if that's a problem.)
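
To make the points in the list above concrete, here is a minimal sketch of two GNU C wrappers. The function names rotl32 and add_then_double are made up for illustration and are not from the answer. The asm templates assume the default AT&T syntax (no -masm=intel), and a plain-C fallback is included so the file still builds on non-x86 targets. The first wrapper uses the "c" constraint to get a shift count into ecx. The second uses a multi-line template and an early-clobber ("=&r") output, which is needed because the output register is written before the last input is read.

```c
#include <stdint.h>

/* Rotate x left by count bits, wrapping a single x86 `rol` instruction. */
static inline uint32_t rotl32(uint32_t x, uint32_t count)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("roll %%cl, %[val]"   /* AT&T order: count in %cl, then destination */
             : [val] "+r" (x)      /* "+r": read-write operand, any register */
             : "c" (count)         /* "c": compiler puts count in ecx before the asm */
             : "cc");              /* rol modifies the flags */
    return x;
#else
    /* Portable fallback with the same result. */
    count &= 31;
    return (x << count) | (x >> ((32 - count) & 31));
#endif
}

/* Compute (a + b) * 2 with a three-instruction template. */
static inline uint32_t add_then_double(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t out;
    __asm__ ("movl %[a], %[out]\n\t"    /* out is written here ...            */
             "addl %[b], %[out]\n\t"    /* ... before b is read here          */
             "addl %[out], %[out]"
             : [out] "=&r" (out)        /* "&": early clobber, so out never
                                           shares a register with a or b      */
             : [a] "r" (a), [b] "r" (b)
             : "cc");
    return out;
#else
    return (a + b) * 2;
#endif
}
```

Without the "&" on the output, the compiler would be free to pick the same register for out and b, and the first movl would then destroy b before the addl reads it.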


Example: full-width integer division (div)

On a 32bit CPU, dividing a 64bit integer by a 32bit integer, or doing a full-multiply (32x32->64), can benefit from inline asm. gcc and clang don't take advantage of idiv for (int64_t)a / (int32_t)b, probably because the instruction faults if the result doesn't fit in a 32bit register. So unlike this Q&A about getting quotient and remainder from one div, this is a use-case for inline asm. (Unless there's a way to inform the compiler that the result will fit, so idiv won't fault.)
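
For reference, the operation being wrapped can be written in portable C as in the sketch below. The name div64_ref is hypothetical and not part of the original answer. It assumes the quotient fits in 32 bits, which is the same condition under which idiv does not fault, and it relies on the usual two's-complement behaviour when converting an out-of-range unsigned value to a signed type (implementation-defined in ISO C, modular in gcc and clang).

```c
#include <stdint.h>

/* Divide the 64-bit value hi:lo by a 32-bit divisor.
   Returns the quotient and stores the remainder through premainder.
   Assumes divisor != 0 and that the quotient fits in an int32_t. */
static int32_t div64_ref(int32_t hi, uint32_t lo, int32_t divisor,
                         int32_t *premainder)
{
    /* Reassemble the 64-bit dividend from its two 32-bit halves. */
    int64_t dividend = (int64_t)(((uint64_t)(uint32_t)hi << 32) | lo);

    *premainder = (int32_t)(dividend % divisor);
    return (int32_t)(dividend / divisor);
}
```

A version like this is also handy as a cross-check when testing an inline-asm implementation.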

We'll use calling conventions that put some args in registers (with hi even in the right register), to show a situation that's closer to what you'd see when inlining a tiny function like this.


MSVC

Be careful with register-arg calling conventions when using inline asm. Apparently the inline-asm support is so badly designed/implemented that the compiler might not save/restore arg registers around the inline asm, if those args aren't used in the inline asm. Thanks @RossRidge for pointing this out.

// MSVC.  Be careful with _vectorcall & inline-asm: see above
// we could return a struct, but that would complicate things
int _vectorcall div64(int hi, int lo, int divisor, int *premainder) {
    int quotient, tmp;
    __asm {
        mov   edx, hi;
        mov   eax, lo;
        idiv   divisor
        mov   quotient, eax
        mov   tmp, edx;
        // mov ecx, premainder   // Or this I guess?
        // mov   [ecx], edx
    }
    *premainder = tmp;
    return quotient;     // or omit the return with a value in eax
}

Update: apparently leaving a value in eax or edx:eax and then falling off the end of a non-void function (without a return) is supported, even when inlining. I assume this works only if there's no code after the asm statement. See "Does __asm{}; return the value of eax?" This avoids the store/reloads for the output (at least for quotient), but we can't do anything about the inputs. In a non-inline function with stack args, they will be in memory already, but in this use-case we're writing a tiny function that could usefully inline.


Compiled with MSVC 19.00.23026 /O2 on rextester (with a main() that finds the directory of the exe and dumps the compiler's asm output to stdout).

## My added comments are marked with ##
; ... define some symbolic constants for stack offsets of parameters
; 48   : int ABI div64(int hi, int lo, int divisor, int *premainder) {
    sub esp, 16                 ; 00000010H
    mov DWORD PTR _lo$[esp+16], edx      ## these symbolic constants match up with the names of the stack args and locals
    mov DWORD PTR _hi$[esp+16], ecx

    ## start of __asm {
    mov edx, DWORD PTR _hi$[esp+16]
    mov eax, DWORD PTR _lo$[esp+16]
    idiv    DWORD PTR _divisor$[esp+12]
    mov DWORD PTR _quotient$[esp+16], eax  ## store to a local temporary, not *premainder
    mov DWORD PTR _tmp$[esp+16], edx
    ## end of __asm block

    mov ecx, DWORD PTR _premainder$[esp+12]
    mov eax, DWORD PTR _tmp$[esp+16]
    mov DWORD PTR [ecx], eax               ## I guess we should have done this inside the inline asm so this would suck slightly less
    mov eax, DWORD PTR _quotient$[esp+16]  ## but this one is unavoidable
    add esp, 16                 ; 00000010H
    ret 8

There's a ton of extra mov instructions, and the compiler doesn't even come close to optimizing any of it away. I thought maybe it would see and understand the mov tmp, edx inside the inline asm, and make that a store to premainder. But that would require loading premainder from the stack into a register before the inline asm block, I guess.

This function is actually worse with _vectorcall than with the normal everything-on-the-stack ABI. With two inputs in registers, it stores them to memory so the inline asm can load them from named variables. If this were inlined, even more of the parameters could potentially be in registers, and it would have to store them all, so the asm would have memory operands! So unlike gcc, we don't gain much from inlining this.

Doing *premainder = tmp inside the asm block means more code written in asm, but does avoid the totally braindead store/load/store path for the remainder. This reduces the instruction count by 2 total, down to 11 (not including the ret).

I'm trying to get the best possible code out of MSVC, not "use it wrong" and create a straw-man argument. But AFAICT it's horrible for wrapping very short sequences. Presumably there's an intrinsic function for 64/32 -> 32 division that allows the compiler to generate good code for this particular case, so the entire premise of using inline asm for this on MSVC could be a straw-man argument. But it does show you that intrinsics are much better than inline asm for MSVC.


GNU C (gcc/clang/icc)

gcc does even better than the output shown here when inlining div64, because it can typically arrange for the preceding code to generate the 64bit integer in edx:eax in the first place.

I can't get gcc to compile for the 32bit vectorcall ABI. Clang can, but it sucks at inline asm with "rm" constraints (try it on the godbolt link: it bounces the function arg through memory instead of using the register option in the constraint). The 64bit MS calling convention is close to the 32bit vectorcall, with the first two params in ecx and edx. The difference is that 2 more params go in regs before using the stack (and that the callee doesn't pop the args off the stack, which is what the ret 8 was about in the MSVC output.)

// GNU C
// change everything to int64_t to do 128b/64b -> 64b division
// MSVC doesn't do x86-64 inline asm, so we'll use 32bit to be comparable
int div64(int lo, int hi, int *premainder, int divisor) {
    int quotient, rem;
    asm ("idivl  %[divsrc]"
          : "=a" (quotient), "=d" (rem)    // a means eax,  d means edx
          : "d" (hi), "a" (lo),
            [divsrc] "rm" (divisor)        // Could have just used %4 (outputs are %0-%1, inputs %2-%4) instead of naming divsrc
            // note the "rm" to allow the src to be in a register or not, whatever gcc chooses.
            // "rmi" would also allow an immediate, but unlike adc, idiv doesn't have an immediate form
          : // no clobbers
        );
    *premainder = rem;
    return quotient;
}

Compiled with gcc -m64 -O3 -mabi=ms -fverbose-asm. With -m32 you just get 3 loads, idiv, and a store, as you can see from changing stuff in that godbolt link.

mov     eax, ecx  # lo, lo
idivl  r9d      # divisor
mov     DWORD PTR [r8], edx       # *premainder_7(D), rem
ret

For 32bit vectorcall, gcc would do something like

## Not real compiler output, but probably similar to what you'd get
mov     eax, ecx               # lo, lo
mov     ecx, [esp+12]          # premainder
idivl   [esp+16]               # divisor
mov     DWORD PTR [ecx], edx   # *premainder_7(D), rem
ret   8

MSVC uses 13 instructions (not including the ret), compared to gcc's 4. With inlining, as I said, it potentially compiles to just one, while MSVC would still use probably 9. (It won't need to reserve stack space or load premainder; I'm assuming it still has to store about 2 of the 3 inputs. Then it reloads them inside the asm, runs idiv, stores two outputs, and reloads them outside the asm. So that's 4 loads/stores for input, and another 4 for output.)
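
As a rough illustration of the inlining case, the sketch below pairs a static inline copy of the GNU C wrapper with a hypothetical caller, muldiv, that computes a * b / c through a 64bit intermediate. The muldiv name, the fixed-width types, and the non-x86 fallback are additions for this sketch, not part of the original answer. Whether a given compiler really keeps the product in edx:eax and emits only imul plus idiv is something to confirm in its asm output.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Same operand layout as the div64 shown earlier: hi:lo in edx:eax. */
static inline int32_t div64(int32_t lo, int32_t hi,
                            int32_t *premainder, int32_t divisor)
{
    int32_t quotient, rem;
    __asm__ ("idivl %[divsrc]"
             : "=a" (quotient), "=d" (rem)
             : "d" (hi), "a" (lo), [divsrc] "rm" (divisor)
             : "cc");
    *premainder = rem;
    return quotient;
}
#else
/* Portable fallback so the sketch builds on other targets. */
static inline int32_t div64(int32_t lo, int32_t hi,
                            int32_t *premainder, int32_t divisor)
{
    int64_t dividend =
        (int64_t)(((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo);
    *premainder = (int32_t)(dividend % divisor);
    return (int32_t)(dividend / divisor);
}
#endif

/* Hypothetical caller: a * b / c without losing the high half of a * b.
   The caller must ensure c != 0 and that the quotient fits in 32 bits,
   otherwise idiv raises a divide error. */
static int32_t muldiv(int32_t a, int32_t b, int32_t c, int32_t *premainder)
{
    int64_t product = (int64_t)a * b;          /* 32x32 -> 64 multiply */
    return div64((int32_t)product,             /* low half  */
                 (int32_t)(product >> 32),     /* high half */
                 premainder, c);
}
```

Because the constraints tell the compiler exactly which registers hold the inputs and outputs, it is free to schedule the multiply so that its result is already where idiv expects it.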

Which one you use depends on your compiler. This isn't standardized the way the C language is.

asm vs __asm__ in GCC

asm does not work with -std=c99, so you have two alternatives:

  • use __asm__
  • use -std=gnu99

More details: error: 'asm' undeclared (first use in this function)
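
A small sketch of the first alternative: the function below uses the __asm__ spelling, so it should build under both -std=c99 and -std=gnu99 with gcc or clang. The function name opaque is made up for this example. The template is empty, so no instruction is emitted and the code is not tied to x86.

```c
#include <stdint.h>

/* Returns x unchanged, but hides its value from the optimizer.
   Written with __asm__ rather than asm so it also compiles
   in strict ISO modes such as -std=c99. */
static inline uint32_t opaque(uint32_t x)
{
    /* Empty template: nothing is emitted, but the "+r" operand tells the
       compiler that x may have been modified in a register. */
    __asm__ volatile ("" : "+r" (x));
    return x;
}
```

Replacing __asm__ with asm here is expected to fail under -std=c99, where asm is not reserved as a keyword.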

__asm vs __asm__ in GCC

I could not find where __asm is documented (notably, it is not mentioned at https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/Alternate-Keywords.html#Alternate-Keywords ), but in the GCC 8.1 source they are exactly the same:

  { "__asm",        RID_ASM,    0 },
  { "__asm__",      RID_ASM,    0 },

so I would just use __asm__, which is documented.

With the gcc compiler, there isn't a big difference. asm, __asm, and __asm__ are the same; the underscore spellings exist to avoid namespace conflicts (for example, a user-defined function named asm).
