Setting a float/double to a constant value in AT&T inline assembly

Question

I'm looking at increasing the runtime performance of a C++ library that I have written and profiled. I'm very new to assembly (and inline assembly) and have a very basic question to ask.

How do I set the value of a SIMD register (xmm, ymm, zmm, etc.) to a constant float or double value using inline assembly? I strongly prefer not to use GCC's extended assembly, to make the code more portable to MSVC. When compiling with -S, I see that GCC uses a .data section; however, I don't think I can use that in inline code.

For simplicity, let's say I want to implement the foo function in the following code:

#include <cstdio>

void foo(double *val);
int main(int argc, char **argv) {
   double val = 0.0;

   foo(&val);
   printf("val: %lf\n", val);
   return 0;
}

void foo(double *val) {
   // return *val + 1.0.
   __asm__ (
      "movq -8(%rbp), %rax\n\t"   // move pointer from stack to rax.
      "movq (%rax), %xmm1\n\t"    // dereference pointer and move to xmm1.
      "?????????????"             // somehow move 1.0 to xmm0.
      "addsd %xmm1, %xmm0\n\t"    // add xmm1 to xmm0.
      "movsd %xmm0, (%rax)\n\t"   // move result back to val.
   );
}

I have tried using push $0x3ff0000000000000 and pushq $0x3ff0000000000000 to move the value to the stack and then potentially move it to xmm0, with the following results:

"pushq $0x3ff0000000000000\n\t" = "Error: operand type mismatch for `push'".

"push $0x3ff00000\n\t" = Segmentation fault at this instruction.

Any help would be appreciated, and thanks in advance for your time.

Answer 1

You can't make your inline assembly code portable to Microsoft's C/C++ compiler, for two reasons. The first is that the syntax for asm statements is too different. Microsoft's compiler expects something like __asm { mov rax, [rbp - 8] } instead of asm("movq -8(%rbp), %rax\n\t"). The second is that Microsoft's 64-bit compilers don't support inline assembly at all.

So you might as well do it right and use GCC's extended syntax. As it stands, your inline assembly is horribly fragile. You can't depend on val being located at -8(%rbp). The compiler might not even put it on the stack. You also can't assume that the compiler won't mind you trashing RAX, XMM0 and XMM1.

So to do it right you need to tell the compiler what variables you want to use and what registers you're trashing. Plus, you can let the compiler handle loading 1.0 into an XMM register. Something like this:

asm ("movq (%0), %%xmm1\n\t"
     "addsd %1, %%xmm1\n\t"
     "movsd %%xmm1, (%0)\n\t"
     : /* no output operands */
     : "r" (val), "x" (1.0)
     : "xmm1", "memory");

The "r" (val) input operand tells the compiler to put val into a general purpose register and then substitute that register name for %0 wherever it appears in the string. Similarly, the "x" (1.0) operand tells the compiler to put 1.0 into an XMM register and substitute it for %1. The clobbers tell the compiler that the statement modifies the XMM1 register along with something in memory. You might also notice that I've swapped the operands on ADDSD so that only one register is modified by the statement.

And here's the generated assembly when I compile it with the version of GCC I have installed on my computer:
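For reference, here is a minimal sketch of the same statement wrapped in a complete function. It assumes GCC or Clang targeting x86-64; the function name add_one and the operand names ptr and one are hypothetical, chosen only for illustration. Named operands ([ptr], [one]) are used in place of %0 and %1, which tends to be easier to read once a statement has more than a couple of operands.

```cpp
// Sketch only: assumes GCC or Clang on x86-64 (AT&T syntax).
// add_one, ptr and one are illustrative names, not from the original code.
void add_one(double *val) {
    __asm__ volatile (
        "movsd (%[ptr]), %%xmm1\n\t"   // load *val into xmm1
        "addsd %[one], %%xmm1\n\t"     // xmm1 += 1.0 (register chosen by the compiler)
        "movsd %%xmm1, (%[ptr])\n\t"   // store the sum back to *val
        : /* no output operands */
        : [ptr] "r" (val), [one] "x" (1.0)
        : "xmm1", "memory");           // xmm1 and memory are modified
}
```

Because xmm1 is listed as a clobber, the compiler will not place the 1.0 input in that register.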

foo:
    pushq   %rbp
    movq    %rsp, %rbp
    movq    %rcx, 16(%rbp)
    movq    16(%rbp), %rax
    movsd   .LC2(%rip), %xmm0

/APP
    movq (%rax), %xmm1
    addsd %xmm0, %xmm1
    movsd %xmm1, (%rax)
/NO_APP

    popq    %rbp
    ret

.LC2:
    .long   0
    .long   1072693248

Looks like my version of GCC decided to store val in 16(%rbp) instead of -8(%rbp). Your code wasn't even portable to other versions of GCC, let alone Microsoft's compiler. Let's look at what I get when I compile it with optimization turned on:

foo:
    movsd   .LC0(%rip), %xmm0

/APP
    movq (%rcx), %xmm1
    addsd %xmm0, %xmm1
    movsd %xmm1, (%rcx)
/NO_APP

    ret

Look how short and sweet that function is. The compiler has eliminated all the unnecessary boilerplate code that sets up the stack frame. Also, since val is passed to the function in RCX (this build uses the Windows x64 calling convention; a System V target would pass it in RDI), the compiler just uses that register in the inline assembly directly. No need to store it on the stack only to immediately load it back into another register.

Of course, just like your own code, none of this is remotely compatible with Microsoft's compiler. The only way to make it compatible is not to use inline assembly at all. Fortunately that's an option, and I don't just mean using *val + 1.0. To do this you need to use Intel's intrinsics, which are supported by GCC and Microsoft C/C++, along with Clang and Intel's own compiler. Here's an example:

#include <emmintrin.h>

void foo(double *val) {
    __m128d a = _mm_load_sd(val);
    const double c = 1.0;
    __m128d b = _mm_load_sd(&c);
    a = _mm_add_sd(a, b);
    _mm_store_sd(val, a);
}

My compiler does something hideous with this when compiling without optimization, but here's what it looks like with optimization:

foo:
    movsd   (%rcx), %xmm0
    addsd   .LC0(%rip), %xmm0
    movlpd  %xmm0, (%rcx)
    ret

The compiler is smart enough to know that it can use the 1.0 constant stored in memory directly in the ADDSD instruction.
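As a small variation on the intrinsics version, the constant can also be built with _mm_set_sd, which avoids the named const double and the explicit load. This is a sketch under the same assumption (an x86 compiler providing emmintrin.h); the function name foo_set is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Sketch only: foo_set is an illustrative name.
void foo_set(double *val) {
    __m128d a   = _mm_load_sd(val);   // low lane = *val, high lane = 0
    __m128d one = _mm_set_sd(1.0);    // low lane = 1.0, high lane = 0
    a = _mm_add_sd(a, one);           // add the low lanes only
    _mm_store_sd(val, a);             // write the low lane back to *val
}
```

An optimizing compiler is free to turn the constant into a memory operand or a register load as it sees fit, so the generated code may or may not match the listing above.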

Answer 2

If anyone is interested in the exact answer to my question, I'm also posting it here, since I somehow managed to figure it out with sheer luck and trial and error. The whole point of this was to learn simple assembly.

void foo(double *in) {
   __asm__ (
      "movq -8(%rbp), %rax\n\t"
      "movq (%rax), %xmm1\n\t"
      "movq $0x3FF0000000000000, %rbx\n\t" 
      "movq %rbx, %xmm0\n\t"
      "addsd %xmm1, %xmm0\n\t"
      "movsd %xmm0, (%rax)\n\t"
   );
}
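Note that the version above still assumes the pointer lives at -8(%rbp) and overwrites RBX (a callee-saved register), RAX, XMM0 and XMM1 without telling the compiler. A hedged sketch of the same idea, moving the bit pattern of 1.0 through a general purpose register while declaring the operands and clobbers, might look like this. It assumes GCC or Clang on x86-64; the name foo_imm is hypothetical.

```cpp
// Sketch only: assumes GCC or Clang on x86-64 (AT&T syntax).
// foo_imm is an illustrative name.
void foo_imm(double *val) {
    __asm__ volatile (
        "movabsq $0x3FF0000000000000, %%rax\n\t"  // bit pattern of 1.0 as a 64-bit immediate
        "movq %%rax, %%xmm0\n\t"                  // copy the bits into xmm0
        "addsd (%0), %%xmm0\n\t"                  // xmm0 += *val
        "movsd %%xmm0, (%0)\n\t"                  // store the result to *val
        : /* no output operands */
        : "r" (val)
        : "rax", "xmm0", "memory");               // registers and memory we modify
}
```

Listing rax as a clobber keeps the compiler from choosing it for the pointer operand, so the statement no longer depends on where val happens to be stored.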

Answer 1: 0, accepted, 2015-06-02 03:31:52

Answer 2: 0, 2015-06-02 04:41:59
