Adding two double-precision floats in assembly language from C on a Raspberry Pi 4 running 64-bit Linux

Question

I am learning ARMv8 assembly language on my Raspberry Pi 4, and I want to know the simplest way to add two floating-point numbers while choosing which registers hold the operands.

I had hoped that this code would add the values stored in the variables d1 and d2 and then store the sum in the variable result.

#include <stdio.h>
#include <stdlib.h>
int
main()
{
        double d1 = 0.34543;
        double d2 = 1.0;
        double result = 0;
        asm volatile("ldr d1, %1\n\t"
                     "ldr d2, %2\n\t"
                     "fadd d2, d1, d2\n\t"
                     "str d2, %0": "=g" (result) : "g" (d1), "g" (d2)
                    );
        printf("%f + %f = %f", d1, d2, result);
}

Instead, when I run

gcc test.c

to compile the above code snippet, which I saved in test.c, I get the error:

/tmp/ccdcVUbH.s: Assembler messages:
/tmp/ccdcVUbH.s:31: Error: invalid addressing mode at operand 2 -- `str d2,x0'

When I change the code to this:

#include <stdio.h>
#include <stdlib.h>
int
main()
{
        double d1 = 0.34543;
        double d2 = 1.0;
        double result = 0;
        printf("%f + %f", d1, d2);
        asm volatile("ldr d1, %1\n\t"
                     "ldr d2, %2\n\t"
                     "fadd d2, d1, d2\n\t"
                     "str d2, %2": "=g" (result) : "g" (d1), "g" (d2)
                    );
        printf(" = %f", d2);
}

I am able to compile and run it and get the correct answer, but it troubles me that the first code snippet does not compile, and I would like to know why.

Answer 1

The g constraint, as the documentation explains, allows the compiler to insert into the asm a string that refers to a register (like x1), a memory reference ([x2] or [sp, 24] or the like), or even an immediate (#17). This is nice for CISC architectures, where there are instructions that can accept any of the above (e.g. x86 can do add %eax, %ebx or add 24(%rsp), %ebx or add $17, %ebx), but it is useless for a load-store RISC architecture like ARM, because there aren't any instructions where memory and registers can be used interchangeably. Arithmetic instructions (add, sub, and, and so on) operate only on registers, and load/store instructions (ldr / str) accept only memory references. That is what your error message shows: for %0 the compiler chose the register x0, and str d2, x0 is not a valid instruction.

If you're going to write ldr / str in your asm, then the corresponding operand needs to be a memory reference, which means using the m constraint.
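
To make the register/memory split concrete, here is a minimal sketch with integer operands. The function names add_regs and load_mem are made up for illustration, the #else branches are a plain C fallback that I added so the file also builds on other machines, and __asm__ is used only because that spelling is also accepted under -std=c11.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arithmetic works on registers only, so every operand uses "r". */
int64_t add_regs(int64_t a, int64_t b)
{
#if defined(__aarch64__)
    int64_t sum;
    __asm__("add %0, %1, %2" : "=r"(sum) : "r"(a), "r"(b));
    return sum;
#else
    return a + b;   /* fallback, not part of the answer above */
#endif
}

/* A load needs a memory operand, so the source uses "m"; the compiler
   substitutes an address expression such as [x0] or [sp, 24]. */
int64_t load_mem(const int64_t *p)
{
    assert(p != NULL);
#if defined(__aarch64__)
    int64_t value;
    __asm__("ldr %0, %1" : "=r"(value) : "m"(*p));
    return value;
#else
    return *p;      /* fallback, not part of the answer above */
#endif
}
```

Swapping either constraint for g would let the compiler pick an operand form that the instruction cannot encode, which is the failure in the question.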

Another issue is that when you modify an explicitly chosen register in your asm code, you need to notify the compiler of this by declaring a clobber. Otherwise the compiler may keep important data in that register and not know that it has been modified. This can lead to very subtle, unpredictable, and catastrophic bugs that may only show up under particular combinations of optimization options and surrounding code. It's one of the major pitfalls of inline assembly programming, and why many people say you should not use inline assembly at all unless there is an extremely good reason for it.

So, a corrected version would look like

asm ("ldr d1, %1\n\t"
     "ldr d2, %2\n\t"
     "fadd d2, d1, d2\n\t"
     "str d2, %0"
     : "=m" (result)
     : "m" (d1), "m" (d2)
     : "d1", "d2" // clobbers
    );
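
For anyone who wants to compile this, a sketch of the same statement wrapped in a function might look as follows. The name add_via_memory and the self-check helper are hypothetical, and the #else branch is a portable fallback I added for non-AArch64 builds; only the asm statement itself comes from the snippet above.

```c
#include <assert.h>

/* Adds two doubles through memory operands and hand-picked registers. */
double add_via_memory(double a, double b)
{
    double result;
#if defined(__aarch64__)
    __asm__("ldr d1, %1\n\t"
            "ldr d2, %2\n\t"
            "fadd d2, d1, d2\n\t"
            "str d2, %0"
            : "=m"(result)          /* output must live in memory for str */
            : "m"(a), "m"(b)        /* inputs must live in memory for ldr */
            : "d1", "d2");          /* registers the asm overwrites */
#else
    result = a + b;                 /* fallback, assumed for illustration */
#endif
    return result;
}

/* Hypothetical self-check; the values are exactly representable in
   binary floating point, so comparing with == is safe here. */
void check_add_via_memory(void)
{
    assert(add_via_memory(1.5, 2.25) == 3.75);
}
```

Calling add_via_memory(0.34543, 1.0) from main and printing the result with %f reproduces what the question was trying to do.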

By the way, volatile isn't needed for code that only computes outputs as a pure function of its inputs, without side effects on the machine's state. It inhibits the compiler from optimizing out the asm statement if its outputs are unused. But in this case, if you changed your code in such a way that result wasn't used anymore, it would be a good thing for the compiler to drop the dead asm code that computes it.
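
For contrast, here is a sketch of a case where volatile is warranted, because the output is not a function of the inputs. It reads the virtual counter register cntvct_el0, assuming the kernel permits user-space reads (Linux normally does). The function names are made up, and the clock() fallback is only there so the file builds on other architectures.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Reads a free-running counter. There are no inputs, so without
   volatile the compiler would be entitled to treat two reads as
   identical and merge them, or to drop one whose result is unused. */
uint64_t read_counter(void)
{
#if defined(__aarch64__)
    uint64_t ticks;
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    return (uint64_t)clock();   /* fallback, assumed for illustration */
#endif
}

/* Two reads in a row; the second should never be earlier than the first. */
uint64_t ticks_between_reads(void)
{
    uint64_t first = read_counter();
    uint64_t second = read_counter();
    assert(second >= first);
    return second - first;
}
```

The fadd statement is the opposite situation: same inputs, same output, so leaving volatile off gives the optimizer more freedom at no cost.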

Now the code works correctly, but it is still inefficient. You explicitly load your registers from memory, and this means the compiler needs to ensure that the values of those variables are actually in memory, even if they were already in a register before that. It ends up generating store instructions before the asm block, just so that you can do a load to get the same value right back! The same happens on the other end: you store to memory, and the compiler has to turn around and load again. It's a waste of instructions and memory bandwidth. See the generated asm, lines 11-13 and 15, 17.

The whole point of extended asm is that you specify constraints to tell the compiler where you really want the data, and it arranges everything accordingly. You don't really want the data in memory if you're going to do an fadd - you want it in registers. So tell the compiler that.

The constraint for an ARM64 floating-point or SIMD register is w. However, by default this will emit the v name of the register into the generated assembly (v0, v1, etc.), whereas you want d0, d1, which name the low 64 bits. You fix this with template modifiers. GCC doesn't explicitly document its support for these, as far as I know, but it does follow armclang's documentation as best I can tell. The d modifier is what we need here:

asm ("fadd %d0, %d1, %d2\n\t" 
     : "=w" (result) 
     : "w" (d1), "w" (d2)
    );

This way:

The code is much shorter.
You do not need to manually choose which three registers to use; the compiler chooses for you.
If the values are already in registers, the compiler can just choose the registers where they already are, avoiding unnecessary fmovs. If the values are in memory, the compiler will generate loads and stores, but only if needed. You'll never have redundant load/store combinations.
No clobbers are needed, because you don't modify any explicitly named registers, only the output operand %d0, and the compiler obviously can tell that you've modified it, because it's an output.

See the generated asm. Note that stack memory is indeed no longer used at all.
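
Put together as something you can compile, the register-constraint version might be sketched like this. The function names are hypothetical, and the #else branches are a plain C fallback I added for non-AArch64 machines. The float variant uses the s modifier, which selects the 32-bit s view of the register in the same way d selects the 64-bit view.

```c
#include <assert.h>

/* Double precision: "w" requests an FP/SIMD register, %dN prints it as dN. */
double add_double(double a, double b)
{
#if defined(__aarch64__)
    double sum;
    __asm__("fadd %d0, %d1, %d2" : "=w"(sum) : "w"(a), "w"(b));
    return sum;
#else
    return a + b;   /* fallback, assumed for illustration */
#endif
}

/* Single precision: same constraint, but %sN prints the register as sN. */
float add_float(float a, float b)
{
#if defined(__aarch64__)
    float sum;
    __asm__("fadd %s0, %s1, %s2" : "=w"(sum) : "w"(a), "w"(b));
    return sum;
#else
    return a + b;   /* fallback, assumed for illustration */
#endif
}

/* Hypothetical self-check with exactly representable values. */
void check_register_adds(void)
{
    assert(add_double(1.5, 2.25) == 3.75);
    assert(add_float(0.5f, 0.25f) == 0.75f);
}
```

Compiling with gcc -O2 -S and reading the output is a quick way to confirm which registers the compiler picked and that no loads or stores surround the fadd.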
