Adding two double precision floats in assembly language in C on a Raspberry Pi 4 with 64 bit Linux

Question

I am learning ARMv8 assembly language on my Raspberry Pi 4, and I want to know the simplest way that I can add two floats whilst choosing which registers I use to store the operands.

I had hoped that this code would add the values stored in variables d1 and d2 and then store the sum in the variable result.

#include <stdio.h>
#include <stdlib.h>
int
main()
{
        double d1 = 0.34543;
        double d2 = 1.0;
        double result = 0;
        asm volatile("ldr d1, %1\n\t"
                     "ldr d2, %2\n\t"
                     "fadd d2, d1, d2\n\t"
                     "str d2, %0": "=g" (result) : "g" (d1), "g" (d2)
                    );
        printf("%f + %f = %f", d1, d2, result);
}

Instead, when I run

gcc test.c

to compile the above code snippet, which I saved in test.c, I get the error:

/tmp/ccdcVUbH.s: Assembler messages:
/tmp/ccdcVUbH.s:31: Error: invalid addressing mode at operand 2 -- `str d2,x0'

When I change the code to this:

#include <stdio.h>
#include <stdlib.h>
int
main()
{
        double d1 = 0.34543;
        double d2 = 1.0;
        double result = 0;
        printf("%f + %f", d1, d2);
        asm volatile("ldr d1, %1\n\t"
                     "ldr d2, %2\n\t"
                     "fadd d2, d1, d2\n\t"
                     "str d2, %2": "=g" (result) : "g" (d1), "g" (d2)
                    );
        printf(" = %f", d2);
}

I am able to compile and run it and get the correct answer, but it troubles me that the first code snippet does not compile, and I would like to know why.

Answer 1

The g constraint, as the documentation explains, allows the compiler to insert into the asm a string that refers to a register (like x1), a memory reference ([x2] or [sp, 24] or the like), or even an immediate (#17). This is nice for CISC architectures, where some instructions can accept any of the above (e.g. x86 can do add %eax, %ebx or add 24(%rsp), %ebx or add $17, %ebx), but it is useless for a load-store RISC architecture like ARM, because there aren't any instructions where memory and registers can be used interchangeably. Arithmetic instructions like add, sub, and fadd only operate on registers, and load/store instructions (ldr / str) only accept memory references. That is what went wrong in your first snippet: the compiler satisfied "=g" (result) with the register x0, so the template expanded to str d2, x0, which is not a valid store.

If you're going to write ldr / str in your asm, then the corresponding operand needs to be a memory reference, which means using the m constraint.

Another issue is that when you modify an explicitly chosen register in your asm code, you need to notify the compiler of this by declaring a clobber. Otherwise the compiler may keep important data in that register and not know that it has been modified. This can lead to very subtle, unpredictable, and catastrophic bugs that may only show up under particular combinations of optimization options and surrounding code. It's one of the major pitfalls of inline assembly programming, and why many people say you should not use inline assembly at all unless there is an extremely good reason for it.

So, a corrected version would look like

asm ("ldr d1, %1\n\t"
     "ldr d2, %2\n\t"
     "fadd d2, d1, d2\n\t"
     "str d2, %0"
     : "=m" (result)
     : "m" (d1), "m" (d2)
     : "d1", "d2" // clobbers
    );
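
To try the corrected statement in isolation, it can be wrapped in a small function. This is only a sketch: the name add_via_memory is made up for illustration, the __asm__ spelling is used because it is also accepted when compiling with strict -std= options, and the #else branch is an assumption added so the file still builds on machines that are not AArch64.

```c
/* Hypothetical helper (the name is not from the original post).
   Adds two doubles using the memory-operand form of the asm statement. */
double add_via_memory(double d1, double d2)
{
    double result = 0;
#if defined(__aarch64__)
    __asm__ ("ldr d1, %1\n\t"        /* load first operand from memory  */
             "ldr d2, %2\n\t"        /* load second operand from memory */
             "fadd d2, d1, d2\n\t"   /* d2 = d1 + d2                    */
             "str d2, %0"            /* store the sum back to memory    */
             : "=m" (result)
             : "m" (d1), "m" (d2)
             : "d1", "d2");          /* registers the asm overwrites    */
#else
    /* Assumed fallback for non-AArch64 builds: plain C addition. */
    result = d1 + d2;
#endif
    return result;
}
```

Calling add_via_memory(0.34543, 1.0) from main and printing the return value with %f should give the sum the question was after.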

By the way, volatile isn't needed for code that only computes outputs as a pure function of its inputs, without side effects on the machine's state. It inhibits the compiler from optimizing out the asm statement if its outputs are unused. But in this case, if you changed your code in such a way that result wasn't used anymore, it would be a good thing for the compiler to drop the dead asm code that computes it.

Now the code works correctly, but it is still inefficient. You explicitly load your registers from memory, which means the compiler needs to ensure that the values of those variables are actually in memory, even if they were already in registers beforehand. It ends up generating store instructions before the asm block just so that you can do a load to get the same value right back. The same happens on the other end: you store to memory, and the compiler has to turn around and load again. It's a waste of instructions and memory bandwidth. See the generated asm, lines 11-13 and 15, 17.

The whole point of extended asm is that you specify constraints to tell the compiler where you really want the data, and it arranges everything accordingly. You don't really want the data in memory if you're going to do an fadd - you want it in registers. So tell the compiler that.

The constraint for an ARM64 floating-point or SIMD register is w. However, by default this will emit the v name of the register into the generated assembly (v0, v1, etc.), whereas you want d0, d1 for its low 64 bits. You fix this with template modifiers. GCC doesn't explicitly document its support for these, as far as I know, but it does follow armclang's documentation as best I can tell. The d modifier is what we need here:

asm ("fadd %d0, %d1, %d2"
     : "=w" (result)
     : "w" (d1), "w" (d2)
    );

This way:

The code is much shorter.
You do not need to manually choose which three registers to use; the compiler chooses for you.
If the values are already in registers, the compiler can just use the registers where they already are, avoiding unnecessary fmov instructions. If the values are in memory, the compiler will generate loads and stores, but only if needed. You'll never have redundant load/store combinations.
No clobbers are needed, because you don't modify any explicitly named registers, only the output operand %d0, and the compiler can obviously tell that you've modified it, because it's an output.

See the generated asm. Note that stack memory is indeed no longer used at all.
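
As a rough sketch of how the register-constraint form might be packaged, here it is inside a function. The name add_in_registers is hypothetical, the __asm__ spelling is chosen because it is also accepted under strict -std= options, and the #else fallback is an assumption added so the file compiles on non-AArch64 hosts too.

```c
/* Hypothetical wrapper (the name is an assumption, not from the post).
   Adds two doubles, letting the compiler pick the FP registers. */
double add_in_registers(double d1, double d2)
{
    double result;
#if defined(__aarch64__)
    /* "w" asks for a floating-point/SIMD register; the d modifier in
       %d0, %d1, %d2 prints it under its 64-bit d name instead of v. */
    __asm__ ("fadd %d0, %d1, %d2"
             : "=w" (result)
             : "w" (d1), "w" (d2));
#else
    /* Assumed fallback for non-AArch64 builds: plain C addition. */
    result = d1 + d2;
#endif
    return result;
}
```

Because the arguments and the return value of such a function already live in floating-point registers under the AArch64 calling convention, an optimizing build would be expected to need little more than the fadd itself, though the exact output depends on the compiler and its options.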
