
How to use c++ template to conditionally compile asm code?

There is a bool variable named "Enable". When "Enable" is false, I want to create the following function:

void test_false()
{
   float dst[4] = {1.0, 1.0, 1.0, 1.0};
   float src[4] = {1.0, 2.0, 3.0, 4.0};
   float * dst_addr = dst;
   float * src_addr = src;


   asm volatile (
                 "vld1.32    {q0}, [%[src]]  \n"
                 "vld1.32    {q1}, [%[dst]]  \n"
                 "vadd.f32   q0, q0, q1      \n"
                 "vadd.f32   q0, q0, q1      \n"
                 "vst1.32    {q0}, [%[dst]]  \n"
                 :[src]"+r"(src_addr),
                 [dst]"+r"(dst_addr)
                 :
                 : "q0", "q1", "q2", "q3", "memory"
                 );

   for (int i = 0; i < 4; i++)
   {
       printf("%f, ", dst[i]);//0.0  0.0  0.0  0.0
   }
}

And when "Enable" is true, I want to create the following function:

void test_true()
{
   float dst[4] = {1.0, 1.0, 1.0, 1.0};
   float src[4] = {1.0, 2.0, 3.0, 4.0};
   float * dst_addr = dst;
   float * src_addr = src;


   asm volatile (
                 "vld1.32    {q0}, [%[src]]  \n"
                 "vld1.32    {q1}, [%[dst]]  \n"
                 "vadd.f32   q0, q0, q1      \n"
                 "vadd.f32   q0, q0, q1      \n"
                 "vadd.f32   q0, q0, q1      \n" //Only here is different from test_false()
                 "vst1.32    {q0}, [%[dst]]  \n"
                 :[src]"+r"(src_addr),
                 [dst]"+r"(dst_addr)
                 :
                 : "q0", "q1", "q2", "q3", "memory"
                 );

   for (int i = 0; i < 4; i++)
   {
       printf("%f, ", dst[i]);//0.0  0.0  0.0  0.0
   }
}

But I don't want to keep two copies of the code, because most of it is the same. I want to use "C++ template + conditional compilation" to solve my problem. The code is as follows, but it doesn't work: whether Enable is true or false, the compiler generates the same code as test_true().

template<bool Enable>
void test_tmp()
{
   float dst[4] = {1.0, 1.0, 1.0, 1.0};
   float src[4] = {1.0, 2.0, 3.0, 4.0};
   float * dst_addr = dst;
   float * src_addr = src;

    if (Enable)
    {
        #define FUSE_
    }

   asm volatile (
                 "vld1.32    {q0}, [%[src]]  \n"
                 "vld1.32    {q1}, [%[dst]]  \n"
                 "vadd.f32   q0, q0, q1          \n"
                 "vadd.f32   q0, q0, q1          \n"

                 #ifdef FUSE_
                 "vadd.f32   q0, q0, q1          \n"
                 #endif

                 "vst1.32    {q0}, [%[dst]]  \n"
                 :[src]"+r"(src_addr),
                 [dst]"+r"(dst_addr)
                 :
                 : "q0", "q1", "q2", "q3", "memory"
                 );



   for (int i = 0; i < 4; i++)
   {
       printf("%f, ", dst[i]);//0.0  0.0  0.0  0.0
   }

   #undef FUSE_
}


template void test_tmp<true>();
template void test_tmp<false>();

It doesn't seem possible to write code like the function test_tmp(). Does anyone know how to solve my problem? Thanks a lot.

If you use C temporaries and output operands for all live registers in the first half, lined up with input constraints for the second half, you should be able to split up your inline asm without any performance loss, especially if you use specific memory input/output constraints instead of a catch-all "memory" clobber. But it will get a lot more complicated.
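As a rough illustration of that splitting idea, here is a minimal sketch. It is not the NEON code from the question: it assumes x86 with GNU-style inline asm in the default AT&T syntax (falling back to plain C++ elsewhere) so it can be tried on a desktop, and add_split is a hypothetical name. The C variable acc is the temporary that carries the live value out of the first asm statement and into the second, and an ordinary if on the template parameter decides whether the second statement is needed.

```cpp
// Sketch only: an x86 stand-in for the NEON code; add_split is a hypothetical name.
template <bool Enable>
int add_split(int acc, int step)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // First half: two unconditional adds. "+r" makes acc both an input and an output,
    // so its value survives in a C variable between the two asm statements.
    asm("add %[step], %[acc]\n\t"
        "add %[step], %[acc]"
        : [acc] "+r"(acc)
        : [step] "r"(step)
        : "cc");
    // Second half: Enable is a compile-time constant, so the compiler can
    // discard this statement as dead code in the <false> instantiation.
    if (Enable) {
        asm("add %[step], %[acc]"
            : [acc] "+r"(acc)
            : [step] "r"(step)
            : "cc");
    }
#else
    // Portable fallback with the same arithmetic, for other targets/compilers.
    acc += step;
    acc += step;
    if (Enable) acc += step;
#endif
    return acc;
}
```

With more live registers (as in a real NEON kernel) every one of them would need its own operand pair, which is where the extra complexity comes from.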


This obviously doesn't work, because the C preprocessor runs before the C++ compiler even looks at if() statements.

if (Enable) {
    #define FUSE_    // always defined, regardless of Enable
}
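A small portable sketch makes the problem visible without any asm. The names count_adds and FUSE_DEMO_ are made up for this illustration. Because the preprocessor has already handled #define and #ifdef before templates are instantiated, both instantiations contain the "conditional" line.

```cpp
// Sketch only: count_adds and FUSE_DEMO_ are hypothetical names.
template <bool Enable>
int count_adds()
{
    if (Enable)
    {
        #define FUSE_DEMO_   // processed as text; the surrounding if() is irrelevant
    }

    int adds = 2;            // the two unconditional additions
#ifdef FUSE_DEMO_
    adds += 1;               // kept in BOTH instantiations
#endif

    #undef FUSE_DEMO_
    return adds;
}
```

Both count_adds<true>() and count_adds<false>() return 3 here, which matches the behaviour described in the question.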

But the GNU assembler has its own macro / conditional-assembly directives like .if which operate on the asm the compiler emits after making text substitutions into the asm() template, including actual numeric values for immediate input operands.

Use the bool as an input operand for an assembler .if directive

Use an "i" (Enable) input constraint. Normally the %0 or %[enable] expansion of that would be #0 or #1, because that's how to print an ARM immediate. But GCC has a %c0 / %c[enable] modifier that will print a constant without punctuation. (It's documented for x86, but works the same way for ARM and presumably all other architectures. Documentation for ARM / AArch64 operand modifiers is being worked on; I've been sitting on an email about that...)

".if %c[enable] \n\t" for [enable] "i" (c_var) will substitute as .if 0 or .if 1 into the inline-asm template, exactly what we need to make .if / .endif work at assemble time.

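Before the ARM version, here is a minimal sketch of the same trick on x86, where the %c modifier is documented. It assumes GCC or Clang with GNU-style inline asm in the default AT&T syntax; the function name add_with_asm_if and the scalar adds are stand-ins chosen for this illustration, and other targets use a plain C++ fallback.

```cpp
// Sketch only: x86 stand-in; add_with_asm_if is a hypothetical name.
template <bool Enable>
int add_with_asm_if(int acc, int step)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    asm("add %[step], %[acc]\n\t"
        "add %[step], %[acc]\n\t"
        ".if %c[enable]\n\t"         // becomes ".if 0" or ".if 1" in the emitted asm
        "add %[step], %[acc]\n\t"    // assembled only when Enable is true
        ".endif"
        : [acc] "+r"(acc)
        : [step] "r"(step),
          [enable] "i"(Enable)       // template parameter as an immediate operand
        : "cc");
#else
    // Portable fallback with the same arithmetic, for other targets/compilers.
    acc += step;
    acc += step;
    if (Enable) acc += step;
#endif
    return acc;
}
```

The compiler itself never sees a difference between the two instantiations apart from the immediate value; it is the assembler that drops the third add.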
Full example:

template<bool Enable>
void test_tmp(float dst[4])
{
   //float dst[4] = {1.0, 1.0, 1.0, 1.0};
   // static const    // non-static-const so we can see the memory clobber vs. dummy src stop this from optimizing away init of src[] on the stack
   float src[4] = {1.0, 2.0, 3.0, 4.0};
   float * dst_addr = dst;
   const float * src_addr = src;

   asm (
                 "vld1.32    {q1}, [%[dst]]  @ dummy dst = %[dummy_memdst]\n" // hopefully they pick the same regs?
                 "vld1.32    {q0}, [%[src]]  @ dummy src = %[dummy_memsrc]\n"
                 "vadd.f32   q0, q0, q1          \n"  // TODO: optimize to q1+q1 first, without a dep on src
                 "vadd.f32   q0, q0, q1          \n"  // allowing q0+=q1 and q1+=q1 in parallel if we need q0 += 3*q1
//                 #ifdef FUSE_
                ".if %c[enable]\n"    // %c modifier: print constant without punctuation, same as documented for x86
                 "vadd.f32   q0, q0, q1          \n"
                 ".endif \n"
//                 #endif

                 "vst1.32    {q0}, [%[dst]]  \n"
                 : [dummy_memdst] "+m" (*(float(*)[4])dst_addr)
                 : [src]"r"(src_addr),
                   [dst]"r"(dst_addr),
                   [enable]"i"(Enable)
                  , [dummy_memsrc] "m" (*(const float(*)[4])src_addr)
                 : "q0", "q1", "q2", "q3" //, "memory"
                 );


/*
   for (int i = 0; i < 4; i++)
   {
       printf("%f, ", dst[i]);//0.0  0.0  0.0  0.0
   }
*/
}

float dst[4] = {1.0, 1.0, 1.0, 1.0};
template void test_tmp<true>(float *);
template void test_tmp<false>(float *);

This compiles with GCC and Clang on the Godbolt compiler explorer.

With gcc, you only get the compiler's .s output, so you have to turn off some of the usual compiler-explorer filters and look through the directives. All 3 vadd.f32 instructions are there in the false version, but one of them is surrounded by .if 0 / .endif .

But clang's built-in assembler processes assembler directives internally before turning things back into asm if that output is requested. (Normally clang/LLVM goes straight to machine code, unlike gcc which always runs a separate assembler).

Just to be clear, this works with gcc and clang, but it's just easier to see it on Godbolt with clang. (Because Godbolt doesn't have a "binary" mode that actually assembles and then disassembles, except for x86). Clang output for the false version

 ...

    vld1.32 {d2, d3}, [r0]    @ dummy dst = [r0]
    vld1.32 {d0, d1}, [r1]    @ dummy src = [r1]
    vadd.f32        q0, q0, q1
    vadd.f32        q0, q0, q1
    vst1.32 {d0, d1}, [r0]

 ... 

Notice that clang picked the same GP register for the raw pointers as it used for the memory operand. (gcc seems to choose [sp] for src_mem, but a different reg for the pointer input that you use manually inside an addressing mode). If you hadn't forced it to have the pointers in registers, it could have used an SP-relative addressing mode with an offset for the vector loads, potentially taking advantage of ARM addressing modes.

If you're really not going to modify the pointers inside the asm (e.g. with post-increment addressing modes), then "r" input-only operands make the most sense. If we'd left in the printf loop, the compiler would have needed dst again after the asm, so it would benefit from having it still in a register. A "+r"(dst_addr) operand forces the compiler to assume that the register is no longer usable as a copy of dst. Anyway, gcc always copies the registers, even when it doesn't need them later, whether I make it "r" or "+r", so that's weird.

Using (dummy) memory inputs / outputs means we can drop the volatile, so the compiler can optimize it normally as a pure function of its inputs. (And optimize it away if the result is unused.)
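To show the dummy-operand pattern in isolation, here is a minimal sketch, again on x86-64 with SSE instead of NEON so it can be run on a desktop. The name add4_in_place is hypothetical, GNU-style inline asm in AT&T syntax is assumed, and other targets get a scalar fallback. The array-typed "+m" and "m" operands are never referenced in the template; they only tell the compiler which 16 bytes are read and written, so neither volatile nor a "memory" clobber is needed.

```cpp
// Sketch only: x86-64 SSE stand-in; add4_in_place is a hypothetical name.
inline void add4_in_place(float* dst, const float* src)
{
#if defined(__GNUC__) && defined(__x86_64__)
    asm("movups (%[src]), %%xmm0\n\t"
        "movups (%[dst]), %%xmm1\n\t"
        "addps  %%xmm1, %%xmm0\n\t"
        "movups %%xmm0, (%[dst])"
        : [dst_mem] "+m"(*(float (*)[4])dst)        // dummy: dst[0..3] is read and written
        : [src] "r"(src),
          [dst] "r"(dst),
          [src_mem] "m"(*(const float (*)[4])src)   // dummy: src[0..3] is read
        : "xmm0", "xmm1");
#else
    // Portable fallback with the same arithmetic, for other targets/compilers.
    for (int i = 0; i < 4; ++i) dst[i] += src[i];
#endif
}
```

Because the asm statement has an output and is not volatile, the compiler is free to delete the whole thing if nothing reads dst afterwards.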

Hopefully this isn't worse code-gen than with the "memory" clobber. But it would probably be better if you just used the "=m" and "m" memory operands, and didn't ask for pointers in registers at all. (That doesn't help if you're going to loop over the array with inline asm, though.)

See also Looping over arrays with inline assembly

I haven't done ARM assembly for a few years, and I never really bothered to learn GCC inline assembly properly, but I think your code can be rewritten like this, using intrinsics:

#include <cstdio>
#include <arm_neon.h>

template<bool Enable>
void test_tmp()
{
    const float32x4_t src = {1.0, 2.0, 3.0, 4.0};
    const float32x4_t src2 = {1.0, 1.0, 1.0, 1.0};
    float32x4_t z;

    z = vaddq_f32(src, src2);
    z = vaddq_f32(z, src2);
    if (Enable) z = vaddq_f32(z, src2);
    float result[4];
    vst1q_f32(result, z);
    for (int i = 0; i < 4; i++)
    {
        printf("%f, ", result[i]);//0.0  0.0  0.0  0.0
    }
}

template void test_tmp<true>();
template void test_tmp<false>();

You can see resulting machine code + toy around live at: https://godbolt.org/z/Fg7Tci

Compiled with ARM gcc8.2 and command line options "-O3 -mfloat-abi=softfp -mfpu=neon" the "true" variant is:

void test_tmp<true>():
        vmov.f32        q9, #1.0e+0  @ v4sf
        vldr    d16, .L6
        vldr    d17, .L6+8
        # and the FALSE variant has one less vadd.f32 in this part
        vadd.f32        q8, q8, q9
        vadd.f32        q8, q8, q9
        vadd.f32        q8, q8, q9
        push    {r4, r5, r6, lr}
        sub     sp, sp, #16
        vst1.32 {d16-d17}, [sp:64]
        mov     r4, sp
        ldr     r5, .L6+16
        add     r6, sp, #16
.L2:
        vldmia.32       r4!, {s15}
        vcvt.f64.f32    d16, s15
        mov     r0, r5
        vmov    r2, r3, d16
        bl      printf
        cmp     r4, r6
        bne     .L2
        add     sp, sp, #16
        pop     {r4, r5, r6, pc}

.L6:
        .word   1065353216
        .word   1073741824
        .word   1077936128
        .word   1082130432
        .word   .LC0

.LC0:
        .ascii  "%f, \000"

This still leaves me profoundly confused about why gcc doesn't simply compute the final output string at compile time, since the inputs are constant. Maybe some floating-point precision rule prevents it from doing that at compile time, as the result could differ slightly from the actual target hardware's FPU? I.e. with some fast-math switch it would probably drop that code completely and just produce one output string...

But I guess your code is not actually a proper MCVE of what you are doing, and the test values would be fed into some real function you are testing, or something like that.

Anyway, if you are working on performance optimizations, you should probably rather avoid inline assembly completely and use intrinsics instead, as that allows the compiler to better allocate registers and optimize code around the calculations (I didn't track it precisely, but I think the last version of this experiment in godbolt was 2-4 instructions shorter/simpler than the original using inline assembly).

Plus you will avoid incorrect asm constraints like the ones in your example code; those are always tricky to get right and a pure PITA to maintain if you keep modifying the inlined code often.
