There is a bool variable named "Enable", when "Enable" is false, I want to create following function:
void test_false()
{
float dst[4] = {1.0, 1.0, 1.0, 1.0};
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
float * src_addr = src;
asm volatile (
"vld1.32 {q0}, [%[src]] \n"
"vld1.32 {q1}, [%[dst]] \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n"
"vst1.32 {q0}, [%[dst]] \n"
:[src]"+r"(src_addr),
[dst]"+r"(dst_addr)
:
: "q0", "q1", "q2", "q3", "memory"
);
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
}
And when "Enable" is true, I want to create following function:
void test_true()
{
float dst[4] = {1.0, 1.0, 1.0, 1.0};
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
float * src_addr = src;
asm volatile (
"vld1.32 {q0}, [%[src]] \n"
"vld1.32 {q1}, [%[dst]] \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n" //Only here is different from test_false()
"vst1.32 {q0}, [%[dst]] \n"
:[src]"+r"(src_addr),
[dst]"+r"(dst_addr)
:
: "q0", "q1", "q2", "q3", "memory"
);
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
}
But I don't want to save two copies of code, because most of them are the same. I want to use “c++ Template + Conditional Compile” to solve my problem. The code is as follows. But it didn't work. Whether the Enable is true or false, the compiler creates the code same as test_true().
template<bool Enable>
void test_tmp()
{
float dst[4] = {1.0, 1.0, 1.0, 1.0};
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
float * src_addr = src;
if (Enable)
{
#define FUSE_
}
asm volatile (
"vld1.32 {q0}, [%[src]] \n"
"vld1.32 {q1}, [%[dst]] \n"
"vadd.f32 q0, q0, q1 \n"
"vadd.f32 q0, q0, q1 \n"
#ifdef FUSE_
"vadd.f32 q0, q0, q1 \n"
#endif
"vst1.32 {q0}, [%[dst]] \n"
:[src]"+r"(src_addr),
[dst]"+r"(dst_addr)
:
: "q0", "q1", "q2", "q3", "memory"
);
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
#undef FUSE_
}
template void test_tmp<true>();
template void test_tmp<false>();
It doesn't seem possible to write code like function test_tmp(). Does anyone know how to solve my problem? Thanks a lot.
If you use C temporaries and output operands for all live registers in the first half that line up with input constraints for the 2nd half, you should be able to split it up your inline asm without any performance loss, especially if you use specific memory input/output constraints instead of a catch-all "memory"
clobber. But it will get a lot more complicated.
This obviously doesn't work, because the C preprocessor runs before the C++ compiler even looks at if()
statements.
if (Enable) {
#define FUSE_ // always defined, regardless of Enable
}
But the GNU assembler has its own macro / conditional-assembly directives like .if
which operate on the asm the compiler emits after making text substitutions into the asm()
template, including actual numeric values for immediate input operands.
bool
as an input operand for an assembler .if
directive Use an "i" (Enable)
input constraint. Normally the %0
or %[enable]
expansion of that would be #0
or #1
, because that's how to print an ARM immediate. But GCC has a %c0
/ %c[enable]
modifier that will print a constant without punctuation. (It's documented for x86 , but works the same way for ARM and presumably all other architectures. Documentation for ARM / AArch64 operand modifiers is being worked on; I've been sitting on an email about that...)
".if %c[enable] \\n\\t"
for [enable] "i" (c_var)
will substitute as .if 0
or .if 1
into the inline-asm template, exactly what we need to make .if
/ .endif
work at assemble time.
Full example:
template<bool Enable>
void test_tmp(float dst[4])
{
//float dst[4] = {1.0, 1.0, 1.0, 1.0};
// static const // non-static-const so we can see the memory clobber vs. dummy src stop this from optimizing away init of src[] on the stack
float src[4] = {1.0, 2.0, 3.0, 4.0};
float * dst_addr = dst;
const float * src_addr = src;
asm (
"vld1.32 {q1}, [%[dst]] @ dummy dst = %[dummy_memdst]\n" // hopefully they pick the same regs?
"vld1.32 {q0}, [%[src]] @ dummy src = %[dummy_memsrc]\n"
"vadd.f32 q0, q0, q1 \n" // TODO: optimize to q1+q1 first, without a dep on src
"vadd.f32 q0, q0, q1 \n" // allowing q0+=q1 and q1+=q1 in parallel if we need q0 += 3*q1
// #ifdef FUSE_
".if %c[enable]\n" // %c modifier: print constant without punctuation, same as documented for x86
"vadd.f32 q0, q0, q1 \n"
".endif \n"
// #endif
"vst1.32 {q0}, [%[dst]] \n"
: [dummy_memdst] "+m" (*(float(*)[4])dst_addr)
: [src]"r"(src_addr),
[dst]"r"(dst_addr),
[enable]"i"(Enable)
, [dummy_memsrc] "m" (*(const float(*)[4])src_addr)
: "q0", "q1", "q2", "q3" //, "memory"
);
/*
for (int i = 0; i < 4; i++)
{
printf("%f, ", dst[i]);//0.0 0.0 0.0 0.0
}
*/
}
float dst[4] = {1.0, 1.0, 1.0, 1.0};
template void test_tmp<true>(float *);
template void test_tmp<false>(float *);
compiles with GCC and Clang on the Godbolt compiler explorer
With gcc, you only get the compiler's .s
output, so you have to turn off some of the usual compiler-explorer filters and look through the directives. All 3 vadd.f32
instructions are there in the false
version, but one of them is surrounded by .if 0
/ .endif
.
But clang's built-in assembler processes assembler directives internally before turning things back into asm if that output is requested. (Normally clang/LLVM goes straight to machine code, unlike gcc which always runs a separate assembler).
Just to be clear, this works with gcc and clang, but it's just easier to see it on Godbolt with clang. (Because Godbolt doesn't have a "binary" mode that actually assembles and then disassembles, except for x86). Clang output for the false
version
...
vld1.32 {d2, d3}, [r0] @ dummy dst = [r0]
vld1.32 {d0, d1}, [r1] @ dummy src = [r1]
vadd.f32 q0, q0, q1
vadd.f32 q0, q0, q1
vst1.32 {d0, d1}, [r0]
...
Notice that clang picked the same GP register for the raw pointers as it used for the memory operand. (gcc seems to choose [sp]
for src_mem, but a different reg for the pointer input that you use manually inside an addressing mode). If you hadn't forced it to have the pointers in registers, it could have used an SP-relative addressing mode with an offset for the vector loads, potentially taking advantage of ARM addressing modes.
If you're really not going to modify the pointers inside the asm (eg with post-increment addressing modes), then "r"
input-only operands makes the most sense. If we'd left in the printf
loop, the compiler would have needed dst
again after the asm, so it would benefit from having it still in a register. A "+r"(dst_addr)
input forces the compiler to assume that that register is no longer usable as a copy of dst
. Anyway, gcc always copies the registers, even when it doesn't need it later, whether I make it "r"
or "+r"
, so that's weird.
Using (dummy) memory inputs / outputs means we can drop the volatile
, so the compiler can optimize it normally as a pure function of its inputs. (And optimize it away if the result is unused.)
Hopefully this isn't worse code-gen that with the "memory"
clobber. But it would probably be better if you just used the "=m"
and "m"
memory operands, and didn't ask for pointers in registers at all. (That doesn't help if you're going to loop over the array with inline asm, though.)
I haven't been doing ARM assembly for few years, and I never really bothered to learn GCC inline assembly properly, but I think your code can be rewritten like this, using intrinsics:
#include <cstdio>
#include <arm_neon.h>
template<bool Enable>
void test_tmp()
{
const float32x4_t src = {1.0, 2.0, 3.0, 4.0};
const float32x4_t src2 = {1.0, 1.0, 1.0, 1.0};
float32x4_t z;
z = vaddq_f32(src, src2);
z = vaddq_f32(z, src2);
if (Enable) z = vaddq_f32(z, src2);
float result[4];
vst1q_f32(result, z);
for (int i = 0; i < 4; i++)
{
printf("%f, ", result[i]);//0.0 0.0 0.0 0.0
}
}
template void test_tmp<true>();
template void test_tmp<false>();
You can see resulting machine code + toy around live at: https://godbolt.org/z/Fg7Tci
Compiled with ARM gcc8.2 and command line options "-O3 -mfloat-abi=softfp -mfpu=neon" the "true" variant is:
void test_tmp<true>():
vmov.f32 q9, #1.0e+0 @ v4sf
vldr d16, .L6
vldr d17, .L6+8
# and the FALSE variant has one less vadd.f32 in this part
vadd.f32 q8, q8, q9
vadd.f32 q8, q8, q9
vadd.f32 q8, q8, q9
push {r4, r5, r6, lr}
sub sp, sp, #16
vst1.32 {d16-d17}, [sp:64]
mov r4, sp
ldr r5, .L6+16
add r6, sp, #16
.L2:
vldmia.32 r4!, {s15}
vcvt.f64.f32 d16, s15
mov r0, r5
vmov r2, r3, d16
bl printf
cmp r4, r6
bne .L2
add sp, sp, #16
pop {r4, r5, r6, pc}
.L6:
.word 1065353216
.word 1073741824
.word 1077936128
.word 1082130432
.word .LC0
.LC0:
.ascii "%f, \000"
This still leaves me profoundly confused by why the gcc doesn't simply calculate final string with values as string for output, as the inputs are constant. Maybe it's some math-rule about precision preventing it to do that in compile-time as the result could differ slightly from actual target HW platform FPU? Ie with some fast-math switch it would probably drop that code completely and just produce one output string...
But I guess your code is not actually proper "MCVE" of what you are doing, and the test values would be fed into some real function you are testing, or something like that.
Anyway, if you are working on performance optimizations, you should probably rather avoid inline assembly completely and use intrinsics instead, as that allows the compiler to better allocate registers and optimize code around the calculations (I didn't track it precisely, but I think the last version of this experiment in godbolt was 2-4 instructions shorter/simpler than the original using inline assembly).
Plus you will avoid the incorrect asm constraints like your example code has, those are always tricky to get correctly and pure PITA to maintain if you keep modifying the inlined code often.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.