Problem description
I'm trying to write C code that unpacks an array A of uint32_t elements into an array B of uint32_t elements, where each element of A is unpacked into two consecutive elements of B: B[2*i] contains the low 16 bits of A[i], and B[2*i + 1] contains the high 16 bits of A[i] shifted right, i.e.,
B[2*i] = A[i] & 0xFFFFul;
B[2*i+1] = A[i] >> 16u;
Note that the arrays are 4-byte aligned and have variable length, but A always contains a multiple of 4 uint32_t elements and its size is <= 32; B has sufficient space for the unpacked data, and we are on an ARM Cortex-M3.
Current bad solution in GCC inline asm
Since GCC does not optimize this unpacking well, I wrote unrolled C with inline asm to make it speed-optimized with acceptable code size and register usage. The unrolled code looks like this:
static void unpack(uint32_t *src, uint32_t *dst, uint8_t nmb8byteBlocks)
{
    switch (nmb8byteBlocks) {
        case 8:
            UNPACK(src, dst)
        case 7:
            UNPACK(src, dst)
        ...
        case 1:
            UNPACK(src, dst)
        default:;
    }
}
where
#define UNPACK(src, dst) \
asm volatile ( \
"ldm %0!, {r2, r4} \n\t" \
"lsrs r3, r2, #16 \n\t" \
"lsrs r5, r4, #16 \n\t" \
"stm %1!, {r2-r5} \n\t" \
: \
: "r" (src), "r" (dst) \
: "r2", "r3", "r4", "r5" \
);
It works until GCC's optimizer decides to inline the function (a wanted property) and reuse the register variables src and dst in the code that follows. Clearly, because of the ldm %0! and stm %1! instructions, src and dst contain different addresses when leaving the switch statement.
How to solve it?
I do not know how to inform GCC that the registers used for src and dst are no longer valid after the last UNPACK macro in the last case 1:.
I tried to pass them as output operands in all macros or only the last one ( "=r" (mem), "=r" (pma) ), or somehow (how?) to include them in the inline asm clobbers, but it only made the register handling worse, producing bad code again.
The only solution I found is to disable function inlining ( __attribute__ ((noinline)) ), but then I lose the advantage that GCC can cut out the proper number of macros and inline the function when nmb8byteBlocks is known at compile time. (The same drawback holds for rewriting the code in pure assembly.)
Is there any way to solve this in inline assembly?
I think you are looking for the + constraint modifier, which means "this operand is both read and written". (See the "Modifiers" section of GCC's inline-assembly documentation.)
You also need to tell GCC that this asm reads and writes memory; the easiest way to do that is by adding "memory" to the clobber list. And you clobber the condition codes with lsrs, so a "cc" clobber is also necessary. Try this:
#define UNPACK(src, dst) \
asm volatile ( \
"ldm %0!, {r2, r4} \n\t" \
"lsrs r3, r2, #16 \n\t" \
"lsrs r5, r4, #16 \n\t" \
"stm %1!, {r2-r5} \n\t" \
: "+r" (src), "+r" (dst) \
: /* no input-only operands */ \
: "r2", "r3", "r4", "r5", "memory", "cc" \
);
(Micro-optimization: since you don't use the condition codes from the shifts, it would be better to use lsr instead of lsrs. It also makes the code easier to read months later; future you won't be scratching your head wondering if there's some reason why the condition codes are actually needed here. EDIT: I've been reminded that lsrs has a more compact encoding than lsr in Thumb format, which is enough of a reason to use it even though the condition codes aren't needed.)
(I would like to say that you'd get better register allocator behavior if you let GCC pick the scratch registers, but I don't know how to tell it to pick scratch registers in a particular numeric order as required by ldm
and stm
, or how to tell it to use only the registers accessible to 2-byte Thumb instructions.)
(It is possible to specify exactly what memory is read and written with "m"
-type input and output operands, but it's complicated and may not improve things much. If you discover that this code works but causes a bunch of unrelated stuff to get reloaded from memory into registers unnecessarily, consult How can I indicate that the memory *pointed* to by an inline ASM argument may be used? )
(You may get better code generation for whatever unpack is inlined into if you change its function signature to
static void unpack(const uint32_t *restrict src,
uint32_t *restrict dst,
unsigned int nmb8byteBlocks)
I would also experiment with adding if (nmb8byteBlocks > 8) __builtin_trap();
as the first line of the function.)
Many thanks zwol, this is exactly what I was looking for but couldn't find in the GCC inline assembly pages. It solved the problem perfectly: now GCC makes a copy of src and dst in different registers and uses them correctly after the last UNPACK macro. Two remarks:
1. I keep lsrs because it compiles to the 2-byte native Cortex-M3 lsrs. If I use the flag-preserving lsr version, it compiles to the 4-byte mov.w r3, r2, lsr #16 -> the 16-bit Thumb-2 lsr sets the flags by default; without the 's', the 32-bit Thumb-2 encoding must be used (I have to check it).
2. Anyway, I should add "cc" to the clobbers in this case.