How to mark as clobbered input operands (C register variables) in extended GCC inline assembly?
Problem description
I'm trying to design C code that unpacks an array A of uint32_t elements into an array B of uint32_t elements, where each element of A is unpacked into two consecutive elements of B so that B[2*i] contains the low 16 bits of A[i] and B[2*i + 1] contains the high 16 bits of A[i] shifted right, i.e.,
B[2*i] = A[i] & 0xFFFFul;
B[2*i+1] = A[i] >> 16u;
Note that the arrays are aligned to 4 and have variable length, but A always contains a multiple of 4 of uint32_t and the size is <= 32, B has sufficient space for unpacking, and we are on ARM Cortex-M3.
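For reference, the intended result can be written as a plain C loop. This is only a minimal sketch for checking an optimized version against; the name unpack_ref and the element-count parameter n are assumptions made for illustration and do not appear in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable reference (hypothetical helper): unpack n words of a
 * into 2*n words of b, exactly as the two assignments above state. */
static void unpack_ref(const uint32_t *a, uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        b[2 * i]     = a[i] & 0xFFFFul;  /* low 16 bits */
        b[2 * i + 1] = a[i] >> 16u;      /* high 16 bits, shifted right */
    }
}
```

Such a loop is handy as a test oracle on a host machine, since it needs no ARM-specific instructions.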
Current bad solution in GCC inline asm
As GCC is not good at optimizing this unpacking, I wrote unrolled C with inline asm to optimize it for speed with acceptable code size and register usage. The unrolled code looks like this:
static void unpack(uint32_t * src, uint32_t * dst, uint8_t nmb8byteBlocks)
{
    switch(nmb8byteBlocks) {
        case 8:
            UNPACK(src, dst)
        case 7:
            UNPACK(src, dst)
        ...
        case 1:
            UNPACK(src, dst)
        default:;
    }
}
where
#define UNPACK(src, dst) \
asm volatile ( \
"ldm %0!, {r2, r4} \n\t" \
"lsrs r3, r2, #16 \n\t" \
"lsrs r5, r4, #16 \n\t" \
"stm %1!, {r2-r5} \n\t" \
: \
: "r" (src), "r" (dst) \
: "r2", "r3", "r4", "r5" \
);
It works until GCC's optimizer decides to inline the function (a wanted property) and reuse the register variables src and dst in the code that follows. Clearly, because of the ldm %0! and stm %1! instructions, src and dst contain different addresses when leaving the switch statement.
How to solve it?
I do not know how to inform GCC that the registers used for src and dst are invalid after the last UNPACK macro in the last case 1:.
I tried to pass them as output operands in all macros or only in the last one ( "=r" (mem), "=r" (pma) ), or somehow (how?) to include them in the inline asm clobbers, but this only made the register handling worse and produced bad code again.
The only solution I found is to disable function inlining ( __attribute__ ((noinline)) ), but in this case I lose the advantage that GCC can cut the proper number of macros and inline them if nmb8byteBlocks is known at compile time. (The same drawback holds for rewriting the code in pure assembly.)
Is there any way to solve this in inline assembly?
I think you are looking for the + constraint modifier, which means "this operand is both read and written". (See the "Modifiers" section of GCC's inline-assembly documentation.)
You also need to tell GCC that this asm reads and writes memory; the easiest way to do that is by adding "memory" to the clobber list. You also clobber the condition codes with lsrs, so a "cc" clobber is necessary as well. Try this:
#define UNPACK(src, dst) \
asm volatile ( \
"ldm %0!, {r2, r4} \n\t" \
"lsrs r3, r2, #16 \n\t" \
"lsrs r5, r4, #16 \n\t" \
"stm %1!, {r2-r5} \n\t" \
: "+r" (src), "+r" (dst) \
: /* no input-only operands */ \
: "r2", "r3", "r4", "r5", "memory", "cc" \
);
(Micro-optimization: since you don't use the condition codes from the shifts, one might prefer lsr over lsrs; it would also make the code easier to read months later, because future you won't be scratching your head wondering whether the condition codes are actually needed here. EDIT: I've been reminded that lsrs has a more compact encoding than lsr in Thumb format, which is enough of a reason to use it even though the condition codes aren't needed.)
(I would like to say that you'd get better register allocator behavior if you let GCC pick the scratch registers, but I don't know how to tell it to pick scratch registers in a particular numeric order as required by ldm and stm, or how to tell it to use only the registers accessible to 2-byte Thumb instructions.)
(It is possible to specify exactly what memory is read and written with "m"-type input and output operands, but it's complicated and may not improve things much. If you discover that this code works but causes a bunch of unrelated stuff to get reloaded from memory into registers unnecessarily, consult "How can I indicate that the memory *pointed* to by an inline ASM argument may be used?")
(You may get better code generation for whatever unpack is inlined into if you change its function signature to
static void unpack(const uint32_t *restrict src,
uint32_t *restrict dst,
unsigned int nmb8byteBlocks)
I would also experiment with adding if (nmb8byteBlocks > 8) __builtin_trap(); as the first line of the function.)
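Putting the suggestions above together, the whole function might look roughly like the sketch below. Several things here are assumptions rather than part of the original posts: the __thumb2__ guard and the plain-C fallback exist only so the sketch also builds on a non-ARM host with GCC or Clang, the macro drops its trailing semicolon so each use is written UNPACK(src, dst);, and the fall-through comments are added for readability. Note that the asm stores the loaded word unchanged in the even elements (r2 and r4 are written back as loaded), so the upper halves of B[2*i] are not masked the way the C formula in the question masks them; the fallback deliberately mirrors that behavior.

```c
#include <stdint.h>

#if defined(__GNUC__) && defined(__thumb2__)
/* Cortex-M3 path: "+r" marks src and dst as read AND written,
 * so GCC knows both pointers have advanced after the asm. */
#define UNPACK(src, dst)                                   \
    __asm__ volatile (                                     \
        "ldm  %0!, {r2, r4}  \n\t"                         \
        "lsrs r3, r2, #16    \n\t"                         \
        "lsrs r5, r4, #16    \n\t"                         \
        "stm  %1!, {r2-r5}   \n\t"                         \
        : "+r" (src), "+r" (dst)                           \
        : /* no input-only operands */                     \
        : "r2", "r3", "r4", "r5", "memory", "cc")
#else
/* Host fallback (assumption, for off-target testing only):
 * same stores and the same pointer updates, written in C. */
#define UNPACK(src, dst)                                   \
    do {                                                   \
        uint32_t w0_ = (src)[0], w1_ = (src)[1];           \
        (dst)[0] = w0_;        /* unmasked, like r2 */     \
        (dst)[1] = w0_ >> 16;  /* like r3 */               \
        (dst)[2] = w1_;        /* unmasked, like r4 */     \
        (dst)[3] = w1_ >> 16;  /* like r5 */               \
        (src) += 2;            /* ldm %0! writeback */     \
        (dst) += 4;            /* stm %1! writeback */     \
    } while (0)
#endif

static void unpack(const uint32_t *restrict src,
                   uint32_t *restrict dst,
                   unsigned int nmb8byteBlocks)
{
    if (nmb8byteBlocks > 8) __builtin_trap();  /* out-of-range count */

    switch (nmb8byteBlocks) {
    case 8: UNPACK(src, dst); /* fall through */
    case 7: UNPACK(src, dst); /* fall through */
    case 6: UNPACK(src, dst); /* fall through */
    case 5: UNPACK(src, dst); /* fall through */
    case 4: UNPACK(src, dst); /* fall through */
    case 3: UNPACK(src, dst); /* fall through */
    case 2: UNPACK(src, dst); /* fall through */
    case 1: UNPACK(src, dst); /* fall through */
    default: break;
    }
}
```

Each UNPACK consumes two source words and produces four destination words, so a call such as unpack(a, b, 2) reads a[0..3] and writes b[0..7]. Because src and dst are ordinary by-value parameters, the caller's own pointers are unaffected; the "+r" constraints only matter to GCC's bookkeeping once the function is inlined.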
Many thanks zwol, this is exactly what I was looking for but couldn't find in the GCC inline assembly pages. It solved the problem perfectly: GCC now makes a copy of src and dst in different registers and uses them correctly after the last UNPACK macro. Two remarks:
I use lsrs because it compiles to the 2-byte Cortex-M3 native lsrs. If I use the lsr version, which leaves the flags untouched, it compiles to the 4-byte mov.w r3, r2, lsr #16 -> the 16-bit Thumb-2 lsr is with 's' by default. Without the 's', the 32-bit Thumb-2 encoding must be used (I have to check it).