How to mark as clobbered input operands (C register variables) in extended GCC inline assembly?

Problem description

I'm trying to write C code that unpacks an array A of uint32_t elements into an array B of uint32_t elements, where each element of A is unpacked into two consecutive elements of B, so that B[2*i] contains the low 16 bits of A[i] and B[2*i + 1] contains the high 16 bits of A[i] shifted right, i.e.:

B[2*i] = A[i] & 0xFFFFul;
B[2*i+1] = A[i] >> 16u;

Note that the arrays are aligned to 4 and have variable length, but A always contains a multiple of 4 uint32_t elements and its size is <= 32, B has sufficient space for unpacking, and we are on ARM Cortex-M3.
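For reference, the two assignments above can be wrapped in a plain C loop. This is only a minimal portable sketch of the intended result, useful as a baseline to compare an optimized version against; the name unpack_ref, the element-count parameter n, and the null-pointer assert are assumptions made for illustration and do not come from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference implementation (name and signature are assumptions).
 * n is the number of uint32_t elements in src; dst must hold 2 * n elements. */
static void unpack_ref(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = src[i] & 0xFFFFul;  /* low 16 bits of A[i]  */
        dst[2 * i + 1] = src[i] >> 16u;      /* high 16 bits of A[i] */
    }
}
```

Calling unpack_ref(A, B, 4) on A = {0x12345678, ...} would leave B[0] == 0x5678 and B[1] == 0x1234.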

Current bad solution in GCC inline asm

As GCC is not good at optimizing this unpacking, I wrote unrolled C with inline asm to optimize it for speed with acceptable code size and register usage. The unrolled code looks like this:

static void unpack(uint32_t * src, uint32_t * dst, uint8_t nmb8byteBlocks)
{
    switch(nmb8byteBlocks) {
        case 8:
            UNPACK(src, dst)
        case 7:
            UNPACK(src, dst)
        ...
        case 1:
            UNPACK(src, dst)
        default:;
    }
}

where

#define UNPACK(src, dst) \
    asm volatile ( \
        "ldm     %0!, {r2, r4} \n\t" \
        "lsrs    r3, r2, #16 \n\t" \
        "lsrs    r5, r4, #16 \n\t" \
        "stm     %1!, {r2-r5} \n\t" \
        : \
        : "r" (src), "r" (dst) \
        : "r2", "r3", "r4", "r5" \
    );

It works until GCC's optimizer decides to inline the function (a wanted property) and reuse the register variables src and dst in the code that follows. Clearly, because of the writeback in the ldm %0! and stm %1! instructions, src and dst contain different addresses when leaving the switch statement.

How to solve it?

I do not know how to inform GCC that the registers used for src and dst are invalid after the last UNPACK macro in the last case 1: label.

I tried to pass them as output operands in all macros or only in the last one ( "=r" (mem), "=r" (pma) ), or somehow to include them in the inline asm clobbers, but that only made the register handling worse and again produced bad code.

The only solution I found is to disable function inlining ( __attribute__ ((noinline)) ), but then I lose the advantage that GCC can cut out the proper number of macros and inline them when nmb8byteBlocks is known at compile time. (The same drawback holds for rewriting the code in pure assembly.)

Is there any way to solve this in inline assembly?

I think you are looking for the + constraint modifier, which means "this operand is both read and written". (See the "Modifiers" section of GCC's inline-assembly documentation.)

You also need to tell GCC that this asm reads and writes memory; the easiest way to do that is by adding "memory" to the clobber list. You also clobber the condition codes with lsrs, so a "cc" clobber is necessary as well. Try this:

#define UNPACK(src, dst) \
    asm volatile ( \
        "ldm     %0!, {r2, r4} \n\t" \
        "lsrs    r3, r2, #16 \n\t" \
        "lsrs    r5, r4, #16 \n\t" \
        "stm     %1!, {r2-r5} \n\t" \
        : "+r" (src), "+r" (dst) \
        : /* no input-only operands */ \
        : "r2", "r3", "r4", "r5", "memory", "cc" \
    );

(Micro-optimization: since you don't use the condition codes from the shifts, it's better to use lsr instead of lsrs. It also makes the code easier to read months later; future you won't be scratching your head wondering if there's some reason why the condition codes are actually needed here. EDIT: I've been reminded that lsrs has a more compact encoding than lsr in Thumb format, which is enough of a reason to use it even though the condition codes aren't needed.)

(I would like to say that you'd get better register allocator behavior if you let GCC pick the scratch registers, but I don't know how to tell it to pick scratch registers in a particular numeric order as required by ldm and stm, or how to tell it to use only the registers accessible to 2-byte Thumb instructions.)

(It is possible to specify exactly what memory is read and written with "m"-type input and output operands, but it's complicated and may not improve things much. If you discover that this code works but causes a bunch of unrelated stuff to get reloaded from memory into registers unnecessarily, consult the question "How can I indicate that the memory *pointed* to by an inline ASM argument may be used?")

(You may get better code generation for whatever unpack is inlined into if you change its function signature to

static void unpack(const uint32_t *restrict src,
                   uint32_t *restrict dst,
                   unsigned int nmb8byteBlocks)

I would also experiment with adding if (nmb8byteBlocks > 8) __builtin_trap(); as the first line of the function.)
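Putting the suggestions above together, the function might look roughly like the sketch below. The Thumb-2 branch reuses the asm from this answer (without the trailing semicolon in the macro, so each call site supplies its own). The #else branch is an assumed plain C stand-in that is not part of the original answer; it is there only so the sketch also builds on other targets with GCC or Clang. The null-pointer assert is likewise an addition for illustration. Note that the asm as posted stores the whole source word in the even slot, without masking its upper half, and the stand-in mirrors that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && defined(__thumb2__)
/* Cortex-M3 path: "+r" tells GCC that the ldm/stm writeback modifies
 * the registers holding src and dst. */
#define UNPACK(src, dst) \
    __asm__ volatile ( \
        "ldm     %0!, {r2, r4} \n\t" \
        "lsrs    r3, r2, #16 \n\t" \
        "lsrs    r5, r4, #16 \n\t" \
        "stm     %1!, {r2-r5} \n\t" \
        : "+r" (src), "+r" (dst) \
        : /* no input-only operands */ \
        : "r2", "r3", "r4", "r5", "memory", "cc" \
    )
#else
/* Assumed stand-in for other targets: same stores and the same pointer
 * advance as the asm, including the unmasked word in the even slot. */
#define UNPACK(src, dst) \
    do { \
        (dst)[0] = (src)[0]; \
        (dst)[1] = (src)[0] >> 16; \
        (dst)[2] = (src)[1]; \
        (dst)[3] = (src)[1] >> 16; \
        (src) += 2; \
        (dst) += 4; \
    } while (0)
#endif

static void unpack(const uint32_t *restrict src,
                   uint32_t *restrict dst,
                   unsigned int nmb8byteBlocks)
{
    assert(src != NULL && dst != NULL);
    /* Lets the optimizer assume the range 0..8 after this point. */
    if (nmb8byteBlocks > 8) __builtin_trap();

    switch (nmb8byteBlocks) {
        case 8: UNPACK(src, dst); /* fall through */
        case 7: UNPACK(src, dst); /* fall through */
        case 6: UNPACK(src, dst); /* fall through */
        case 5: UNPACK(src, dst); /* fall through */
        case 4: UNPACK(src, dst); /* fall through */
        case 3: UNPACK(src, dst); /* fall through */
        case 2: UNPACK(src, dst); /* fall through */
        case 1: UNPACK(src, dst);
        default: break;
    }
}
```

Each block consumes two source words and produces four destination words, so unpack(A, B, 2) reads A[0..3] and writes B[0..7]. Because the even slots are not masked, a consumer should look only at the low 16 bits of B[2*i].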

Many thanks zwol, this is exactly what I was looking for but couldn't find in the GCC inline assembly pages. It solved the problem perfectly: GCC now makes a copy of src and dst in different registers and uses them correctly after the last UNPACK macro. Two remarks:

  1. I use lsrs because it assembles to the 2-byte native Cortex-M3 lsrs. If I use the lsr version that leaves the flags untouched, it assembles to the 4-byte mov.w r3, r2, lsr #16; the 16-bit Thumb-2 lsr comes with the 's' by default. Without the 's', the 32-bit Thumb-2 encoding must be used (I have to check it). Anyway, I should add "cc" to the clobbers in this case.
  2. In the code above, I removed the nmb8byteBlocks value range check to keep it clear. But of course, your last sentence is a good point, and not only for C programmers.

