gcc inline asm x86 CPU flags as input dependency

I want to create a function that adds two 16-bit integers with overflow detection. I have a generic variant written in portable C, but it is not optimal for an x86 target, because the CPU already computes the overflow flag internally when it executes ADD/SUB/etc. Of course, there is __builtin_add_overflow(), but in my case it generates some boilerplate. So I wrote the following code:

#include <cstdint>

struct result_t
{
    uint16_t src;
    uint16_t dst;
    uint8_t  of;
};

static void add_u16_with_overflow(result_t& r)
{
    char of, cf;
    asm (
        " addw %[dst], %[src] " 
        : [dst] "+mr"(r.dst)//, "=@cco"(of), "=@ccc"(cf)
        : [src] "imr" (r.src) 
        : "cc"
        );

    asm (" seto %0 " : "=rm" (r.of) );

}

uint16_t test_add(uint16_t a, uint16_t b)
{
    result_t r;
    r.src = a;
    r.dst = b;
    add_u16_with_overflow(r);
    add_u16_with_overflow(r);

    return (r.dst + r.of); // use r.dst and r.of for prevent discarding
}

I've played with it on https://godbolt.org/g/2mLF55 (gcc 7.2 -O2 -std=c++11), and it results in:

test_add(unsigned short, unsigned short):
  seto %al 
  movzbl %al, %eax
  addw %si, %di 
  addw %si, %di 
  addl %esi, %eax
  ret

So seto %0 is reordered. It seems gcc thinks there is no dependency between the two consecutive asm() statements, and the "cc" clobber doesn't have any effect on the flags dependency.

I can't use volatile, because seto %0 or the whole function can be (and has to be) optimized out if the result (or some part of it) is not used.

I can add a dependency on r.dst: asm (" seto %0 " : "=rm" (r.of) : "rm"(r.dst) ); and the reordering will not happen. But that is not the "right thing", and the compiler can still insert code that changes the flags (but doesn't change r.dst) between the add and seto statements.

Is there a way to say "this asm() statement changes some CPU flags" and "this asm() statement uses some CPU flags", to create a dependency between the statements and prevent reordering?

I haven't looked at gcc's output for __builtin_add_overflow, but how bad is it? @David's suggestion to use it, and https://gcc.gnu.org/wiki/DontUseInlineAsm, is usually good advice, especially if you're worried about how this will optimize. asm defeats constant propagation and some other optimizations.
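For comparison, here is a minimal sketch of the builtin route, assuming that signed 16-bit overflow (what the OF flag reports) is the condition of interest. The names add_result and add_i16_checked are hypothetical and do not come from the question.

```cpp
#include <cstdint>

// Hypothetical helper type (not from the original post).
struct add_result {
    int16_t sum;       // result wrapped to 16 bits
    bool    overflow;  // true if the exact sum does not fit in int16_t
};

static inline add_result add_i16_checked(int16_t a, int16_t b)
{
    add_result r;
    // GCC/Clang builtin: performs the addition as if with infinite precision,
    // stores the result truncated to the type of *sum, and returns true if
    // the exact value did not fit.
    r.overflow = __builtin_add_overflow(a, b, &r.sum);
    return r;
}
```

Because this is plain C++ as far as the optimizer is concerned, it can be constant-folded and scheduled freely, which is worth checking on the compiler output before reaching for inline asm.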

Also, if you are going to use asm, note that AT&T syntax uses the add %[src], %[dst] operand order. See the tag wiki for details, unless you're always going to build your code with -masm=intel.

Is there a way to say "this asm() statement changes some CPU flags" and "this asm() statement uses some CPU flags", to create a dependency between the statements and prevent reordering?

No. Put the flag-consuming instruction (seto) inside the same asm block as the flag-producing instruction. An asm statement can have as many input and output operands as you like, limited only by register-allocation difficulty (but multiple memory outputs can use the same base register with different offsets). Anyway, an extra write-only output on the statement containing the add isn't going to cause any inefficiency.
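A minimal sketch of that single-statement version, reusing result_t and add_u16_with_overflow from the question. The constraint choices ("+r" for dst so add never gets two memory operands, "=qm" for the byte output) and the non-x86 fallback are my assumptions, not something taken from the original post.

```cpp
#include <cstdint>

struct result_t {
    uint16_t src;
    uint16_t dst;
    uint8_t  of;
};

static inline void add_u16_with_overflow(result_t& r)
{
#if defined(__x86_64__) || defined(__i386__)
    // Both instructions live in one asm statement, so the compiler cannot
    // reorder them or schedule anything between them.
    asm ("addw %[src], %[dst]\n\t"   // AT&T order: dst += src, sets OF
         "seto %[of]"                // capture OF right after the add
         : [dst] "+r"  (r.dst),      // read-write, kept in a register
           [of]  "=qm" (r.of)        // write-only byte register or memory
         : [src] "rm"  (r.src)
         : "cc");
#else
    // Assumed portable fallback so the sketch also builds on other targets.
    int16_t sum;
    r.of  = __builtin_add_overflow((int16_t)r.src, (int16_t)r.dst, &sum);
    r.dst = (uint16_t)sum;
#endif
}
```

The statement is deliberately not volatile, so it can still be dropped entirely when neither r.dst nor r.of is used afterwards.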

I was going to suggest that if you want multiple flag outputs from one instruction, you could use LAHF to Load AH from FLAGS. But that doesn't include OF, only the other condition codes. This is often inconvenient and seems like a bad design choice, because there are some unused reserved bits in the low 8 of EFLAGS/RFLAGS, so OF could have been in the low 8 along with CF, SF, ZF, PF, and AF. Since that isn't the case, setc + seto are probably better than pushf / reload, but that option is worth considering.


Even if there were syntax for flag inputs (like there is for flag outputs), there would be very little to gain from letting gcc insert some of its own non-flag-modifying instructions (like lea or mov) between your two separate asm statements.

You don't want them reordered or anything, so putting them in the same asm statement makes by far the most sense. Even on an in-order CPU, add is low latency, so it's not a big bottleneck to put a dependent instruction right after it.
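The flag-output syntax mentioned above (the "=@cco" / "=@ccc" operands commented out in the question) can be sketched as follows. This assumes a compiler that defines __GCC_ASM_FLAG_OUTPUTS__; the names flags_result and add_u16_flags are hypothetical, and the fallback branch is my addition.

```cpp
#include <cstdint>

// Hypothetical result type (not from the original post).
struct flags_result {
    uint16_t sum;
    bool     of;   // signed overflow (OF)
    bool     cf;   // unsigned carry (CF)
};

static inline flags_result add_u16_flags(uint16_t a, uint16_t b)
{
    flags_result r;
#if defined(__GCC_ASM_FLAG_OUTPUTS__) && (defined(__x86_64__) || defined(__i386__))
    uint16_t dst = a;
    bool of, cf;
    // The flags themselves are outputs here, so the compiler decides how to
    // materialize them (setcc, or a direct branch on the condition).
    // No "cc" clobber is listed because the flags are declared as outputs.
    asm ("addw %[src], %[dst]"
         : [dst] "+r" (dst), "=@cco" (of), "=@ccc" (cf)
         : [src] "rm" (b));
    r.sum = dst;
    r.of  = of;
    r.cf  = cf;
#else
    // Assumed fallback for compilers or targets without flag outputs.
    int16_t s;
    r.of = __builtin_add_overflow((int16_t)a, (int16_t)b, &s);
    r.cf = __builtin_add_overflow(a, b, &r.sum);
#endif
    return r;
}
```

This keeps the producer and both consumers of the flags in a single statement, so there is still nothing for the compiler to reorder.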


And BTW, a jcc might be more efficient if overflow is an error condition that doesn't happen normally. But unfortunately GNU C asm goto doesn't support output operands. You could take a pointer input and modify dst in memory (and use a "memory" clobber), but forcing a store/reload sucks more than using setc or seto to produce an input for a compiler-generated test / jnz.

If you didn't also need an output, you could put C labels on a return true and a return false statement, which (after inlining) would turn your code into a jcc to wherever the compiler wanted to lay out the branches of an if(). E.g. see how Linux does it (with extra complicating factors in the two examples I found: setting up to patch the code after checking a CPU feature once at boot, or something with a section for a jump table in arch_static_branch).
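A rough sketch of the label-based idea for the case where only the overflow condition is needed and the sum is discarded. The function name add_i16_would_overflow, the AX scratch register, and the non-x86 fallback are assumptions for illustration only.

```cpp
#include <cstdint>

// Hypothetical helper: reports whether a + b overflows int16_t, without
// returning the sum (asm goto has no output operands here).
static inline bool add_i16_would_overflow(int16_t a, int16_t b)
{
#if defined(__x86_64__) || defined(__i386__)
    asm goto ("movw %[a], %%ax\n\t"     // copy a into a scratch register
              "addw %[b], %%ax\n\t"     // the add sets OF
              "jo %l[overflow]"         // branch straight to the C label
              : /* no outputs */
              : [a] "r" (a), [b] "rm" (b)
              : "ax", "cc"              // scratch register and flags
              : overflow);
    return false;
overflow:
    return true;
#else
    // Assumed fallback for non-x86 targets.
    int16_t unused;
    return __builtin_add_overflow(a, b, &unused);
#endif
}
```

After inlining into something like if (add_i16_would_overflow(x, y)) { ... }, the compiler is free to place the two return paths wherever it lays out the branches of the if. Note that asm goto is implicitly volatile, so unlike the setcc versions this cannot be optimized away when the result is unused.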
