gcc inline asm x86 CPU flags as input dependency

Question

I want to create a function for addition two 16-bit integers with overflow detection. I have generic variant written in portable c. But the generic variant is not optimal for x86 target, because CPU internally calculate overflow flag when execute ADD/SUB/etc. Of course, there is __builtin_add_overflow() , but in my case it generates some boilerplate. So I write the following code:

#include <cstdint>

struct result_t
{
    uint16_t src;
    uint16_t dst;
    uint8_t  of;
};

static void add_u16_with_overflow(result_t& r)
{
    char of, cf;
    asm (
        " addw %[dst], %[src] " 
        : [dst] "+mr"(r.dst)//, "=@cco"(of), "=@ccc"(cf)
        : [src] "imr" (r.src) 
        : "cc"
        );

    asm (" seto %0 " : "=rm" (r.of) );

}

uint16_t test_add(uint16_t a, uint16_t b)
{
    result_t r;
    r.src = a;
    r.dst = b;
    add_u16_with_overflow(r);
    add_u16_with_overflow(r);

    return (r.dst + r.of); // use r.dst and r.of for prevent discarding
}

I've played with https://godbolt.org/g/2mLF55 (gcc 7.2 -O2 -std=c++11) and it results

test_add(unsigned short, unsigned short):
  seto %al 
  movzbl %al, %eax
  addw %si, %di 
  addw %si, %di 
  addl %esi, %eax
  ret

So, seto %0 is reordered. It seems gcc think there is no dependency between two consequent asm() statements. And "cc" clobber doesn't have any effect for flags dependency.

I can't use volatile because seto %0 or whole function can be (and have to) optimized out if result (or some part of result) is not used.

I can add dependency for r.dst: asm (" seto %0 " : "=rm" (r.of) : "rm"(r.dst) ); , and reordering will not happen. But it is not a "right thing", and compiler still can insert some code changes flags (but not changes r.dst) between add and seto statement.

Is there way to say "this asm() statement change some cpu flags" and "this asm() use some cpu flags" for dependency between statement and prevent reordering?

Answer 1

I haven't looked at gcc's output for __builtin_add_overflow , but how bad is it? @David's suggestion to use it, and https://gcc.gnu.org/wiki/DontUseInlineAsm is usually good, especially if you're worried about how this will optimize. asm defeats constant propagation and some other things.

Also, if you are going to use ASM, note that att syntax is add %[src], %[dst] operand order. See the tag wiki for details, unless you're always going to build your code with -masm=intel .

Is there way to say "this asm() statement change some cpu flags" and "this asm() use some cpu flags" for dependency between statement and prevent reordering?

No. Put the flag-consuming instruction ( seto ) inside the same asm block as the flag-producing instruction . An asm statement can have an many input and output operands as you like, limited only by register-allocation difficulty (but multiple memory outputs can use the same base register with different offsets). Anyway, an extra write-only output on the statement containing the add isn't going to cause any inefficiency.

I was going to suggest that if you want multiple flag outputs from one instruction, use LAHF to Load AH from FLAGS. But that doesn't include OF, only the other condition codes. This is often inconvenient and seems like a bad design choice because there are some unused reserved bits in the low 8 of EFLAGS/RFLAGS , so OF could have been in the low 8 along with CF, SF, ZF, PF, and AF. But since that isn't the case, setc + seto are probably better than pushf / reload, but that is worth considering.

Even if there was syntax for flag-input (like there is for flag-output), there would be very little to gain from letting gcc insert some of its own non-flag-modifying instructions (like lea or mov ) between your two separate asm statements.

You don't want them reordered or anything, so putting them in the same asm statement makes by far the most sense. Even on an in-order CPU, add is low latency so it's not a big bottleneck to put a dependent instruction right after it.

And BTW, a jcc might be more efficient if overflow is an error condition that doesn't happen normally. But unfortunately GNU C asm goto doesn't support output operands. You could take a pointer input and modify dst in memory (and use a "memory" clobber), but forcing a store/reload sucks more than using setc or seto to produce an input for a compiler-generated test / jnz .

If you didn't also need an output, you could put C labels on a return true and a return false statement, which (after inlining) would turn your code into a jcc to wherever the compiler wanted to lay out the branches of an if() . eg see how Linux does it: (with extra complicating factors in these two examples I found): setting up to patch the code after checking a CPU feature once at boot, or something with a section for a jump table in arch_static_branch .)

gcc inline asm x86 CPU flags as input dependency

Question

1 answers

solution1
2 2017-12-29 20:16:34

gcc inline asm x86 CPU flags as input dependency

Question

1 answers

solution1 2 2017-12-29 20:16:34

solution1
2 2017-12-29 20:16:34