
Can GCC emit different instruction mnemonics when choosing between multiple alternative operand constraints of inline assembly?

I am trying to write inline x86-64 assembly for GCC to efficiently use the MULQ instruction. MULQ multiplies the 64-bit register RAX with another 64-bit value. The other value can be any 64-bit register (even RAX) or a value in memory. MULQ puts the high 64 bits of the product into RDX and the low 64 bits into RAX.

Now, it's easy enough to express a correct mulq as inline assembly:

#include <stdint.h>
static inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y)
{
    asm ("mulq %[y]" 
          : "=d" (*high), "=a" (*low)
          : "a" (x), [y] "rm" (y)    
        );
}

This code is correct, but not optimal. MULQ is commutative, so if y happened to be in RAX already, then it would be correct to leave y where it is and do the multiply. But GCC doesn't know that, so it will emit extra instructions to move the operands into their pre-defined places. I want to tell GCC that it can put either input in either location, as long as one ends up in RAX and the MULQ references the other location. GCC has a syntax for this, called "multiple alternative constraints". Notice the commas (but the overall asm() is broken; see below):

asm ("mulq %[y]" 
      : "=d,d" (*high), "=a,a" (*low)
      : "a,rm" (x), [y] "rm,a" (y)    
    );

Unfortunately, this is wrong. If GCC chooses the second alternative constraint, it will emit "mulq %rax". To be clear, consider this function:

uint64_t f()
{
    uint64_t high, low;
    uint64_t rax;
    asm("or %0,%0": "=a" (rax));
    mulq(&high, &low, 7, rax);
    return high;
}

Compiled with gcc -O3 -c -fkeep-inline-functions mulq.c, GCC emits this assembly:

0000000000000010 <f>:
  10:   or     %rax,%rax
  13:   mov    $0x7,%edx
  18:   mul    %rax
  1b:   mov    %rdx,%rax
  1e:   retq

The "mul %rax" should be "mul %rdx".

How can this inline asm be rewritten so it generates the correct output in every case?

This question from 2012 is still very relevant in 2019. Although gcc has changed, and some code that was suboptimal back in 2012 is optimal now, the reverse also holds.

Inspired by Whitlock's analysis, I've tested mulq in 9 different cases where each of x and y is either a constant (5, 6), a value in memory (bar, zar), or a value in rax (f1(), f2()):

uint64_t h1() { uint64_t h, l; mulq(&h, &l,    5,    6); return h + l; }
uint64_t h2() { uint64_t h, l; mulq(&h, &l,    5,  bar); return h + l; }
uint64_t h3() { uint64_t h, l; mulq(&h, &l,    5, f1()); return h + l; }
uint64_t h4() { uint64_t h, l; mulq(&h, &l,  bar,    5); return h + l; }
uint64_t h5() { uint64_t h, l; mulq(&h, &l,  bar,  zar); return h + l; }
uint64_t h6() { uint64_t h, l; mulq(&h, &l,  bar, f1()); return h + l; }
uint64_t h7() { uint64_t h, l; mulq(&h, &l, f1(),    5); return h + l; }
uint64_t h8() { uint64_t h, l; mulq(&h, &l, f1(),  bar); return h + l; }
uint64_t h9() { uint64_t h, l; mulq(&h, &l, f1(), f2()); return h + l; }

I've tested 5 implementations: Staufk, Whitlock, Hale, Burdo, and my own:

inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y) {
    asm("mulq %[y]" : [a]"=a,a"(*low), "=d,d"(*high) : "%a,rm"(x), [y]"rm,a"(y) : "cc");
}

All implementations are still unable to produce optimal code in all cases. While the others fail to produce optimal code for h3, h4, and h6, Whitlock's and mine fail only for h3:

h3():
 callq 4004d0 <f1()>
 mov %rax,%r8
 mov $0x5,%eax
 mul %r8
 add %rdx,%rax
 retq 

Everything else being equal, one can see that mine is simpler than Whitlock's. With an extra level of indirection and one of gcc's built-in functions (also available in clang, but I haven't tested it), it's possible to get optimal h3 by calling this function instead of mulq:

inline void mulq_fixed(uint64_t* high, uint64_t* low, uint64_t x, uint64_t y) {
    if (__builtin_constant_p(x))
        mulq(high, low, y, x);
    else
        mulq(high, low, x, y);
}

yields:

h3():
 callq 4004d0 <f1()>
 mov $0x5,%edx
 mul %rdx
 add %rdx,%rax
 retq 

The idea of using __builtin_constant_p was actually taken from gcc's documentation:

There is no way within the template to determine which alternative was chosen. However you may be able to wrap your asm statements with builtins such as __builtin_constant_p to achieve the desired results.
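
As a quick illustration of the builtin itself (the function names here are made up):

/* Illustration only: __builtin_constant_p(e) evaluates to 1 when the
   compiler can prove e folds to a compile-time constant, 0 otherwise. */
int always_one(void)    { return __builtin_constant_p(5); }  /* 1 */
int usually_zero(int n) { return __builtin_constant_p(n); }  /* 0: n is a runtime value */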

See for yourself in Compiler Explorer.

Note: There's another small and unexpected disadvantage of Whitlock's implementation. You need to check the 11010 (compile to binary) option in Compiler Explorer; otherwise the output is misleading and functions h1, ..., h9 appear to use the mulq instruction twice. This is because Compiler Explorer's parser does not process the assembler directives .ifnc/.else/.endif properly and simply removes them, showing both possible paths (the .ifnc branch and the .else branch). Alternatively, you can uncheck the .text option.

__asm__ ("mulq %3" : "=a,a" (*low), "=d,d" (*high) : "%0,0" (x), "r,m" (y))

This is similar to what you'll find in longlong.h, included with various GNU packages; "r,m" rather than "rm" is really for clang's benefit. Multiple-constraint syntax still appears to be important for clang, as discussed here. Which is a shame, but I still find that clang does a worse job of constraint matching (especially on x86[-64]) than gcc. For gcc:

__asm__ ("mulq %3" : "=a" (*low), "=d" (*high) : "%0" (x), "rm" (y))

would be sufficient, and would favour keeping (y) in a register unless register pressure was too high; but clang seems to spill in many cases. My tests show it will choose the first option "r" in the multiple-constraint syntax.

"%3" as a multiplicand in the instruction allows either a register (favoured) or memory location, as aliased by the third operand, relative to zero , which is (y) . "0" aliases the 'zero-th' operand: (*low) , which is explicitly "a" , ie, %rax for 64-bit. The leading % character in "%0" is the commutative operator: that is, (x) may commute with (y) if that helps register allocation. Obviously, mulq is commutative as: x * y == y * x .

We're actually quite constrained here. mulq multiplies the 64-bit operand %3 by the value in %rax to produce the 128-bit product %rdx:%rax. The "0" (x) constraint means that (x) has to be loaded into %rax, and (y) has to be loaded into a 64-bit register or memory address. However, the leading % means that (x) and the following input (y) may commute.
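
To recap the operand numbering in the gcc variant above:

/* "mulq %3" : "=a" (*low), "=d" (*high) : "%0" (x), "rm" (y)
 *   %0 = (*low)   output, "=a"  -> %rax
 *   %1 = (*high)  output, "=d"  -> %rdx
 *   %2 = (x)      input,  "%0"  -> same place as operand 0 (%rax);
 *                                  the leading % lets it commute with %3
 *   %3 = (y)      input,  "rm"  -> any 64-bit register or memory operand
 * so "mulq %3" multiplies %rax (holding x) by wherever y ended up.
 */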

I would also refer to the best practical inline assembly tutorial I've found. While the gcc references are 'authoritative', they make for a poor tutorial.


Thanks to Chris for picking up the error in my original constraint ordering.

Brett Hale's answer produces suboptimal code in some cases (at least on GCC 5.4.0).

Given:

static inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y) {
    __asm__ ("mulq %3" : "=a" (*low), "=d" (*high) : "%0" (x), "rm" (y) : "cc");
}

uint64_t foo();

Then mulq(&high, &low, foo(), 42) compiles to:

    call    foo
    movl    $42, %edx
    mulq    %rdx

…which is optimal.

But now reverse the order of the operands:

    mulq(&high, &low, 42, foo());

…and look at what happens to the compiled code:

    call    foo
    movq    %rax, %rdx
    movl    $42, %eax
    mulq    %rdx

Oops! What happened? The compiler is insisting on putting 42 in rax, and so it must move the return value from foo() out of rax. Evidently the % (commutative) operand constraint is defective.

Is there any way to optimize this? It turns out there is, though it's a bit messy.

static inline void mulq(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y) {
    __asm__ (
        ".ifnc %2,%%rax\n\t"
        "mulq %2\n\t"
        ".else\n\t"
        "mulq %3\n\t"
        ".endif"
        : "=a,a" (*low), "=d,d" (*high)
        : "a,rm" (x), "rm,a" (y)
        : "cc");
}

Now mulq(&high, &low, foo(), 42) compiles to:

    call    foo
    movl    $42, %edx
    .ifnc   %rax,%rax
    mulq    %rax
    .else
    mulq    %rdx
    .endif

And mulq(&high, &low, 42, foo()) compiles to:

    call    foo
    movl    $42, %edx
    .ifnc   %rdx,%rax
    mulq    %rdx
    .else
    mulq    %rax
    .endif

This code uses an assembler trick to get around the limitation that GCC doesn't let us emit different assembly code depending on the constraints alternative it has chosen. In each case, the assembler will emit only one of the two possible mulq instructions, depending on whether the compiler has chosen to put x or y in rax .

Sadly, this trick is suboptimal if we are multiplying the return value of foo() by the value at a memory location:

extern uint64_t bar;

Now mulq(&high, &low, bar, foo()) compiles to:

    call    foo
    .ifnc bar(%rip),%rax
    mulq bar(%rip)
    .else
    mulq %rax
    .endif

…which is optimal, but mulq(&high, &low, foo(), bar) compiles to:

    movq    bar(%rip), %rbx
    call    foo
    .ifnc   %rax,%rax
    mulq    %rax
    .else
    mulq    %rbx
    .endif

…which needlessly copies bar into rbx.

I have not been able to find a way to make GCC output optimal code in all cases, unfortunately. Forcing the multiplier to be a memory operand, for the sake of investigation, only causes GCC to load bar(%rip) into a register and then store that register into a temporary stack location that it then passes to mulq .
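
For reference, that memory-forcing variant is just the earlier asm with "rm" narrowed to "m" (a sketch, for investigation only):

static inline void mulq_mem(uint64_t *high, uint64_t *low, uint64_t x, uint64_t y) {
    /* Investigation only: with "m", GCC spills a register value to a
       stack slot just to satisfy the constraint, as described above. */
    __asm__ ("mulq %3" : "=a" (*low), "=d" (*high) : "%0" (x), "m" (y) : "cc");
}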

Separate from the general question about inline asm syntax:

You don't actually need inline asm for a 64x64 => 128-bit multiply.
GCC/clang/ICC know how to optimize a * (unsigned __int128)b to a single mul instruction. Given the choice between two GNU C extensions (inline asm vs. __int128), always avoid inline asm if you can get the compiler to emit nice asm on its own. https://gcc.gnu.org/wiki/DontUseInlineAsm

unsigned __int128 foo(unsigned long a, unsigned long b) {
    return a * (unsigned __int128)b;
}

Compiles on gcc/clang/ICC to this, on the Godbolt compiler explorer:

# gcc9.1 -O3  x86-64 SysV calling convention
foo(unsigned long, unsigned long):
        movq    %rdi, %rax
        mulq    %rsi
        ret                         # with the return value in RDX:RAX

Or return the high half with:

unsigned long umulhi64(unsigned long a, unsigned long b) {
    unsigned __int128 res = a * (unsigned __int128)b;
    return res >> 64;
}

        movq    %rdi, %rax
        mulq    %rsi
        movq    %rdx, %rax
        ret

GCC fully understands what's going on here, and that * is commutative so it can use either input as a memory operand if it only has one in a register but not the other.
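
For example (a hypothetical function multiplying by a global g), gcc uses the memory operand directly:

extern unsigned long g;

unsigned long umulhi_mem(unsigned long a) {
    return (a * (unsigned __int128)g) >> 64;
}

/* gcc -O3 emits (roughly):
       movq  %rdi, %rax
       mulq  g(%rip)          # memory source folded straight into the mul
       movq  %rdx, %rax
       ret
*/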

AFAIK it's not in general possible to use a different asm template depending on whether inputs come from registers or memory, unfortunately. So using a different strategy entirely (e.g. loading straight into SIMD registers instead of doing integer work) isn't possible.

The multi-alternative constraint thing is pretty limited, mainly only good for memory-source vs. memory-destination versions of an instruction like add, or things like that.
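
For example, here's a hedged sketch of that add case, where one template string happens to be valid for both alternatives:

/* Sketch: either the destination or the source may be a memory
   operand, but never both at once. */
static inline void add64(unsigned long *dst, unsigned long src) {
    __asm__ ("add %[s], %[d]"
             : [d] "+r,m" (*dst)    /* alt 0: register dest; alt 1: memory dest */
             : [s] "rm,r" (src)     /* alt 0: reg or mem src; alt 1: register src */
             : "cc");
}

Unlike the mulq case, "add %[s], %[d]" is correct for whichever alternative GCC picks, which is why the pattern fits here.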

Use a trick like this:

void multiply(unsigned& rhi, unsigned& rlo, unsigned a, unsigned b)
{
    __asm__("mull %[b]"
            : "=d"(rhi), "=a"(rlo)
            : "1"(a), [b] "rm"(b));
}

Notice "1" argument specification for input operand a . This means "put 'a' into the same place where argument #1 is".
