
gcc inline assembly - operand type mismatch for `add', trying to create branchless code

I'm trying to do some code optimization to eliminate branches. The original C code is

if( a < b ) 
   k = (k<<1) + 1;
else
   k = (k<<1);

I intend to replace it with assembly code like below

mov a, %rax 
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k 

So I wrote C inline assembly code like below:

#define next(a, b, k)\
 __asm__("shl $0x1, %0; \
         xor %%rbx, %%rbx; \
         cmp %1, %2; \
         setb %%rbx; \
         addl  %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))

When I compile this code, I get these errors:

operand type mismatch for `add'
operand type mismatch for `setb'

How can I fix it?

Given that gcc (and it looks like gcc inline assembler) produces:

leal    (%rdx,%rdx), %eax
xorl    %edx, %edx
cmpl    %esi, %edi
setl    %dl
addl    %edx, %eax
ret

from

int f(int a, int b, int k)
{
  if( a < b ) 
    k = (k<<1) + 1;
  else
    k = (k<<1);

  return k;
}

I would think that writing your own inline assembler is a complete waste of time and effort.

As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, then you may need to upgrade to a somewhat newer compiler version (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca 2001, and I'm sure it's been in gcc for quite some time).

You could just do this and the compiler will not generate a branch:

k = (k<<1) + (a < b) ;
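
For reference, a minimal sketch (the function name is mine, not from the answer) that wraps this source-level rewrite in a function, so you can inspect the generated asm with gcc -O2 -S:

int f_branchless(int a, int b, int k)
{
    return (k << 1) + (a < b);   /* (a < b) evaluates to 0 or 1 in C */
}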

But if you must, I fixed some stuff in your code; now it should work as expected:

__asm__(
        "shl  $0x1, %0; \
        xor  %%eax, %%eax; \
        cmpl %3, %2; \
        setb %%al; \
        addl %%eax, %0;"
        :"=r"(k)        /* output */
        :"0"(k), "r"(a),"r"(b)  /* input */
        :"eax", "cc"   /* clobbered register */ 
);

Note that setb expects a reg8 or mem8 operand. You should add eax to the clobber list because you change it, as well as cc just to be safe. As for the register constraints, I'm not sure why you used those, but =r and r work just fine. And you need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO.
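
For completeness, here is a hedged usage sketch that wraps the fixed asm above in a function with a small test driver (the wrapper name, main, and the test values are mine, not from the answer):

#include <stdio.h>

static int next_branchless(int a, int b, int k)
{
    __asm__(
            "shl  $0x1, %0; \
            xor  %%eax, %%eax; \
            cmpl %3, %2; \
            setb %%al; \
            addl %%eax, %0;"
            :"=r"(k)                    /* output */
            :"0"(k), "r"(a), "r"(b)     /* input */
            :"eax", "cc"                /* clobbered register and flags */
    );
    return k;
}

int main(void)
{
    printf("%d\n", next_branchless(1, 2, 0));   /* (0<<1) + (1<2) -> 1 */
    printf("%d\n", next_branchless(2, 1, 1));   /* (1<<1) + (2<1) -> 2 */
    return 0;
}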

Here are the mistakes in your code:

  1. Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints.)
  2. Error: operand type mismatch for 'setb' -- SETB only takes 8-bit operands, i.e. setb %bl works while setb %rbx doesn't.
  3. The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB. (A minimal sketch applying these fixes follows this list.)
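
Not from the original answer, but a minimal sketch applying those fixes in isolation, assuming unsigned 64-bit operands (the helper name is mine):

static inline int is_below(unsigned long a, unsigned long b)
{
    unsigned char t;                 /* 8-bit destination, so setb gets a reg8 */
    __asm__("cmp %2, %1\n\t"         /* AT&T order: computes a - b, CF = (a < b) unsigned */
            "setb %0"
            : "=r"(t)
            : "r"(a), "r"(b)         /* "r" instead of "g": CMP can't take two immediates */
            : "cc");
    return t;
}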

Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.

In the fixed source code posted below, I assume your operands are unsigned since you're using SETB instead of SETL. I switched from RBX to RCX to hold the temporary value, because RCX is a call-clobbered register in the ABI, and used the "=&c" constraint to mark it as an earlyclobber operand, since RCX is cleared before the inputs a and b are read:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
    uint64_t tmp;
    __asm__("shl $0x1, %[k];"
        "xor %%rcx, %%rcx;"
        "cmp %[b], %[a];"
        "setb %%cl;"
        "addq %%rcx, %[k];"
        : /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
        : /* inputs  */ [a] "r" (a), [b] "g" (b)
        : /* clobbers */ "cc");
    return k;
}

int main()
{
    uint64_t t, t0, k;
    k = next(1, 2, 0);
    printf("%" PRId64 "\n", k);

    scanf("%" SCNd64 "%" SCNd64, &t, &t0);
    k = next(t, t0, k);
    printf("%" PRId64 "\n", k);

    return 0;
}

main() translates to:

<+0>:   push   %rbx
<+1>:   xor    %ebx,%ebx
<+3>:   mov    $0x4006c0,%edi
<+8>:   mov    $0x1,%bl
<+10>:  xor    %eax,%eax
<+12>:  sub    $0x10,%rsp
<+16>:  shl    %rax
<+19>:  xor    %rcx,%rcx
<+22>:  cmp    $0x2,%rbx
<+26>:  setb   %cl
<+29>:  add    %rcx,%rax
<+32>:  mov    %rax,%rbx
<+35>:  mov    %rax,%rsi
<+38>:  xor    %eax,%eax
<+40>:  callq  0x400470 <printf@plt>
<+45>:  lea    0x8(%rsp),%rdx
<+50>:  mov    %rsp,%rsi
<+53>:  mov    $0x4006c5,%edi
<+58>:  xor    %eax,%eax
<+60>:  callq  0x4004a0 <__isoc99_scanf@plt>
<+65>:  mov    (%rsp),%rax
<+69>:  mov    %rbx,%rsi
<+72>:  mov    $0x4006c0,%edi
<+77>:  shl    %rsi
<+80>:  xor    %rcx,%rcx
<+83>:  cmp    0x8(%rsp),%rax
<+88>:  setb   %cl
<+91>:  add    %rcx,%rsi
<+94>:  xor    %eax,%eax
<+96>:  callq  0x400470 <printf@plt>
<+101>: add    $0x10,%rsp
<+105>: xor    %eax,%eax
<+107>: pop    %rbx
<+108>: retq   

You can see the result of next() being moved into RSI before each printf() call.

Summary:

  • Branchless might not even be the best choice.
  • Inline asm defeats some other optimizations, so try other source changes first, e.g. ?: often compiles branchlessly; also use booleans as integer 0/1.
  • If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
  • The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k] . Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.

If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, eg on Linux perf record -ebranch-misses ./my_program && perf report ), then yes you should do something to get branchless code.

(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default , because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).

If this is for a binary search, branchless probably is a good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path.)

Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or one that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determines which table entry is used for the current branch.)

Related: gcc optimization flag -O3 makes code slower than -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile-guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)


Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.

Inline asm defeats many optimizations, most obviously constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm .

You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
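
A hedged sketch of that idea (the function name is mine, not from the answer): fall back to plain C when the compiler can prove the inputs are compile-time constants, so constant-propagation still works, and otherwise use the cmp/adc trick shown further down:

static inline unsigned long next_k(unsigned long a, unsigned long b, unsigned long k)
{
    if (__builtin_constant_p(a) && __builtin_constant_p(b)) {
        return (k << 1) + (a < b);      /* pure C: gcc can fold the constants away */
    } else {
        __asm__("cmpq %[b], %[a]\n\t"   /* CF = (a < b), unsigned */
                "adc  %[k], %[k]"       /* k = k + k + CF = (k<<1) + (a < b) */
                : [k] "+r" (k)
                : [a] "r" (a), [b] "re" (b)
                : "cc");
        return k;
    }
}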

Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (eg a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp .

It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline asm destroys the ability of the compiler to reuse any temporary results, or to spread out the instructions to mix with other compiler-generated code. (Instruction-scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)


That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.

Your a<b is an unsigned compare (you're using setb , the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k .

So the asm you want (compiler-generated or with inline asm) is:

# k in %rax,    a in %rdi,  b in %rsi   for this example
cmp     %rsi, %rdi      # CF = (a < b) = the carry-out from rdi - rsi
adc     %rax, %rax      # rax = (k<<1) + CF  = (k<<1) + (a < b)

Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb , but they don't manage to combine both.

Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A , and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” ).

// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
    if( a < b ) 
       k = (k<<1) + 1;
    else
       k = (k<<1);
    return k;
}

These are on the Godbolt compiler explorer, along with a couple of other versions. (I used unsigned in this version because you had addl in your asm. Using unsigned long makes everything except the xor-zeroing into 64-bit registers; xor %eax,%eax is still the best way to zero RAX.)

 # gcc7.2 -O3  When it can keep the value in the same reg, uses add instead of lea
    leal    (%rdx,%rdx), %eax       #, <retval>
    cmpl    %esi, %edi      # b, a
    adcl    $0, %eax        #, <retval>
    ret

 # clang 6.0 snapshot -O3
    xorl    %eax, %eax
    cmpl    %esi, %edi
    setb    %al
    leal    (%rax,%rdx,2), %eax
    retq

 # ICC18, same as gcc but fails to save a MOV
    addl    %edx, %edx      #14.16
    cmpl    %esi, %edi      #17.12
    adcl    $0, %edx        #17.12
    movl    %edx, %eax      #17.12
    ret                     #17.12

MSVC is the only compiler that doesn't make branchless code without hand-holding. Writing (k<<1) + (a < b); gives us exactly the same xor / cmp / setb / lea sequence as clang above (but with the Windows x86-64 calling convention).

funcarg PROC                         ; x86-64 MSVC CL19 -Ox
    lea      eax, DWORD PTR [r8*2+1]
    cmp      ecx, edx
    jb       SHORT $LN3@funcarg
    lea      eax, DWORD PTR [r8+r8]   ; conditionally jumped over
$LN3@funcarg:
    ret      0

Inline asm

The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.

This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m forms of CMP. I used two alternatives that split things up not by opcode but by which side included the possible memory operand. "rme" is like "g" (rmi), but limited to 32-bit sign-extended immediates.

unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
    __asm__("cmpq %[b], %[a]   \n\t"
            "adc %[k],%[k]"
        : /* outputs */ [k] "+r,r" (k)
        : /* inputs  */ [a] "r,rm" (a), [b] "rme,re" (b)
        : /* clobbers */ "cc");  // "cc" clobber is implicit for x86, but it doesn't hurt
    return k;
}

I put this on Godbolt with callers that inline it in different contexts . gcc7.2 -O3 does what we expect for the stand-alone version (with register args).

inlineasm:
    movq    %rdx, %rax      # k, k
    cmpq %rsi, %rdi         # b, a
    adc %rax,%rax   # k
    ret

We can look at how well our constraints work by inlining into other callers:

unsigned long call_with_mem(unsigned long *aptr) {
    return inlineasm(*aptr, 5, 4);
}
    # gcc
    movl    $4, %eax        #, k
    cmpq $55555, (%rdi)     #, *aptr_3(D)
    adc %rax,%rax   # k
    ret

With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
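
A hedged example of that (the caller name and the constant are mine, not from the original): a constant that doesn't fit in a sign-extended 32-bit immediate can't match the "e" alternative, so gcc has to movabs it into a register before the cmpq.

unsigned long call_with_big_imm(unsigned long *aptr) {
    return inlineasm(*aptr, 0x123456789abcdef0UL, 4);
}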

Compare what we get from pure C:

unsigned long call_with_mem_nonasm(unsigned long *aptr) {
    return handhold(*aptr, 5, 4);
}
    # gcc -O3
    xorl    %eax, %eax      # tmp93
    cmpq    $4, (%rdi)      #, *aptr_3(D)
    setbe   %al   #, tmp93
    addq    $8, %rax        #, k
    ret
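
(handhold() isn't shown in this excerpt; judging from the setbe/add output above, it's presumably just the pure-C branchless form suggested earlier, something like the sketch below.)

unsigned long handhold(unsigned long a, unsigned long b, unsigned long k) {
    return (k << 1) + (a < b);    /* hedged guess at the missing source */
}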

adc $8, %rax without setc would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k .


clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.

inlineasm:   # clang 5.0
    movq    %rsi, -8(%rsp)
    cmpq    -8(%rsp), %rdi
    adcq    %rdx, %rdx
    movq    %rdx, %rax
    retq

BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.
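
A hedged sketch of that suggestion (the helper name is mine): compute k<<1 in C, so the compiler is free to use shl, add, or lea for it, and let the asm only add the carry:

unsigned long inlineasm_shifted(unsigned long a, unsigned long b, unsigned long k)
{
    unsigned long k2 = k << 1;          /* compiler picks shl/add/lea and can fold constants */
    __asm__("cmpq %[b], %[a]   \n\t"    /* CF = (a < b), unsigned */
            "adc  $0, %[k2]"            /* k2 += CF */
        : [k2] "+r" (k2)
        : [a] "r" (a), [b] "re" (b)
        : "cc");
    return k2;
}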
