简体   繁体   中英

What is the use of this asm block using x86 DIV in c++?

Can someone help me understand benefit of using asm block for unsigned long long multiplication in terms of performance. It is related to competitive programming optimization. I guess it makes multiplication faster, but I can't understand code actually.

const int md = 998244353;
inline int mul(int a, int b)
{
#if !defined(_WIN32) || defined(_WIN64)
    return (int) ((long long) a * b % md);
#endif
    unsigned long long x = (long long) a * b;
    unsigned xh = (unsigned) (x >> 32), xl = (unsigned) x, d, m;
    asm(
            "divl %4; \n\t"
            : "=a" (d), "=d" (m)
            : "d" (xh), "a" (xl), "r" (md)
    );
    return m;
}

This code is actually a speedup for 32-bit (where 64x64 => 128 multiplication isn't available so compilers use actual division, but loses badly on 64-bit where compilers do use a multiplicative inverse to avoid slow hardware division entirely. Why does GCC use multiplication by a strange number in implementing integer division?

Also, it should really use __builtin_constant_p to only use inline asm if either input is not a compile-time constant after inlining and constant-propagation.


But anyway, x86's div instruction does EDX:EAX / (src) => quotient(EAX) and divisor(EDX). See When and why do we sign extend and use cdq with mul/div?

The "a" and "d" constraints ask for the low and high halves of the 64-bit product in EAX and EDX respectively as inputs.

From the Godbolt compiler explorer :

const int md = 998244353;
int mul(int a, int b)
{
#ifdef __x86_64__ // FIXME: just use the asm if defined(i386) to exclude all others
    return (int) ((long long) a * b % md);
#else
    if(__builtin_constant_p(a) && __builtin_constant_p(b))
        return (int) ((long long) a * b % md);
      // clang's builtin_constant_p is evaled before inlining, derp

    unsigned long long x = (long long) a * b;
    unsigned xh = (unsigned) (x >> 32), xl = (unsigned) x, d, m;
    asm(
            "divl %4; \n\t"
            : "=a" (d), "=d" (m)
            : "d" (xh), "a" (xl), "r" (md)
    );
    return m;
#endif
}

int main() {
    return mul(1,2);
}

compiles as follows with gcc8.2 -O3 -m32 :

mul(int, int):
    mov     eax, DWORD PTR [esp+8]
    mov     ecx, 998244353
    imul    DWORD PTR [esp+4]     # one-operand imul does EDX:EAX = EAX * src
    divl ecx;                     # EDX:EAX / ecx => EAX and EDX

    mov     eax, edx              # return the remainder
    ret

main:
    mov     eax, 2     # builtin_constant_p used the pure C, allowing constant-propagation
    ret

Notice that div is unsigned division, so this is doesn't match the C. The C is doing signed multiplication and signed division. This should probably use idiv , or cast the inputs to unsigned. Or maybe they really do want weird results with negative inputs for some strange reason.

So why can't compilers emit this on their own, without inline asm? Because if the quotient overflows the destination register (al/ax/eax/rax), it faults with a #DE (Divide Exception) instead of silently truncating like all other integer instructions.

A 64-bit / 32-bit => 32-bit division is only safe if you know the divisor is large enough for the possible dividends. (But even if it is, gcc still doesn't know look for this optimization. eg a * 7ULL / 9 can't possibly cause a #DE if done with a single mul and div , if a was a 32-bit type. But gcc will still emit a call to a libgcc helper function.)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM