
Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems

I'm writing an arbitrary-precision integer class to be used in C# (64-bit). Currently I'm working on the multiplication routine, using a recursive divide-and-conquer algorithm to break the multi-word multiplication down into a series of primitive 64-to-128-bit multiplications, the results of which are then recombined by simple addition. To get a significant performance boost, I'm writing the code in native x64 C++, embedded in a C++/CLI wrapper to make it callable from C# code.

It all works great so far, as far as the algorithms are concerned. However, my problem is optimizing for speed. Since the 64-to-128-bit multiplication is the real bottleneck here, I tried to optimize my code right there. My first simple approach was a C++ function that implements this multiplication by performing four 32-to-64-bit multiplications and recombining the results with a couple of shifts and adds. This is the source code:

// 64-bit to 128-bit multiplication, using the following decomposition:
// (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij

public: static void Mul (UINT64  u8Factor1,
                         UINT64  u8Factor2,
                         UINT64& u8ProductL,
                         UINT64& u8ProductH)
    {
    UINT64 u8Result1, u8Result2;
    UINT64 u8Factor1L = u8Factor1 & 0xFFFFFFFFULL;
    UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
    UINT64 u8Factor1H = u8Factor1 >> 32;
    UINT64 u8Factor2H = u8Factor2 >> 32;

    u8ProductL = u8Factor1L * u8Factor2L; // ij
    u8ProductH = u8Factor1H * u8Factor2H; // ab
    u8Result1  = u8Factor1L * u8Factor2H; // bi
    u8Result2  = u8Factor1H * u8Factor2L; // aj

    if (u8Result1 > MAX_UINT64 - u8Result2)
        {
        u8Result1 +=  u8Result2;
        u8Result2  = (u8Result1 >> 32) | 0x100000000ULL; // add carry
        }
    else
        {
        u8Result1 +=  u8Result2;
        u8Result2  = (u8Result1 >> 32);
        }
    if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
        {
        u8Result2++;
        }
    u8ProductL += u8Result1;
    u8ProductH += u8Result2;
    return;
    }
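
As a quick sanity check (my own example, not part of the original benchmark): calling Mul with both factors set to 0xFFFFFFFFFFFFFFFF should return u8ProductL = 1 and u8ProductH = 0xFFFFFFFFFFFFFFFE, since (2^64 - 1)^2 = 2^128 - 2^65 + 1.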

This function expects two 64-bit values and returns the 128-bit result as two 64-bit quantities passed by reference. This works fine. In the next step, I tried to replace the call to this function with ASM code that uses the CPU's MUL instruction. Since there's no inline ASM in x64 mode anymore, the code has to go into a separate .asm file. This is the implementation:

_TEXT segment

; =============================================================================
; multiplication
; -----------------------------------------------------------------------------
; 64-bit to 128-bit multiplication, using the x64 MUL instruction

AsmMul1 proc ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z

; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH

mov  rax, rcx            ; rax = Factor1
mul  rdx                 ; rdx:rax = Factor1 * Factor2
mov  qword ptr [r8], rax ; [r8] = ProductL
mov  qword ptr [r9], rdx ; [r9] = ProductH
ret

AsmMul1 endp

; =============================================================================

_TEXT ends
end
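
A note on the build setup (my assumption, not described in the original): the .asm file has to be assembled separately - for example with ml64 /c on the command line, or via the MASM build customization in Visual Studio - and the resulting object file is linked into the project together with the C++/CLI code.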

That's as simple and straightforward as it gets. The function is referenced from C++ code using an extern "C" forward declaration:

extern "C"
    {
    void AsmMul1 (UINT64, UINT64, UINT64&, UINT64&);
    }
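
For context, the C++/CLI side of the wrapper might look roughly like the following sketch. This is my own illustration of the setup described above, not the actual code; the class and method names (BigIntOps, Mul64) are hypothetical, and UINT64 is the same typedef used throughout the question:

public ref class BigIntOps // hypothetical managed wrapper class
    {
    public:
        static void Mul64 (UINT64  u8Factor1,  UINT64  u8Factor2,
                           UINT64% u8ProductL, UINT64% u8ProductH)
            {
            UINT64 u8L, u8H;
            AsmMul1 (u8Factor1, u8Factor2, u8L, u8H); // the managed-to-native transition happens here
            u8ProductL = u8L;
            u8ProductH = u8H;
            }
    };

On the C# side this shows up as static void Mul64 (ulong, ulong, ref ulong, ref ulong).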

To my surprise, it turned out to be significantly slower than the C++ function. To benchmark the performance properly, I wrote a C++ function that generates 10,000,000 pairs of pseudo-random unsigned 64-bit values and performs the multiplications in a tight loop, running those implementations one after another on exactly the same values. The code is compiled in Release mode with optimizations turned on. The time spent in the loop is 515 msec for the ASM version, compared to 125 msec (!) for the C++ version.
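
For what it's worth, the benchmark loop has roughly the following shape. This is a sketch with my own names and timer choice (std::chrono), not the original harness; MulToTest stands for whichever implementation is currently being timed, and UINT64 is the typedef from above:

#include <chrono>
#include <random>
#include <utility>
#include <vector>

// void MulToTest (UINT64, UINT64, UINT64&, UINT64&); // implementation under test

static long long BenchmarkMul ()
    {
    std::mt19937_64 rng (12345);                       // fixed seed, same values for every run
    std::vector<std::pair<UINT64, UINT64>> values (10000000);
    for (auto& v : values)
        {
        v.first  = rng ();
        v.second = rng ();
        }

    UINT64 u8L, u8H, u8Check = 0;
    auto tStart = std::chrono::steady_clock::now ();
    for (const auto& v : values)
        {
        MulToTest (v.first, v.second, u8L, u8H);
        u8Check ^= u8L ^ u8H;                          // keep the results alive
        }
    auto tStop = std::chrono::steady_clock::now ();

    volatile UINT64 u8Sink = u8Check;                  // prevent the loop from being optimized away
    (void) u8Sink;
    return std::chrono::duration_cast<std::chrono::milliseconds> (tStop - tStart).count ();
    }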

That's quite strange. So I opened the disassembly window in the debugger and copied the ASM code generated by the compiler. This is what I found there, slightly edited for readability and for use with MASM:

AsmMul3 proc ; ?AsmMul3@@$$FYAX_K0AEA_K1@Z

; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH

; UINT64 u8Factor1L = u8Factor1 & 0xFFFFFFFFULL;
mov  eax,  0FFFFFFFFh
and  rax,  rcx

; UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
mov  r10d, 0FFFFFFFFh
and  r10,  rdx

; UINT64 u8Factor1H = u8Factor1 >> 32;
shr  rcx,  20h

; UINT64 u8Factor2H = u8Factor2 >> 32;
shr  rdx,  20h

; u8ProductL = u8Factor1L * u8Factor2L;
mov  r11,  r10
imul r11,  rax
mov  qword ptr [r8], r11

; u8ProductH = u8Factor1H * u8Factor2H;
mov  r11,  rdx
imul r11,  rcx
mov  qword ptr [r9], r11

; u8Result1 = u8Factor1L * u8Factor2H;
imul rax,  rdx

; u8Result2 = u8Factor1H * u8Factor2L;
mov  rdx,  rcx
imul rdx,  r10

; if (u8Result1 > MAX_UINT64 - u8Result2)
mov  rcx,  rdx
neg  rcx
dec  rcx
cmp  rcx,  rax
jae  label1

; u8Result1 += u8Result2;
add  rax,  rdx

; u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
mov  rdx,  rax
shr  rdx,  20h
mov  rcx,  100000000h
or   rcx,  rdx
jmp  label2

; u8Result1 += u8Result2;
label1:
add  rax,  rdx

; u8Result2 = (u8Result1 >> 32);
mov  rcx,  rax
shr  rcx,  20h

; if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
label2:
shl  rax,  20h
mov  rdx,  qword ptr [r8]
mov  r10,  rax
neg  r10
dec  r10
cmp  r10,  rdx
jae  label3

; u8Result2++;
inc  rcx

; u8ProductL += u8Result1;
label3:
add  rdx,  rax
mov  qword ptr [r8], rdx

; u8ProductH += u8Result2;
add  qword ptr [r9], rcx
ret

AsmMul3 endp

Copying this code into my MASM source file and calling it from my benchmark routine resulted in 547 msec spent in the loop. That's slightly slower than the AsmMul1 function, and considerably slower than the C++ function. That's even stranger, since the latter two are supposed to execute exactly the same machine code.

So I tried another variant, this time using hand-optimized ASM code that performs exactly the same four 32-to-64-bit multiplications, but in a more straightforward way. The code avoids jumps and immediate values, makes use of the CPU flags for carry evaluation, and interleaves instructions to avoid register stalls. This is what I came up with:

; 64-bit to 128-bit multiplication, using the following decomposition:
; (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij

AsmMul2 proc ; ?AsmMul2@@$$FYAX_K0AEA_K1@Z

; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH

mov  rax,  rcx           ; rax = Factor1
mov  r11,  rdx           ; r11 = Factor2
shr  rax,  32            ; rax = Factor1H
shr  r11,  32            ; r11 = Factor2H
and  ecx,  ecx           ; rcx = Factor1L
mov  r10d, eax           ; r10 = Factor1H
and  edx,  edx           ; rdx = Factor2L

imul rax,  r11           ; rax = ab = Factor1H * Factor2H
imul r10,  rdx           ; r10 = aj = Factor1H * Factor2L
imul r11,  rcx           ; r11 = bi = Factor1L * Factor2H
imul rdx,  rcx           ; rdx = ij = Factor1L * Factor2L

xor  ecx,  ecx           ; rcx = 0
add  r10,  r11           ; r10 = aj + bi
adc  ecx,  ecx           ; rcx = carry (aj + bi)
mov  r11,  r10           ; r11 = aj + bi
shl  rcx,  32            ; rcx = carry (aj + bi) << 32
shl  r10,  32            ; r10 = lower (aj + bi) << 32
shr  r11,  32            ; r11 = upper (aj + bi) >> 32

add  rdx,  r10           ; rdx = ij + (lower (aj + bi) << 32)
adc  rax,  r11           ; rax = ab + (upper (aj + bi) >> 32)
mov  qword ptr [r8], rdx ; save ProductL
add  rax,  rcx           ; add carry (aj + bi) << 32
mov  qword ptr [r9], rax ; save ProductH
ret

AsmMul2 endp

The benchmark yielded 500 msec, so this seems to be the fastest of the three ASM implementations. However, the performance differences among them are quite marginal - and all of them are about four times slower than the naive C++ approach!

So what's going on here? It seems to me that there's some general performance penalty for calling ASM code from C++, but I can't find anything on the internet that might explain it. The way I'm interfacing with the ASM code is exactly how Microsoft recommends it.

But now, watch out for something even stranger! There are compiler intrinsics, aren't there? The _umul128 intrinsic is supposed to do exactly what my AsmMul1 function does, i.e. issue the 64-bit CPU MUL instruction. So I replaced the AsmMul1 call with a corresponding call to _umul128 (the replacement call itself is sketched after the list below). Now see what performance values I got in return (again, I'm running all four benchmarks sequentially in a single function):

_umul128: 109 msec
AsmMul2: 94 msec (hand-optimized ASM)
AsmMul3: 125 msec (compiler-generated ASM)
C++ function: 828 msec
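
For reference, the replacement call itself looked roughly like this (a minimal sketch; _umul128 is declared in <intrin.h>, returns the low 64 bits of the product and writes the high 64 bits through its pointer argument):

#include <intrin.h>

UINT64 u8ProductH;
UINT64 u8ProductL = _umul128 (u8Factor1, u8Factor2, &u8ProductH);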

Now the ASM versions are blazingly fast, with about the same relative differences as before. However, the C++ function has become terribly slow! Somehow the use of an intrinsic turns all the performance figures upside down. Scary...

I don't have any explanation for this strange behavior, and would be thankful for at least some hints about what's going on here. It would be even better if someone could explain how to get these performance issues under control. Currently I'm quite worried, because obviously a tiny change in the code can have a huge performance impact. I would like to understand the mechanisms at work here, and how to get reliable results.

And another thing: Why is the 64-to-128-bit MUL slower than four 64-to-64-bit IMULs?!

After a lot of trial-and-error, and additional extensive research on the Internet, it seems I've found the reason for this strange performance behavior. The magic word is thunking of function entry points. But let me start from the beginning.

One observation I made is that it doesn't really matter which compiler intrinsic is used to turn my benchmark results upside down. Actually, it suffices to put a __nop() (CPU NOP opcode) anywhere inside a function to trigger this effect. It works even if it's placed right before the return. Further tests have shown that the effect is restricted to the function that contains the intrinsic. The __nop() does nothing with respect to the code flow, but it obviously changes the properties of the containing function.
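
To illustrate the experiment (a sketch, not the exact benchmark code; MulWithNop is a hypothetical wrapper, and Mul stands for the C++ function shown at the beginning):

#include <intrin.h> // __nop

static void MulWithNop (UINT64  u8Factor1, UINT64  u8Factor2,
                        UINT64& u8ProductL, UINT64& u8ProductH)
    {
    // the original C++ multiplication, unchanged
    Mul (u8Factor1, u8Factor2, u8ProductL, u8ProductH);
    __nop (); // a single NOP - irrelevant for the code flow, yet enough to
              // change how the containing function is compiled under /clr
    }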

I've found a question on Stack Overflow that seems to tackle a similar problem: "How to best avoid double thunking in C++/CLI native types". In the comments, the following additional information can be found:

One of my own classes in our base library - which uses MFC - is called about a million times. We are seeing massive sporadic performance issues, and firing up the profiler I can see a thunk right at the bottom of this chain. That thunk takes longer than the method call.

That's exactly what I'm observing as well - "something" along the way of the function call is taking about four times longer than my code. Function thunks are explained to some extent in the documentation of the __clrcall modifier and in an article about Double Thunking. In the former, there's a hint about a side effect of using intrinsics:

You can directly call __clrcall functions from existing C++ code that was compiled by using /clr as long as that function has an MSIL implementation. __clrcall functions cannot be called directly from functions that have inline asm and call CPU-specific intrinsics, for example, even if those functions are compiled with /clr.

So, as far as I understand it, a function that contains intrinsics loses its __clrcall modifier, which is added automatically when the /clr compiler switch is specified - which is usually the case even when the C++ functions are meant to be compiled to native code.

I don't get all of the details of this thunking and double-thunking business, but obviously it is required to make unmanaged functions callable from managed functions. However, it is possible to switch it off per function by wrapping the function in a #pragma managed(push, off) / #pragma managed(pop) pair. Unfortunately, this #pragma doesn't work inside namespace blocks, so some editing might be required to place it everywhere it needs to go.
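
In practice this looks something like the following sketch (my own minimal example; it assumes a source file compiled with /clr, and MulNative is a hypothetical name):

#include <intrin.h> // _umul128

#pragma managed(push, off) // everything from here on is compiled as native x64 code

static void MulNative (UINT64  u8Factor1, UINT64  u8Factor2,
                       UINT64& u8ProductL, UINT64& u8ProductH)
    {
    // native code, no __clrcall: calls between functions inside this
    // region involve no managed/unmanaged transition
    u8ProductL = _umul128 (u8Factor1, u8Factor2, &u8ProductH);
    }

#pragma managed(pop) // back to managed (/clr) compilation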

I've tried this trick, placing all of my native multi-precision code inside this #pragma, and got the following benchmark results:

AsmMul1: 78 msec (64-to-128-bit CPU MUL)
AsmMul2: 94 msec (hand-optimized ASM, 4 x IMUL)
AsmMul3: 125 msec (compiler-generated ASM, 4 x IMUL)
C++ function: 109 msec

Now this finally looks reasonable! All versions have about the same execution times, which is what I would expect from an optimized C++ program. Alas, there's still no happy ending... Placing the winner AsmMul1 into my multi-precision multiplier yielded twice the execution time of the version that uses the C++ function without the #pragma. The explanation, in my opinion, is that this code calls unmanaged functions in other classes, which lie outside the #pragma and hence carry the __clrcall modifier. This seems to create significant overhead again.

Frankly, I'm tired of investigating this issue any further. Although the ASM PROC with the single MUL instruction seems to beat all other attempts, the gain is not as big as expected, and getting the thunking out of the way requires so many changes in my code that I don't think it's worth the hassle. So I'll carry on with the C++ function I wrote at the very beginning, originally intended to be just a placeholder for something better...

It seems to me that ASM interfacing in C++/CLI is not well supported, or maybe I'm still missing something basic here. Maybe there's a way to get this function thunking out of the way for just the ASM functions, but so far I haven't found a solution. Not even remotely.

Feel free to add your own thoughts and observations here - even if they are just speculative. I think it's still a highly interesting topic that needs much more investigation.
