
Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems

I'm writing an arbitrary precision integer class to be used in C# (64-bit). Currently I'm working on the multiplication routine, using a recursive divide-and-conquer algorithm to break down the multi-bit multiplication into a series of primitive 64-to-128-bit multiplications, the results of which are then recombined by simple addition. In order to get a significant performance boost, I'm writing the code in native x64 C++, embedded in a C++/CLI wrapper to make it callable from C# code.

It all works great so far, as far as the algorithms are concerned. However, my problem is optimizing for speed. Since the 64-to-128-bit multiplication is the real bottleneck here, I tried to optimize my code right there. My first simple approach was a C++ function that implements this multiplication by performing four 32-to-64-bit multiplications and recombining the results with a couple of shifts and adds. This is the source code:

// 64-bit to 128-bit multiplication, using the following decomposition:
// (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij

public: static void Mul (UINT64  u8Factor1,
                         UINT64  u8Factor2,
                         UINT64& u8ProductL,
                         UINT64& u8ProductH)
    {
    UINT64 u8Result1, u8Result2;
    UINT64 u8Factor1L = u8Factor1 & 0xFFFFFFFFULL;
    UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
    UINT64 u8Factor1H = u8Factor1 >> 32;
    UINT64 u8Factor2H = u8Factor2 >> 32;

    u8ProductL = u8Factor1L * u8Factor2L;
    u8ProductH = u8Factor1H * u8Factor2H;
    u8Result1  = u8Factor1L * u8Factor2H;
    u8Result2  = u8Factor1H * u8Factor2L;

    if (u8Result1 > MAX_UINT64 - u8Result2)
        {
        u8Result1 +=  u8Result2;
        u8Result2  = (u8Result1 >> 32) | 0x100000000ULL; // add carry
        }
    else
        {
        u8Result1 +=  u8Result2;
        u8Result2  = (u8Result1 >> 32);
        }
    if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
        {
        u8Result2++;
        }
    u8ProductL += u8Result1;
    u8ProductH += u8Result2;
    return;
    }
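
For a quick sanity check of this function (and of the ASM variants below), the result can be compared against a 128-bit reference multiply. A minimal sketch, assuming GCC/Clang where the `unsigned __int128` extension is available (on MSVC, `_umul128` plays the same role); `MulRef` is an illustrative name, not part of the actual class:

```cpp
#include <cstdint>

// Reference 64x64->128 multiplication via the compiler's 128-bit type.
void MulRef (uint64_t u8Factor1, uint64_t u8Factor2,
             uint64_t& u8ProductL, uint64_t& u8ProductH)
    {
    unsigned __int128 p = (unsigned __int128)u8Factor1 * u8Factor2;
    u8ProductL = (uint64_t)p;
    u8ProductH = (uint64_t)(p >> 64);
    }
```

For example, squaring 0xFFFFFFFFFFFFFFFF must yield a high half of 0xFFFFFFFFFFFFFFFE and a low half of 1, since (2^64 - 1)^2 = 2^128 - 2^65 + 1.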

This function takes two 64-bit values and returns the 128-bit result as two 64-bit quantities passed by reference. This works fine. In the next step, I tried to replace the call to this function with ASM code that calls the CPU's MUL instruction. Since there's no inline ASM in x64 mode anymore, the code must be put into a separate .asm file. This is the implementation:

_TEXT segment

; =============================================================================
; multiplication
; -----------------------------------------------------------------------------
; 64-bit to 128-bit multiplication, using the x64 MUL instruction

AsmMul1 proc ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z

; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH

mov  rax, rcx            ; rax = Factor1
mul  rdx                 ; rdx:rax = Factor1 * Factor2
mov  qword ptr [r8], rax ; [r8] = ProductL
mov  qword ptr [r9], rdx ; [r9] = ProductH
ret

AsmMul1 endp

; =============================================================================

_TEXT ends
end

That's utterly simple and straightforward. The function is referenced from C++ code using an extern "C" forward declaration:

extern "C"
    {
    void AsmMul1 (UINT64, UINT64, UINT64&, UINT64&);
    }

To my surprise, it turned out to be significantly slower than the C++ function. To properly benchmark the performance, I've written a C++ function that computes 10,000,000 pairs of pseudo-random unsigned 64-bit values and performs multiplications in a tight loop, using those implementations one after another, with exactly the same values. The code is compiled in Release mode with optimizations turned on. The time spent in the loop is 515 msec for the ASM version, compared to 125 msec (!) for the C++ version.
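
For reference, the benchmark loop looks roughly like the following sketch. The PRNG, seed, and overall structure are assumptions (the actual benchmark precomputes the 10,000,000 pairs up front); accumulating a checksum keeps the optimizer from discarding the multiplications:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>

// xorshift64: cheap, deterministic pseudo-random 64-bit values
static uint64_t NextRand (uint64_t& s)
    {
    s ^= s << 13; s ^= s >> 7; s ^= s << 17;
    return s;
    }

// Candidate under test; a plain 128-bit multiply stands in here for
// the C++ function, AsmMul1, etc.
static void MulUnderTest (uint64_t f1, uint64_t f2,
                          uint64_t& lo, uint64_t& hi)
    {
    unsigned __int128 p = (unsigned __int128)f1 * f2;
    lo = (uint64_t)p;
    hi = (uint64_t)(p >> 64);
    }

uint64_t RunBench (uint64_t count)
    {
    uint64_t seed = 0x243F6A8885A308D3ull; // arbitrary nonzero seed
    uint64_t sum  = 0, lo, hi;
    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < count; i++)
        {
        uint64_t f1 = NextRand(seed);
        uint64_t f2 = NextRand(seed);
        MulUnderTest(f1, f2, lo, hi);
        sum += lo + hi; // checksum defeats dead-code elimination
        }
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
    std::printf("%lld msec, checksum %llx\n",
                (long long)ms.count(), (unsigned long long)sum);
    return sum;
    }
```

Because the seed is fixed, every implementation sees exactly the same input values, and the checksums must agree across all of them.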

That's quite strange. So I opened the disassembly window in the debugger and copied the ASM code generated by the compiler. This is what I found there, slightly edited for readability and for use with MASM:

AsmMul3 proc ; ?AsmMul3@@$$FYAX_K0AEA_K1@Z

; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH

; UINT64 u8Factor1L = u8Factor1 & 0xFFFFFFFFULL;
mov  eax,  0FFFFFFFFh
and  rax,  rcx

; UINT64 u8Factor2L = u8Factor2 & 0xFFFFFFFFULL;
mov  r10d, 0FFFFFFFFh
and  r10,  rdx

; UINT64 u8Factor1H = u8Factor1 >> 32;
shr  rcx,  20h

; UINT64 u8Factor2H = u8Factor2 >> 32;
shr  rdx,  20h

; u8ProductL = u8Factor1L * u8Factor2L;
mov  r11,  r10
imul r11,  rax
mov  qword ptr [r8], r11

; u8ProductH = u8Factor1H * u8Factor2H;
mov  r11,  rdx
imul r11,  rcx
mov  qword ptr [r9], r11

; u8Result1 = u8Factor1L * u8Factor2H;
imul rax,  rdx

; u8Result2 = u8Factor1H * u8Factor2L;
mov  rdx,  rcx
imul rdx,  r10

; if (u8Result1 > MAX_UINT64 - u8Result2)
mov  rcx,  rdx
neg  rcx
dec  rcx
cmp  rcx,  rax
jae  label1

; u8Result1 += u8Result2;
add  rax,  rdx

; u8Result2 = (u8Result1 >> 32) | 0x100000000ULL; // add carry
mov  rdx,  rax
shr  rdx,  20h
mov  rcx,  100000000h
or   rcx,  rdx
jmp  label2

; u8Result1 += u8Result2;
label1:
add  rax,  rdx

; u8Result2 = (u8Result1 >> 32);
mov  rcx,  rax
shr  rcx,  20h

; if (u8ProductL > MAX_UINT64 - (u8Result1 <<= 32))
label2:
shl  rax,  20h
mov  rdx,  qword ptr [r8]
mov  r10,  rax
neg  r10
dec  r10
cmp  r10,  rdx
jae  label3

; u8Result2++;
inc  rcx

; u8ProductL += u8Result1;
label3:
add  rdx,  rax
mov  qword ptr [r8], rdx

; u8ProductH += u8Result2;
add  qword ptr [r9], rcx
ret

AsmMul3 endp

Copying this code into my MASM source file and calling it from my benchmark routine resulted in 547 msec spent in the loop. That's slightly slower than the hand-written ASM function, and considerably slower than the C++ function. The latter is even stranger, since both are supposed to execute exactly the same machine code.

So I tried another variant, this time using hand-optimized ASM code that does exactly the same four 32-to-64-bit multiplications, but in a more straightforward way. The code should avoid jumps and immediate values, make use of the CPU FLAGS for carry evaluation, and use interleaving of instructions in order to avoid register stalls. This is what I came up with:

; 64-bit to 128-bit multiplication, using the following decomposition:
; (a*2^32 + i) (b*2^32 + j) = ab*2^64 + (aj + bi)*2^32 + ij

AsmMul2 proc ; ?AsmMul2@@$$FYAX_K0AEA_K1@Z

; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH

mov  rax,  rcx           ; rax = Factor1
mov  r11,  rdx           ; r11 = Factor2
shr  rax,  32            ; rax = Factor1H
shr  r11,  32            ; r11 = Factor2H
and  ecx,  ecx           ; rcx = Factor1L
mov  r10d, eax           ; r10 = Factor1H
and  edx,  edx           ; rdx = Factor2L

imul rax,  r11           ; rax = ab = Factor1H * Factor2H
imul r10,  rdx           ; r10 = aj = Factor1H * Factor2L
imul r11,  rcx           ; r11 = bi = Factor1L * Factor2H
imul rdx,  rcx           ; rdx = ij = Factor1L * Factor2L

xor  ecx,  ecx           ; rcx = 0
add  r10,  r11           ; r10 = aj + bi
adc  ecx,  ecx           ; rcx = carry (aj + bi)
mov  r11,  r10           ; r11 = aj + bi
shl  rcx,  32            ; rcx = carry (aj + bi) << 32
shl  r10,  32            ; r10 = lower (aj + bi) << 32
shr  r11,  32            ; r11 = upper (aj + bi) >> 32

add  rdx,  r10           ; rdx = ij + (lower (aj + bi) << 32)
adc  rax,  r11           ; rax = ab + (upper (aj + bi) >> 32)
mov  qword ptr [r8], rdx ; save ProductL
add  rax,  rcx           ; add carry (aj + bi) << 32
mov  qword ptr [r9], rax ; save ProductH
ret

AsmMul2 endp

The benchmark yielded 500 msec, so this seems to be the fastest of those three ASM implementations. However, the performance differences among them are quite marginal - and all of them are about four times slower than the naive C++ approach!
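
For what it's worth, the add/adc carry strategy of AsmMul2 can be mirrored in portable C++ with the GCC/Clang builtin `__builtin_add_overflow` (on MSVC, `_addcarry_u64` would be the rough analogue). This is a hedged sketch of the same idea, not the code that was benchmarked:

```cpp
#include <cstdint>

// Same decomposition as AsmMul2: the two middle partial products are
// summed first, and the hardware carry flags (adc) are emulated with
// the overflow flag returned by __builtin_add_overflow.
void Mul2Port (uint64_t a, uint64_t b, uint64_t& lo, uint64_t& hi)
    {
    uint64_t aH = a >> 32, aL = a & 0xFFFFFFFFull;
    uint64_t bH = b >> 32, bL = b & 0xFFFFFFFFull;
    uint64_t ab = aH * bH;  // high partial product
    uint64_t ij = aL * bL;  // low partial product
    uint64_t mid;
    uint64_t midCarry = __builtin_add_overflow(aH * bL, aL * bH, &mid); // adc
    uint64_t loCarry  = __builtin_add_overflow(ij, mid << 32, &lo);     // adc
    hi = ab + (mid >> 32) + (midCarry << 32) + loCarry;
    }
```

Squaring 0xFFFFFFFFFFFFFFFF with this routine again gives a high half of 0xFFFFFFFFFFFFFFFE and a low half of 1, matching the MUL-based reference.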

So what's going on here? It seems to me that there's some general performance penalty for calling ASM code from C++, but I can't find anything on the internet that might explain it. The way I'm interfacing with the ASM is exactly how Microsoft recommends it.

But now, watch out for another, still stranger thing! Well, there are compiler intrinsics, aren't there? The _umul128 intrinsic supposedly does exactly what my AsmMul1 function does, i.e. it invokes the 64-bit CPU MUL instruction. So I replaced the AsmMul1 call with a corresponding call to _umul128 . Now see what performance values I got in return (again, I'm running all four benchmarks sequentially in a single function):

_umul128: 109 msec
AsmMul2: 94 msec (hand-optimized ASM)
AsmMul3: 125 msec (compiler-generated ASM)
C++ function: 828 msec

Now the ASM versions are blazingly fast, with about the same relative differences as before. However, the C++ function is terribly slow now! Somehow the use of an intrinsic turns the entire performance picture upside down. Scary...

I haven't got any explanation for this strange behavior, and would be thankful at least for any hints about what's going on here. It would be even better if someone could explain how to get these performance issues under control. Currently I'm quite worried, because obviously a small change in the code can have huge performance impacts. I would like to understand the mechanisms at work here, and how to get reliable results.

And another thing: Why is the 64-to-128-bit MUL slower than four 64-to-64-bit IMULs?!

After a lot of trial-and-error, and additional extensive research on the internet, it seems I've found the reason for this strange performance behavior. The magic word is thunking of function entry points. But let me start from the beginning.

One observation I made is that it doesn't really matter which compiler intrinsic is used in order to turn my benchmark results upside down. Actually, it suffices to put a __nop() (CPU NOP opcode) anywhere inside a function to trigger this effect. It works even if it's placed right before the return . More tests have shown that the effect is restricted to the function that contains the intrinsic. The __nop() does nothing with respect to the code flow, but obviously it changes the properties of the containing function.

I've found a question on Stack Overflow that seems to tackle a similar problem: How to best avoid double thunking in C++/CLI native types . In the comments, the following additional information is found:

One of my own classes in our base library - which uses MFC - is called about a million times. We are seeing massive sporadic performance issues, and firing up the profiler I can see a thunk right at the bottom of this chain. That thunk takes longer than the method call.

That's exactly what I'm observing as well - "something" on the way into the function call is taking about four times longer than my code. Function thunks are explained to some extent in the documentation of the __clrcall modifier and in an article about Double Thunking . In the former, there's a hint at a side effect of using intrinsics:

You can directly call __clrcall functions from existing C++ code that was compiled by using /clr as long as that function has an MSIL implementation. __clrcall functions cannot be called directly from functions that have inline asm and call CPU-specific intrinsics, for example, even if those functions are compiled with /clr.

So, as far as I understand it, a function that contains intrinsics loses its __clrcall modifier, which is added automatically when the /clr compiler switch is specified - which is usually the case when the C++ code is compiled to managed code.

I don't get all of the details of this thunking and double-thunking business, but obviously it is required to make unmanaged functions callable from managed functions. However, it is possible to switch it off per function by embedding the function in a #pragma managed(push, off) / #pragma managed(pop) pair. Unfortunately, this #pragma doesn't work inside namespace blocks, so some editing might be required to place it everywhere it is supposed to occur.
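
A minimal sketch of that per-function switch (the function body is just a placeholder, not my actual code); compilers that don't know this #pragma simply ignore it, so the same source still builds as plain native code:

```cpp
#include <cstdint>

#pragma managed(push, off) // compile the enclosed functions as pure native code

// Native hot path: with the pragma active under /clr, no
// managed-to-unmanaged thunk is generated on entry.
uint64_t MulLowNative (uint64_t f1, uint64_t f2)
    {
    return f1 * f2; // placeholder for the real multiplication kernel
    }

#pragma managed(pop)
```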

I've tried this trick, placing all of my native multi-precision code inside this #pragma, and got the following benchmark results:

AsmMul1: 78 msec (64-to-128-bit CPU MUL)
AsmMul2: 94 msec (hand-optimized ASM, 4 x IMUL)
AsmMul3: 125 msec (compiler-generated ASM, 4 x IMUL)
C++ function: 109 msec

Now this looks reasonable, finally! Now all versions have about the same execution times, which is what I would expect from an optimized C++ program. Alas, there's still no happy end... Placing the winner AsmMul1 into my multi-precision multiplier yielded twice the execution time of the version that used the C++ function without the #pragma. The explanation is, in my opinion, that this code makes calls to unmanaged functions in other classes, which are outside the #pragma and hence have a __clrcall modifier. This seems to create significant overhead again.

Frankly, I'm tired of investigating this issue any further. Although the ASM PROC with the single MUL instruction seems to beat all other attempts, the gain is not as big as expected, and getting the thunking out of the way leads to so many changes in my code that I don't think it's worth the hassle. So I'll go on with the C++ function I wrote at the very beginning, originally destined to be just a placeholder for something better...

It seems to me that ASM interfacing in C++/CLI is not well supported, or maybe I'm still missing something basic here. Maybe there's a way to get this function thunking out of the way for just the ASM functions, but so far I haven't found a solution. Not even remotely.

Feel free to add your own thoughts and observations here - even if they are just speculative. I think it's still a highly interesting topic that needs much more investigation.
