
Unmanaged to managed interop performance in x86 and x64

In my tests I'm seeing the performance cost of unmanaged to managed interop double when compiling for x64 instead of x86. What is causing this slowdown?

I'm testing release builds not running under the debugger. The loop is 100,000,000 iterations.

In x86 I'm measuring an average of 8ns per interop call, which seems to match what I've seen in other places. Unity's x86 interop is 8.2ns. A Microsoft article and Hans Passant both mention 7ns. 8ns is 28 clock cycles on my machine which seems at least reasonable, though I do wonder if it's possible to go faster.

In x64 I'm measuring an average of 17ns per interop call. I can't find anyone mentioning a difference between x86 and x64, or even mentioning which they are referring to when giving times. Unity's x64 interop clocks in around 5.9ns.

Regular function calls (including into an unmanaged C++ DLL) cost an average of 1.3ns. This doesn't change significantly between x86 and x64.

Below is my minimal C++/CLI code for measuring this, though I'm seeing the same numbers in my actual project, which consists of a native C++ project calling into the managed side of a C++/CLI DLL.

#pragma managed
void
ManagedUpdate()
{
}


#pragma unmanaged
#include <wtypes.h>
#include <cstdint>
#include <cwchar>

struct ProfileSample
{
    static uint64_t frequency;
    uint64_t startTick;
    wchar_t* name;
    int count;

    ProfileSample(wchar_t* name_, int count_)
    {
        name = name_;
        count = count_;

        LARGE_INTEGER win32_startTick;
        QueryPerformanceCounter(&win32_startTick);
        startTick = win32_startTick.QuadPart;
    }

    ~ProfileSample()
    {
        LARGE_INTEGER win32_endTick;
        QueryPerformanceCounter(&win32_endTick);
        uint64_t endTick = win32_endTick.QuadPart;

        uint64_t deltaTicks = endTick - startTick;
        double nanoseconds = (double) deltaTicks / (double) frequency * 1000000000.0 / count;

        wchar_t buffer[128];
        swprintf(buffer, _countof(buffer), L"%s - %.4f ns\n", name, nanoseconds);
        OutputDebugStringW(buffer);

        if (!IsDebuggerPresent())
            MessageBoxW(nullptr, buffer, nullptr, 0);
    }
};

uint64_t ProfileSample::frequency = 0;

int CALLBACK
WinMain(HINSTANCE, HINSTANCE, PSTR, INT)
{
    LARGE_INTEGER frequency;
    QueryPerformanceFrequency(&frequency);
    ProfileSample::frequency = frequency.QuadPart;

    //Warm stuff up
    for ( size_t i = 0; i < 100; i++ )
        ManagedUpdate();

    const int num = 100000000;
    {
        ProfileSample p(L"ManagedUpdate", num);

        for ( size_t i = 0; i < num; i++ )
            ManagedUpdate();
    }

    return 0;
}

1) Why does x64 interop cost 17ns when x86 interop costs 8ns?

2) Is 8ns the fastest I can reasonably expect to go?

Edit 1

Additional information: CPU i7-4770k @ 3.5 GHz
Test case is a single C++/CLI project in VS2017.
Default Release configuration
Full optimization /O2
I've randomly played with settings like Favor Size or Speed, Omit Frame Pointers, Enable C++ Exceptions, and Security Check, and none appear to change the x86/x64 discrepancy.

Edit 2

I've stepped through the disassembly (not something I'm very familiar with at this point).

In x86 I see something along the lines of

call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
jmp     _IJWNOADThunkJumpTarget@0

In x64 I see

call    ManagedUpdate
jmp     ptr [__mep@?ManagedUpdate@@$$FYAXXZ]
        //Some jumping around that quickly leads to IJWNOADThunk::MakeCall:
call    IJWNOADThunk::FindThunkTarget
        //MakeCall uses the result from FindThunkTarget to jump into UMThunkStub:

FindThunkTarget is pretty heavy and it looks like most of the time is being spent there. So my working theory is that in x86 the thunk target is known and execution can more or less jump straight to it. But in x64 the thunk target is not known and a search process takes place to find it before being able to jump to it. I wonder why that is?

I have no recollection of ever giving a perf guarantee on code like this. 7 nanoseconds is the kind of perf you can expect on C++ Interop code, managed code calling native code. This is going the other way around, native code calling managed code, aka "reverse pinvoke".

You are definitely getting the slow flavor of this kind of interop. The "No AD" in IJWNOADThunk is the nasty little detail as far as I can see. This code did not get the micro-optimization love that is common in interop stubs. It is also highly specific to C++/CLI code. Nasty because it cannot assume anything about the AppDomain in which the managed code needs to run. In fact, it cannot even assume that the CLR is loaded and initialized.

Is 8ns the fastest I can reasonably expect to go?

Yes. You are in fact on the very low end with this measurement. Your hardware is a lot beefier than mine, I'm testing this on a mobile Haswell. I'm seeing between ~26 and 43 nanosec for x86, between ~40 and 46 nanosec for x64. So you are getting x3 better times, pretty impressive. Frankly, a bit too impressive but you are seeing the same code that I do so we must be measuring the same scenario.

Why does x64 interop cost 17ns when x86 interop costs 8ns?

This is not optimal code, the Microsoft programmer was very pessimistic about what corners he could cut. I have no real insight whether that was warranted, the comments in UMThunkStub.asm don't explain anything about choices.

There is not anything particularly special about reverse pinvoke. Happens all the time in, say, a GUI program that processes Windows messages. But that is done very differently, such code uses a delegate. Which is the way to get ahead and make this faster. Using Marshal::GetFunctionPointerForDelegate() is the key. I tried this approach:

using namespace System;
using namespace System::Runtime::InteropServices;


void* GetManagedUpdateFunctionPointer() {
    auto dlg = gcnew Action(&ManagedUpdate);
    // Keep the delegate alive for the lifetime of the raw function pointer
    auto tobereleased = GCHandle::Alloc(dlg);
    return Marshal::GetFunctionPointerForDelegate(dlg).ToPointer();
}

And used like this in the WinMain() function:

typedef void(__stdcall * testfuncPtr)();
testfuncPtr fptr = (testfuncPtr)GetManagedUpdateFunctionPointer();
//Warm stuff up
for (size_t i = 0; i < 100; i++) fptr();

    //...
    for ( size_t i = 0; i < num; i++ ) fptr();

Which made the x86 version a little faster. And the x64 version just as fast.

If you are going to use this approach then keep in mind that an instance method as the delegate target is faster than a static method in x64 code, the call stub has less work to do to rearrange the function arguments. And do beware I took a shortcut on the tobereleased variable, there is a possible memory management detail here and a GCHandle::Free() call might be preferred or necessary in a plug-in scenario.
