
Why are my memory benchmarks giving strange results?

I have recently been running some basic benchmarks written in C# to try to determine why some seemingly identical HyperV remote workstations seem to be running far slower than others. Their results on most of the basic tests that I am running have been totally identical, but the results from a basic memory access benchmark (specifically, the time taken to initialise a two-dimensional 1000x1000 array of doubles to 0) differ by a factor of 40.

To investigate further, I have run several other experiments to narrow down the issue. Running the same test with an exponentially increasing array size (until an OutOfMemoryException occurs) shows no difference between the various remotes until the array size exceeds one million elements, then an immediate difference of a factor of around 40. In fact, testing incremental array sizes, the time taken to initialise increases proportionally with array size up to a size of exactly 999,999; then, on the 'slow' remotes, the time taken increases by 900%, while on the 'fast' remotes it decreases by 70% as the array size reaches 1000x1000. From there, it continues to scale proportionally. The same phenomenon also happens with array sizes of 1M x 1 and 1 x 1M, though to a much smaller extent (changes of +50% and -30% instead).
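For concreteness, a minimal sketch of the kind of benchmark described above (the class and method names here are assumed, not taken from the actual test code): time the first-touch initialisation of an n x n array of doubles for a few sizes around the breakpoint.

```csharp
using System;
using System.Diagnostics;

class InitBenchmark
{
    // Returns elapsed milliseconds for zero-initialising an n x n double array.
    public static double TimeInit(int n)
    {
        var sw = Stopwatch.StartNew();
        var a = new double[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i, j] = 0.0;          // first touch of each page
        sw.Stop();
        GC.KeepAlive(a);
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        foreach (int n in new[] { 500, 999, 1000, 2000 })
            Console.WriteLine($"{n}x{n}: {TimeInit(n):F2} ms");
    }
}
```

Note that a benchmark of this shape measures allocation and first touch together, which, as the answer below explains, is exactly why its results are so sensitive to the machine's memory state.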

Interestingly, changing the data type used for the experiment to floats appears to completely eliminate this phenomenon. No difference occurs between the remotes in any test, and the time taken appears to be entirely proportional even across the 1000x1000 and 2000x2000 breakpoints. Another interesting factor is that the behaviour of the local workstation I am using appears to mirror that of the slower remotes.

Does anybody have any idea what settings in the system configuration might be causing this effect and how they might be changed, or what might be done to further debug the issue?

You'll have to keep in mind what you are really testing. Which is most certainly not a .NET program's ability to assign array elements. That is very fast and normally proceeds at memory bus bandwidth for a big array, typically ~37 gigabytes/second depending on the kind of RAM the machine has, down to ~5 GB/sec on the pokiest kind you could run into today (slow-clocked DDR2 in an old machine).

The new keyword only allocates address space on a demand-paged virtual memory operating system like Windows. Just numbers to the processor, one for every 4096 bytes.

Once you start assigning elements the first time, the demand-paged feature kicks in and your code forces the operating system to allocate RAM for the array. The array element assignment triggers a page fault, one for each 4096 bytes in the array, or one for every 512 doubles. The cost of handling those page faults is included in your measurement.
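As a back-of-the-envelope check (assuming the standard 4096-byte page size), a first pass over a 1000x1000 double array can trigger on the order of two thousand page faults:

```csharp
using System;

class PageMath
{
    static void Main()
    {
        const long elements = 1000L * 1000L;
        const int bytesPerDouble = sizeof(double);          // 8 bytes
        const int pageSize = 4096;                          // 512 doubles per page
        long totalBytes = elements * bytesPerDouble;        // 8,000,000 bytes
        long pages = (totalBytes + pageSize - 1) / pageSize; // round up: 1954 pages
        Console.WriteLine($"{totalBytes} bytes -> up to {pages} page faults");
    }
}
```

So every timing of this benchmark includes roughly 1954 trips through the operating system's page fault handler, which is where all the variability comes from.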

That's smooth sailing only when the OS has a zero-initialized RAM page ready to be used. That usually takes a fat half a microsecond, give or take. Still a lot of time to a processor; it will be stalled while the OS updates the page mapping. Keep in mind that this only happens on the first access to an element; subsequent accesses are fast since the RAM page will still be available. Usually.

It is not smooth sailing when such a RAM page is not available. Then the OS has to pillage one. There are as many as 4 distinct scenarios in your case that I can think of:

  • A page is available but not yet zero-initialized by the low-priority zero page thread. Should be quick; it doesn't take much effort.
  • A page needs to be stolen from another process and its content does not need to be preserved. Happens for pages that previously contained code, for example. Pretty quick as well.
  • A page needs to be stolen and its content needs to be preserved in the paging file. Happens for pages that previously contained data, for example. A hard page fault, and that one hurts. The processor will be stalled while the disk write takes place.
  • Specific to your scenario, the HyperV manager decides that it is time to borrow more RAM from the host operating system. All of the previous bullets then apply to the host OS, plus the overhead of the OS interaction. No real idea how much overhead that entails; it ought to be painful as well.

Which of those bullets you are going to hit is very, very unpredictable. Most of all because it isn't just your program that is involved; whatever else runs on the machine affects it as well. And there's a memory effect: something like writing a big file just before you start the test will have a drastic side-effect, caused by RAM pages being held by the file system cache while they wait for the disk. Or another process having an allocation burst and draining the zero page queue. Or the memory bus getting saturated, which is pretty easy to do and could be affected by the host OS as well. Etcetera.

The long and short of it is that profiling this code just is not very meaningful. Anything can and will happen, and you don't have a decent way to predict it. Or a good way to do anything about it, other than giving the VM gobs of RAM and not running anything else on it :) Profiling results for the second pass through the array are going to be a lot more stable and meaningful; the OS is then no longer involved.
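The second-pass suggestion above can be sketched as follows (a minimal illustration, not the original benchmark): run the same assignment loop twice over one array, so the first pass pays the demand-paging cost and the second pass measures only the assignments.

```csharp
using System;
using System.Diagnostics;

class SecondPass
{
    // Times one full pass assigning 0.0 to every element of the array.
    public static double TimePass(double[,] a)
    {
        int rows = a.GetLength(0), cols = a.GetLength(1);
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                a[i, j] = 0.0;
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    static void Main()
    {
        var a = new double[1000, 1000];
        double first = TimePass(a);   // includes page-fault handling
        double second = TimePass(a);  // RAM pages already mapped in
        Console.WriteLine($"first pass: {first:F2} ms, second pass: {second:F2} ms");
    }
}
```

The second-pass number is the one that actually reflects the machine's memory bandwidth; comparing it across the remotes should give far more repeatable results.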

