rdtsc code showing the performance impact of memory characteristics (such as TLB misses)

Question

I was trying to understand rdtsc() and came across the following code from http://www.mcs.anl.gov/~kazutomo/rdtsc.html. The text explaining the code reads: "The next short benchmark code may show you some performance impacts from memory characteristic such as TLB miss, page fault or page swap in/out." The problem is that I don't really understand how this code shows the performance impact of memory characteristics. Honestly, I don't have a clue. It would be great if someone could explain this a little.

#include <stdio.h>
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#include "rdtsc.h"

#define N (1024*1024*2)

int main(int argc, char* argv[])
{
  unsigned long long a,b;
  unsigned long long min,max;
  char* p;
  int i;

  p = (char*)malloc(N);
  assert( p!=(char*)0 );

  max = 0;
  min = UINT64_MAX;

  for(i=0; i<N; i++ ) {
    a = rdtsc();
    p[i] = 0;
    b = rdtsc() - a;
    if( b > max ) max = b;
    if( b < min ) min = b;  /* checked independently, so a sample that raises max can also lower min */
  }
  printf("min=%llu\n", min);
  printf("max=%llu\n", max);
  return 0;
}

Answer 1

This code simply loops through a 2MB buffer, writing 0 to each byte, and times each individual write. Along the way it updates a low- and high-water mark (min and max) that record the shortest and longest times any single write took.

Assuming this program is the ONLY program running on the CPU, and assuming no asynchronous events occur while it's running (hardware interrupts or timer interrupts), this program would show you both the nominal time for a byte-wide write to memory and the maximum time required to handle a TLB miss and/or a page fault. Both figures also include the overhead of the rdtsc() calls themselves.

A TLB miss is what happens when a program accesses memory for which there is no TLB entry in the MMU. The MMU is the police officer at the intersection of Core Avenue and Memory Lane, who directs traffic to where it's supposed to go. OK, that's a horrible analogy. The MMU (Memory Management Unit) has two main purposes: 1) route virtual memory accesses to the appropriate physical memory address, and 2) enforce read-only, read-write, read-execute, execute-only, etc. permissions, so that a stray pointer access into a virtual memory region with conflicting attributes (or into an unmapped virtual memory region) gets trapped and raises a memory access exception (which shows up as SIGSEGV on Linux).

A TLB (Translation Lookaside Buffer) entry is a small piece of hardware state in the MMU that caches the physical address and the permissions of a virtual memory page (or group of pages) that is currently loaded into physical memory. But an MMU doesn't have an infinite number of TLB entries; it doesn't have nearly enough to describe all of the pages of memory. So if you access a legal address in your process's address space that has no current TLB entry describing the page in which it resides, you get a TLB miss. On architectures with a software-managed TLB (MIPS, for example), the core takes a TLB miss exception, and the exception handler fetches the proper entry's data from the page tables in main memory and writes it into a TLB entry in the MMU; the MMU may even have some built-in mechanism for telling the handler which TLB entry it should use, probably the least-recently-used entry, which is the one most likely to not be needed again in the near future. On x86, which is where rdtsc lives, the hardware walks the page tables and refills the TLB itself without raising an exception, but the extra memory accesses still cost time.

A page fault is akin to a TLB miss, except that in this case the content of that virtual memory page isn't even in physical memory. It may be altogether nonexistent (a newly-mapped page of memory), or it may have been previously swapped out to disk to make room in the limited physical memory for another page of virtual memory that some program needed at some point. While TLB misses are normally pretty fast (but do affect performance nonetheless), a page fault may be a HUGE hit to performance if the page has to be pulled off of disk (even from an SSD!), since disk storage is typically orders of magnitude slower than memory. For this reason, to keep the CPU busy working on something useful, an operating system's page fault handler often suspends the currently-running process in favor of running a different process (one that's in the "ready" state), pending receipt of the data off of disk to fill the requested virtual memory page.

Now, back to this "test code" and the efficacy of its results:

This test depends on the OS+runtime NOT pre-allocating memory pages in the call to malloc(N). I believe this is probably typical behavior; even though the runtime has allocated that much memory and knows the address range that it allocated, the actual pages for that memory are often not allocated by the OS until your program actually accesses (reads or writes) an address in a given page. Pages are 4KB on many platforms, but can be much larger too, such as the 2MB or 4MB large pages that x86 processors support.

So assuming your platform's page size is 4KB (4096 bytes), as your program walks through the 2MB allocated space writing 0 to it a byte at a time, it will go through 512 of these 4KB pages (2097152 / 4096). So 2096640 of these writes should occur "as fast as possible" (without triggering a TLB miss or page fault). And up to 512 of them will trigger a TLB miss and/or a page fault. So the 'min' time gives the fastest time possible to perform a write, given that the written address resides in an already-loaded virtual memory page and its TLB entry is currently resident in the MMU. The 'max' time gives the worst time observed for a write, presumably to an address that resides in a page that is not yet mapped into physical memory (and which triggered a page fault, and perhaps also a TLB miss).

There are two problems with this test, if we're depending on its results to reveal some characteristics of the underlying hardware: 1) By itself, this code neglects the effect of process swapping and hardware interrupts that happen for other reasons, such as time-slicing and network packets being received and processed "in the background" (which can interrupt the running process). And 2) the 2MB test buffer is no larger than a single large page (2MB or 4MB, depending on the x86 paging mode). I don't know what conditions dictate whether operating systems choose to use 4KB pages or large pages, so this may or may not be a factor on your system. Just be aware that if your min and max are on the same order of magnitude as each other, then likely you're on a system with large pages, and if your min and max differ by an order of magnitude or more, the difference may not be entirely attributable to TLB misses and page faults. Perhaps this is why the author hedged a bit in his statement that the code "may show you some performance impacts..." (emphasis on "may" added).

1 answer

Answer 1: accepted, 2013-08-22 18:31:51
