Looking for an accurate way to micro benchmark small code paths written in C++ and running on Linux/OSX

I'm looking to do some very basic micro benchmarking of small code paths, such as tight loops, that I've written in C++. I'm running on Linux and OSX, and using GCC. What facilities are there for sub-millisecond accuracy? I am thinking that a simple test of running the code path many times (several tens of millions?) will give me enough consistency to get a good reading. If anyone knows of preferable methods, please feel free to suggest them.

You can use the "rdtsc" processor instruction on x86/x86_64. On multicore systems, check for the "constant_tsc" capability in CPUID (listed in /proc/cpuinfo on Linux); it means that all cores use the same tick counter, even with dynamic frequency changes and sleeping.
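As a rough illustration of the /proc/cpuinfo check, here is a minimal sketch in C++. The helper names `cpuinfo_has_flag` and `has_constant_tsc` are made up for this example, and the parser assumes the usual Linux layout in which CPU capabilities appear on lines that start with "flags".

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole word on a "flags" line of
// /proc/cpuinfo-style text read from `cpuinfo`. (Hypothetical helper.)
bool cpuinfo_has_flag(std::istream& cpuinfo, const std::string& flag)
{
    assert(!flag.empty());
    std::string line;
    while (std::getline(cpuinfo, line)) {
        if (line.compare(0, 5, "flags") != 0)
            continue;                          // not a "flags : ..." line
        const std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            continue;
        std::istringstream words(line.substr(colon + 1));
        std::string word;
        while (words >> word)
            if (word == flag)
                return true;
    }
    return false;
}

// Convenience wrapper. Returns false when /proc/cpuinfo cannot be opened,
// which is the expected outcome on OS X.
bool has_constant_tsc()
{
    std::ifstream in("/proc/cpuinfo");
    return in && cpuinfo_has_flag(in, "constant_tsc");
}
```

A benchmark could call `has_constant_tsc()` once at startup and fall back to pinning itself to one core when it returns false.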

If your processor does not support constant_tsc, be sure to bind your program to a single core (the taskset utility on Linux).
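The same binding can be done from inside the program instead of with taskset. The sketch below uses the Linux-specific `sched_getaffinity` and `sched_setaffinity` calls; the function name `pin_to_first_allowed_cpu` is invented for this example, and on other systems (including OS X) the sketch simply reports failure.

```cpp
#if defined(__linux__)
#include <sched.h>   // sched_getaffinity, sched_setaffinity, CPU_* macros
#endif

// Pins the calling thread to the first CPU it is currently allowed to run on.
// Returns that CPU's index, or -1 if pinning failed or is unsupported here.
int pin_to_first_allowed_cpu()
{
#if defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;
        cpu_set_t only;
        CPU_ZERO(&only);
        CPU_SET(cpu, &only);               // mask containing just this CPU
        return sched_setaffinity(0, sizeof(only), &only) == 0 ? cpu : -1;
    }
    return -1;                             // empty affinity mask
#else
    return -1;                             // no sched_setaffinity outside Linux
#endif
}
```

Calling this once before the timed region should have roughly the same effect as launching the benchmark under `taskset`.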

When using rdtsc on out-of-order CPUs (all except Intel Atom and possibly some other low-end CPUs), add a serializing ("ordering") instruction such as "cpuid" before it; this temporarily prevents instruction reordering.

Also, Mac OS X has "Shark", which can measure some hardware events in your code.

RDTSC and out-of-order CPUs: more information is in section 18 of the second of Agner Fog's optimization manuals, "Optimizing subroutines in assembly language: An optimization guide for x86 platforms" (the main site with all five manuals is http://www.agner.org/optimize/).

http://www.scribd.com/doc/1548519/optimizing-assembly

On all processors with out-of-order execution, you have to insert XOR EAX,EAX / CPUID before and after each read of the counter in order to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.
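The quoted advice can be sketched with GCC inline assembly as follows. This is an illustration rather than code from the manual: the function name `serialized_ticks` is hypothetical, the assembly path is enabled only on x86_64, and other targets fall back to `std::chrono::steady_clock` (nanoseconds rather than ticks) purely so that the sketch still compiles.

```cpp
#include <cstdint>
#if !defined(__x86_64__)
#include <chrono>
#endif

// Reads the time-stamp counter with XOR EAX,EAX / CPUID before and after the
// read, so the read cannot overlap with surrounding instructions.
inline std::uint64_t serialized_ticks()
{
#if defined(__x86_64__)
    unsigned lo, hi;
    asm volatile(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"                 // serialize before the read
        "rdtsc\n\t"                 // counter -> EDX:EAX
        "movl %%eax, %0\n\t"        // save low half before CPUID overwrites it
        "movl %%edx, %1\n\t"        // save high half
        "xorl %%eax, %%eax\n\t"
        "cpuid"                     // serialize after the read
        : "=&r"(lo), "=&r"(hi)
        :
        : "eax", "ebx", "ecx", "edx", "memory");
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
    // Assumption: no rdtsc on this target, so use a monotonic clock instead.
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
#endif
}
```

Typical use is `start = serialized_ticks(); /* code path */ stop = serialized_ticks();`. CPUID is slow, so it is worth timing an empty region as well and subtracting that overhead from the result.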

A microbenchmark should run the same code in a loop, preferably over a large number of iterations. I used the following and ran it with the time(1) utility.

The following caveats were observed:

  • If the test does not produce a computed result that is printed out, the code is eliminated by the optimizer; gcc with -O3 does that.

  • The test functions test() and lookup() must be implemented in a different source file from the iteration loop. If they are in the same file and the lookup function returns a constant value, the optimizer will not call it even once; it will just multiply the return value by the number of iterations!

file main.c

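#include <stdio.h>

#define RUN_COUNT 10000000

/* Defined in a separate source file so the optimizer cannot see through them. */
void init(void);
int  lookup(void);

int main(void)
{
  int sum = 0;
  int i;

  init();

  for (i = 0; i < RUN_COUNT; i++) {
    sum += lookup();
  }

  /* Print the result so the loop is not eliminated as dead code. */
  printf("%d\n", sum);
  return 0;
}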
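If splitting the code across source files is inconvenient, the two caveats above can probably also be handled within one file using GCC-specific features. The sketch below is an alternative to the setup just shown, not part of it: `lookup_stub`, `keep`, and `run_lookup_loop` are hypothetical names, and the stub body stands in for the real code path.

```cpp
#include <cassert>

// Stand-in for the code path under test. noinline stops GCC from inlining it
// into the loop, and the empty asm acts as a side effect so the call itself
// is not optimized away.
__attribute__((noinline)) int lookup_stub(int key)
{
    asm volatile("");
    return key * 2 + 1;
}

// Makes the compiler treat `value` as used without emitting any instructions.
inline void keep(int value)
{
    asm volatile("" : : "r"(value) : "memory");
}

// Runs the code path `iterations` times and returns the accumulated result.
long run_lookup_loop(int iterations)
{
    assert(iterations >= 0);
    long sum = 0;
    for (int i = 0; i < iterations; ++i) {
        const int r = lookup_stub(i);
        keep(r);                 // each result is observed individually
        sum += r;
    }
    return sum;
}
```

Printing the value returned by `run_lookup_loop` still gives the optimizer a visible use of the whole computation, which matches the first caveat.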

This is what I've used in the past:

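#include <cstddef>     // NULL
#include <sys/time.h>  // gettimeofday, timeval

// Wall-clock time in seconds, with microsecond resolution.
inline double gettime ()
{
    timeval tv;
    gettimeofday (&tv, NULL);
    return double (tv.tv_sec) + 0.000001 * tv.tv_usec;
}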

And then:

double startTime = gettime();
// your code here
double runTime = gettime() - startTime;

This gives readings with microsecond resolution.
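Since a single pass over a small code path usually finishes in well under a microsecond, the timer is normally wrapped around many repetitions and the total is divided by the count, as the question suggests. A minimal sketch of that idea follows; it swaps gettimeofday for `std::chrono::steady_clock` (C++11), which is monotonic, and the name `average_ns_per_call` is made up for this example.

```cpp
#include <cassert>
#include <chrono>

// Times `iterations` calls of `body` with a monotonic clock and returns the
// average cost of one call in nanoseconds (loop overhead included).
template <typename F>
double average_ns_per_call(F body, long iterations)
{
    assert(iterations > 0);
    typedef std::chrono::steady_clock Clock;
    const Clock::time_point start = Clock::now();
    for (long i = 0; i < iterations; ++i)
        body();
    const Clock::time_point stop = Clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

For example, `average_ns_per_call([&] { sum += lookup(); }, 10000000)` would time ten million calls, provided `sum` is printed or otherwise used afterwards so the work is not optimized away.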

Cachegrind / KCachegrind are good for very fine-grained profiling. I don't believe they're available for OS X, but the results you get on Linux should be representative.
