
Looking for an accurate way to micro benchmark small code paths written in C++ and running on Linux/OSX

I'm looking to do some very basic micro benchmarking of small code paths, such as tight loops, that I've written in C++. I'm running on Linux and OSX, and using GCC. What facilities are there for sub-millisecond accuracy? I'm thinking that simply running the code path many times (several tens of millions?) will give me enough consistency to get a good reading. If anyone knows of preferable methods, please feel free to suggest them.

You can use the "rdtsc" processor instruction on x86/x86_64. On multicore systems, check for the "constant_tsc" capability reported by CPUID (listed in /proc/cpuinfo on Linux); it means that all cores use the same tick counter, even with dynamic frequency changes and sleep states.
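
A minimal sketch of checking for that flag on Linux, assuming the usual /proc/cpuinfo layout where a line starting with "flags" lists the capabilities separated by whitespace. The helper names line_has_flag and cpuinfo_has_flag are made up for illustration:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// True if `flag` appears as a whole whitespace-separated token in `line`
// (so "tsc" does not match inside "constant_tsc").
inline bool line_has_flag(const std::string& line, const std::string& flag) {
    std::istringstream tokens(line);
    std::string token;
    while (tokens >> token) {
        if (token == flag) return true;
    }
    return false;
}

// Scans a cpuinfo-style file for a "flags" line that contains `flag`.
// Returns false if the file cannot be opened (for example, on OS X).
inline bool cpuinfo_has_flag(const std::string& flag,
                             const std::string& path = "/proc/cpuinfo") {
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "flags") == 0 && line_has_flag(line, flag)) {
            return true;
        }
    }
    return false;
}
```

Calling cpuinfo_has_flag("constant_tsc") before trusting cross-core TSC readings is the intended use; a false result only means the flag was not found, not that the counter is necessarily unusable.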

If your processor does not support constant_tsc, be sure to bind your program to a single core (the taskset utility on Linux).
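
Binding can also be done from inside the program instead of with taskset. The following is a rough sketch using the Linux-only sched_setaffinity call; it does nothing useful on OS X, which has no equivalent call. The names kAffinitySupported, current_cpu and pin_to_cpu are made up for illustration:

```cpp
#if defined(__linux__)
#include <sched.h>  // sched_setaffinity, sched_getcpu (glibc, needs _GNU_SOURCE; g++ defines it)
constexpr bool kAffinitySupported = true;
#else
constexpr bool kAffinitySupported = false;
#endif

// Returns the CPU the calling thread is currently running on, or -1 if unknown.
inline int current_cpu() {
#if defined(__linux__)
    return sched_getcpu();
#else
    return -1;
#endif
}

// Restricts the calling thread to a single CPU. Returns true on success.
inline bool pin_to_cpu(int cpu) {
#if defined(__linux__)
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
#else
    (void)cpu;
    return false;  // assumption: no affinity support on this platform
#endif
}
```

Pinning to current_cpu() at the start of the benchmark keeps both counter reads on the same core without having to guess which cores the process is allowed to use.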

When using rdtsc on out-of-order CPUs (everything except Intel Atom and possibly some other low-end CPUs), add a serializing instruction such as "cpuid" before it; this temporarily prevents instruction reordering.

Also, Mac OS X has "Shark", which can measure some hardware events in your code.

On RDTSC and out-of-order CPUs, there is more information in section 18 of the second of Agner Fog's excellent optimization manuals, "Optimizing subroutines in assembly language: An optimization guide for x86 platforms" (the main site with all five manuals is http://www.agner.org/optimize/ ):

http://www.scribd.com/doc/1548519/optimizing-assembly

On all processors with out-of-order execution, you have to insert XOR EAX,EAX / CPUID before and after each read of the counter in order to prevent it from executing in parallel with anything else. CPUID is a serializing instruction, which means that it flushes the pipeline and waits for all pending operations to finish before proceeding. This is very useful for testing purposes.
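
A sketch of that recipe with GCC/Clang inline assembly: CPUID (with EAX = 0) is executed before and after each RDTSC. The function names are made up for illustration, and the non-x86 branch is only an assumption added so the sketch still compiles elsewhere; it falls back to a monotonic clock rather than a cycle counter.

```cpp
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__)
// CPUID leaf 0, used only for its serializing side effect.
static inline void serialize_cpu() {
    std::uint32_t eax = 0, ebx, ecx, edx;
    asm volatile("cpuid"
                 : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                 :
                 : "memory");
}

// Reads the time stamp counter with CPUID before and after the read.
inline std::uint64_t read_tsc_serialized() {
    std::uint32_t lo, hi;
    serialize_cpu();
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    serialize_cpu();
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
#else
// Assumption: no RDTSC here, so return steady_clock ticks instead.
inline std::uint64_t read_tsc_serialized() {
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
#endif

// Hypothetical helper: ticks elapsed while calling fn() once.
template <class Fn>
std::uint64_t ticks_for(Fn&& fn) {
    const std::uint64_t start = read_tsc_serialized();
    fn();
    const std::uint64_t stop = read_tsc_serialized();
    return stop - start;
}
```

The result includes the cost of the CPUID instructions themselves, so it is worth measuring an empty fn first and subtracting that overhead.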

A microbenchmark should run the same code in a loop, preferably over a large number of iterations. I used the program below and timed it with the time(1) utility.

The following caveats were observed:

  • if the test does not produce a result that is printed out, the code is eliminated by the optimizer; gcc with -O3 does that.

  • the functions under test (init() and lookup() in the code below) must be implemented in a different source file from the iteration loop; if they are in the same file and lookup() returns a constant value, the optimizer will not call it even once, it will just multiply the return value by the number of iterations!

file main.c

#include <stdio.h>

#define RUN_COUNT 10000000

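/* Defined in a separate source file so the compiler cannot inline them. */
void init(void);
int  lookup(void);


int main(void)
{
  int sum = 0;
  int i;

  init();

  /* Accumulate the results so the calls cannot be optimized away. */
  for (i = 0; i < RUN_COUNT; i++) {
    sum += lookup();
  }

  /* Printing the sum keeps the computation observable. */
  printf("%d\n", sum);
  return 0;
}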
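
If splitting the code across files is inconvenient, a single-file variant is possible with GCC/Clang extensions. This is a different technique from the separate-file approach above, shown only as a sketch: the noinline attribute keeps the call from being inlined, and an empty asm statement makes the running sum look used. lookup_stub, keep_alive and run_loop are made-up names, and lookup_stub merely stands in for the real function under test.

```cpp
#include <cassert>

// GCC/Clang extension: an empty asm that claims to read and modify `value`,
// so the optimizer must actually compute it at this point.
inline void keep_alive(long long& value) {
    asm volatile("" : "+r"(value) : : "memory");
}

// Hypothetical stand-in for the function under test.
__attribute__((noinline)) int lookup_stub() {
    asm volatile("");  // side effect that discourages removing or hoisting the call
    return 42;
}

// Calls lookup_stub() `iterations` times and returns the accumulated sum.
long long run_loop(int iterations) {
    assert(iterations >= 0);
    long long sum = 0;
    for (int i = 0; i < iterations; ++i) {
        sum += lookup_stub();
        keep_alive(sum);
    }
    return sum;
}
```

Checking the generated assembly (g++ -O3 -S) is still the only way to be sure the loop survived.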

This is what I've used in the past:

#include <sys/time.h>

// Wall-clock time in seconds, with microsecond resolution.
inline double gettime()
{
    timeval tv;
    gettimeofday(&tv, NULL);
    return double(tv.tv_sec) + 0.000001 * tv.tv_usec;
}

And then:

double startTime = gettime();
// your code here
double runTime = gettime() - startTime;

This reports times with microsecond resolution.
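
Note that gettimeofday reads the wall clock, which can jump if the system time is adjusted during a run. If C++11 is available, a variation on the same idea is std::chrono::steady_clock, which is monotonic. A minimal sketch, with seconds_per_iteration being a made-up helper name:

```cpp
#include <cassert>
#include <chrono>

// Runs fn() `iterations` times and returns the average seconds per call.
template <class Fn>
double seconds_per_iteration(Fn&& fn, long iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    for (long i = 0; i < iterations; ++i) {
        fn();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Averaging over many iterations is what gives sub-microsecond figures here; a single call is usually too short for the clock to resolve.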

Cachegrind / kCachegrind are good for very fine-grained profiling. I don't believe they're available for OS X, but the results you get on Linux should be representative.
