
Profiling C++ threads with clock()

I am trying to measure how gcc threads perform on my system. I've written some very simple measurement code, something like this:

start = clock();
for(int i=0; i < thread_iters; i++) {
  pthread_mutex_lock(dataMutex);
  data++;
  pthread_mutex_unlock(dataMutex);
}
end = clock();

I do the usual subtraction and division by CLOCKS_PER_SEC to get an elapsed time of about 2 seconds for 100000000 iterations. I then change the profiling code slightly so that I measure the individual time of each mutex_lock/unlock call.

for(int i=0; i < thread_iters; i++) {
  start1 = clock();
  pthread_mutex_lock(dataMutex);
  end1 = clock();
  lock_time+=(end1-start1);

  data++;

  start2 = clock();
  pthread_mutex_unlock(dataMutex);
  end2 = clock();
  unlock_time+=(end2-start2);
}

The times I get for the same number of iterations are: lock ~27 seconds, unlock ~27 seconds.

I get why the total time for the program increases, since there are more timer calls in the loop. But the time for the system calls should still add up to less than 2 seconds. Can someone help me figure out where I went wrong? Thanks!

The clock calls also measure the time it takes to call clock and return from it, which introduces a bias into the measurement. That is, somewhere deep inside the clock function it takes a time sample, but before your code runs, control has to return from deep inside clock. Then, when you take the end measurement, clock has to be called again and control has to pass deep into that function before it actually obtains the time. So you are including all of that overhead as part of the measurement.

You must find out how much time elapses between consecutive clock calls (by averaging over many pairs of back-to-back clock calls to get an accurate figure). That gives you a baseline bias: how much time it takes to execute nothing at all between two clock samples. You then carefully subtract that bias from your measurements.
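A minimal sketch of that calibration, assuming standard C with clock() from <time.h> (the sample count is arbitrary):

#include <stdio.h>
#include <time.h>

int main(void) {
    /* Take many back-to-back clock() pairs with nothing in between;
       the average gap estimates the overhead of the clock calls themselves. */
    const long samples = 1000000;            /* arbitrary sample count */
    clock_t total = 0;

    for (long i = 0; i < samples; i++) {
        clock_t a = clock();
        clock_t b = clock();                 /* nothing executed in between */
        total += b - a;
    }

    printf("average bias per clock() pair: %.9f s\n",
           (double)total / samples / CLOCKS_PER_SEC);
    return 0;
}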

But calls to clock can themselves disturb the performance, so you may still not get an accurate answer: the calls into the kernel to read the clock disturb your L1 data cache and instruction cache. For fine-grained measurements like this, it is better to drop down to inline assembly and read a cycle-counting register from the CPU.
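As a sketch of that approach on x86-64 with GCC-style inline assembly (reading the time-stamp counter with rdtsc; this ignores out-of-order execution and frequency scaling, which matter for serious measurements):

#include <stdint.h>
#include <stdio.h>

/* Read the x86-64 time-stamp counter. RDTSC returns the low 32 bits in EAX
   and the high 32 bits in EDX. */
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t start = read_tsc();
    /* ... code under test ... */
    uint64_t end = read_tsc();
    printf("elapsed: %llu cycles\n", (unsigned long long)(end - start));
    return 0;
}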

clock is best used as you have it in your first example: take samples around something that executes for many iterations, and then divide by the number of iterations to estimate the single-iteration time.
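A sketch of that amortized measurement, using a mutex and counter analogous to the ones in the question (names here are illustrative; build with -pthread):

#include <stdio.h>
#include <time.h>
#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static volatile long data = 0;

int main(void) {
    const long iters = 100000000;            /* same iteration count as above */

    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&mutex);
        data++;
        pthread_mutex_unlock(&mutex);
    }
    clock_t end = clock();

    double total = (double)(end - start) / CLOCKS_PER_SEC;
    printf("total: %.3f s, per lock/increment/unlock: %.1f ns\n",
           total, total / iters * 1e9);
    return 0;
}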
