Benchmarking a pure C++ function

Question

How do I prevent GCC/Clang from inlining and optimizing out multiple invocations of a pure function?

I am trying to benchmark code of this form:

int __attribute__((noinline)) my_loop(int const* array, int len) {
  // Use array to compute result.
}

My benchmark code looks something like this:

int main() {
  const int number = 2048;
  // My own aligned_malloc implementation.
  int* input = (int*)aligned_malloc(sizeof(int) * number, 32);
  // Fill the array with some random numbers.
  make_random(input, number);
  const int num_runs = 10000000;
  for (int i = 0; i < num_runs; i++) {
    const int result = my_loop(input, number); // Call pure function.
  }
  // Since the program exits I don't free input.
}

As expected, Clang seems to be able to turn this into a no-op at -O2 (perhaps even at -O1).

A few things I tried in order to actually benchmark my implementation:

Accumulate the intermediate results in an integer and print the results at the end:

const int num_runs = 10000000;
uint64_t total = 0;
for (int i = 0; i < num_runs; i++) {
  total += my_loop(input, number); // Call pure function.
}
printf("Total is %llu\n", total);

Sadly, this doesn't seem to work. Clang, at least, is smart enough to realize that my_loop is a pure function, and it transforms the benchmark into something like this:

int result = my_loop(input, number);
uint64_t total = num_runs * result;
printf("Total is %llu\n", total);

Set an atomic variable using release semantics at the end of every loop iteration:
```
const int num_runs = 10000000;
std::atomic<uint64_t> result_atomic(0);
for (int i = 0; i < num_runs; i++) {
  int result = my_loop(input, number); // Call pure function.
  // Tried std::memory_order_release too.
  result_atomic.store(result, std::memory_order_seq_cst);
}
printf("Result is %llu\n", result_atomic.load());
```
My hope was that, since atomics introduce a happens-before relationship, Clang would be forced to execute my code. Sadly, it still performed the optimization above and set the value of the atomic to num_runs * result in one shot instead of running num_runs iterations of the function.

Set a volatile int at the end of every loop iteration, along with summing the total:

const int num_runs = 10000000;
uint64_t total = 0;
volatile int trigger = 0;
for (int i = 0; i < num_runs; i++) {
  total += my_loop(input, number); // Call pure function.
  trigger = 1;
}
// If I take this printf out, Clang optimizes the code away again.
printf("Total is %llu\n", total);

This seems to do the trick, and my benchmarks seem to work. However, it is not ideal for a couple of reasons:

Per my understanding of the C++11 memory model, volatile stores do not establish a happens-before relationship, so I can't be sure that some compiler won't decide to do the same num_runs * result_of_1_run optimization.
This method also seems undesirable because I now have the overhead (however tiny) of setting a volatile int on every iteration of my loop.

Is there a canonical way of preventing Clang/GCC from optimizing this result away, maybe with a pragma or something? Bonus points if this ideal method works across compilers.

Answer 1

You can insert instructions directly into the assembly. I sometimes use a macro for splitting up the assembly, e.g. separating loads from calculations and branching.

#define GCC_SPLIT_BLOCK(str)  __asm__( "//\n\t// " str "\n\t//\n" );

Then, in the source, you insert

GCC_SPLIT_BLOCK("Keep this please")

before and after your functions.
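
A closely related variant, sketched here rather than taken from the answer above, replaces the comment-only asm with empty extended asm statements that declare their effects explicitly. This assumes GCC or Clang (the extended asm syntax is not available in MSVC). The helper names escape and consume are hypothetical, and example_kernel is only a stand-in that sums the array, since the real body of my_loop is not shown in the question. The "memory" clobber tells the optimizer that the array contents may have changed, which should stop it from hoisting the pure call out of the loop, and passing the result as an input operand should stop it from discarding the computation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: pretend opaque code may read or write any memory
// reachable through `p`, so loads from it cannot be hoisted out of a loop.
static inline void escape(const void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Hypothetical helper: pretend opaque code reads `value`, so the
// computation producing it cannot be thrown away.
static inline void consume(int value) {
    asm volatile("" : : "r"(value) : "memory");
}

// Stand-in for the pure function under test (an assumption: the real
// body is not given in the question). It simply sums the array.
__attribute__((noinline)) int example_kernel(const int* array, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) {
        sum += array[i];
    }
    return sum;
}

// Calls the kernel num_runs times and returns the accumulated total.
std::uint64_t run_benchmark(const int* input, int len, int num_runs) {
    std::uint64_t total = 0;
    for (int i = 0; i < num_runs; ++i) {
        escape(input);  // Input is treated as possibly modified.
        const int result = example_kernel(input, len);
        consume(result);  // Result is treated as observed.
        total += static_cast<std::uint64_t>(result);
    }
    return total;
}
```

Both asm statements emit no instructions, so the per-iteration cost should be limited to whatever reloads the barrier forces. It is still worth inspecting the generated assembly (for example with -S) to confirm that the call really stays inside the loop on your compiler and optimization level.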

solution1
1 ACCEPTED 2015-07-23 17:24:48