How do I prevent GCC/Clang from inlining and optimizing out multiple invocations of a pure function?
I am trying to benchmark code of this form:
int __attribute__((noinline)) my_loop(int const* array, int len) {
  // Use array to compute the result.
}
My benchmark code looks something like this:
int main() {
  const int number = 2048;
  // My own aligned_malloc implementation.
  int* input = (int*)aligned_malloc(sizeof(int) * number, 32);
  // Fill the array with some random numbers.
  make_random(input, number);

  const int num_runs = 10000000;
  for (int i = 0; i < num_runs; i++) {
    const int result = my_loop(input, number); // Call pure function.
  }
  // Since the program exits, I don't free input.
}
As expected, Clang is able to turn this into a no-op at -O2 (perhaps even at -O1).
A few things I tried so that I could actually benchmark my implementation:
Accumulate the intermediate results in an integer and print the total at the end:
const int num_runs = 10000000;
uint64_t total = 0;
for (int i = 0; i < num_runs; i++) {
  total += my_loop(input, number); // Call pure function.
}
printf("Total is %llu\n", (unsigned long long)total);
Sadly, this doesn't work: Clang, at least, is smart enough to realize the function is pure and transforms the benchmark into something like this:
int result = my_loop(input, number);
uint64_t total = (uint64_t)num_runs * result;
printf("Total is %llu\n", (unsigned long long)total);
Store to an atomic variable at the end of every loop iteration:
const int num_runs = 10000000;
std::atomic<uint64_t> result_atomic(0);
for (int i = 0; i < num_runs; i++) {
  int result = my_loop(input, number); // Call pure function.
  // Tried std::memory_order_release too.
  result_atomic.store(result, std::memory_order_seq_cst);
}
printf("Result is %llu\n", (unsigned long long)result_atomic.load());
My hope was that since atomics introduce a happens-before relationship, Clang would be forced to execute my code. But sadly it still performed the optimization above: it sets the value of the atomic to num_runs * result in one shot instead of running num_runs iterations of the function.
Set a volatile int at the end of every loop iteration, along with summing the total:
const int num_runs = 10000000;
uint64_t total = 0;
volatile int trigger = 0;
for (int i = 0; i < num_runs; i++) {
  total += my_loop(input, number); // Call pure function.
  trigger = 1;
}
// If I take this printf out, Clang optimizes the code away again.
printf("Total is %llu\n", (unsigned long long)total);
This seems to do the trick and my benchmarks work, but it is not ideal for a couple of reasons.

Per my understanding of the C++11 memory model, volatile stores do not establish a happens-before relationship, so I can't be sure that some compiler won't decide to do the same num_runs * result_of_1_run optimization anyway.

Also, this method is undesirable because I now incur the overhead (however tiny) of a volatile store on every iteration of my loop.
Is there a canonical way of preventing Clang/GCC from optimizing this result away, maybe with a pragma or something? Bonus points if the method works across compilers.
You can insert instructions directly into the assembly. I sometimes use a macro for splitting up the assembly, e.g. separating loads from calculations and branching:
#define GCC_SPLIT_BLOCK(str) __asm__( "//\n\t// " str "\n\t//\n" );
Then in the source you insert
GCC_SPLIT_BLOCK("Keep this please")
before and after your functions.