Data races, UB, and counters in C++11

Question

The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:

int num_times_done_it; // global

void doit() {
  ++num_times_done_it;
  // do something
}

void report_stats() {
  printf("called doit %i times\n", num_times_done_it);
  // and probably some other stuff too
}

Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.

Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.

(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)

Apparent non-solutions:

Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right).
I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
There's the awkward solution of having a separate counter per thread and adding them all up in report_stats , but that doesn't solve the data race between doit and report_stats . Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?

EDIT : It seems that we can do this in a slightly indirect way using memory_order_relaxed :

atomic<int> num_times_done_it;
void doit() {
  num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                          memory_order_relaxed);
  // as before
}

However, gcc 4.8.2 generates this code on x86_64 (with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   83 c0 01                add    $0x1,%eax
   9:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

and clang 3.4 generates this code on x86_64 (again with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   ff c0                   inc    %eax
   8:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?

Answer 1

count for each thread separately and sum up after the threads joined. For intermediate results, you may also sum up in between, you result might be off though. This pattern is also faster. You might embed it into a basic helper class for your threads so you have it everywheren if you are using it often.

And - depending on compiler & platform, atomics aren't that expensive (see Herb Sutters "atomic weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 ) but in your case it'll create problems with the caches so it's not advisable.

Answer 2

It seems that the memory_order_relaxed trick is the right way to do this.

This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.

I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.

Data races, UB, and counters in C++11

Question

2 answers

solution1
0 2014-04-11 05:17:47

solution2
0 ACCPTED 2014-04-12 01:49:04

Data races, UB, and counters in C++11

Question

2 answers

solution1 0 2014-04-11 05:17:47

solution2 0 ACCPTED 2014-04-12 01:49:04

solution1
0 2014-04-11 05:17:47

solution2
0 ACCPTED 2014-04-12 01:49:04