Data races, UB, and counters in C++11
The following pattern is commonplace in software that wants to tell its user how many times it has done various things:
#include <cstdio>

int num_times_done_it; // global

void doit() {
    ++num_times_done_it; // unsynchronised read-modify-write
    // do something
}

void report_stats() {
    std::printf("called doit %i times\n", num_times_done_it);
    // and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it are a data race, and hence the entire program's behaviour is undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called, with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts, though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:

Adding volatile into the mix does not make data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.

Keeping a separate counter per thread and adding them all up in report_stats is possible, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
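#include <atomic>

std::atomic<int> num_times_done_it;

void doit() {
    // separate relaxed load and relaxed store, not a single atomic increment
    num_times_done_it.store(1 + num_times_done_it.load(std::memory_order_relaxed),
                            std::memory_order_relaxed);
    // as before
}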
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and unusual page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
Count for each thread separately and sum up after the threads have joined. For intermediate results you may also sum up in between, though your result might be off. This pattern is also faster. If you use it often, you might embed it into a basic helper class for your threads so that you have it everywhere.
Also, depending on compiler and platform, atomics aren't that expensive (see Herb Sutter's "atomic Weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 ), but in your case a single shared atomic will create problems with the caches, since every thread writes to the same location, so it's not advisable.
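The per-thread counting suggested above might look roughly like the following. This is only a sketch: the function name count_per_thread and its parameters are made up for illustration, and it assumes the total is only needed after every worker has been joined.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threads` workers "does it" `iters` times,
// counting in a thread-local variable, and the totals are summed after join.
long count_per_thread(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::vector<long> counts(threads, 0);  // one slot per worker
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counts, t, iters] {
            long local = 0;  // private to this thread, so no sharing at all
            for (int i = 0; i < iters; ++i) {
                ++local;  // stands in for "called doit once"
            }
            counts[t] = local;  // single write; only thread t touches counts[t]
        });
    }
    // join() synchronises with each thread's completion, so the reads
    // below do not race with the writes above.
    for (auto& th : pool) th.join();
    long total = 0;
    for (long c : counts) total += c;
    return total;
}
```

Calling count_per_thread(4, 10000) from main should return 40000. Note that summing the slots before the workers have been joined would read a plain long while another thread may be writing it, which is the doit/report_stats race the question objects to; the slots would have to be atomics for intermediate reads to stay well-defined.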
It seems that the memory_order_relaxed trick is the right way to do this.
This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.
I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
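For comparison, here is a small sketch that puts the relaxed load-then-store from the question next to a single relaxed fetch_add. The helper name run_counter and its parameters are hypothetical. As far as the data-race definition goes, both variants only perform atomic operations on a std::atomic object, so neither is a data race; what the load/store variant gives up is exactness, because another thread's update can land between the load and the store and be overwritten.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: run `threads` workers, each bumping a shared counter
// `iters` times, and return the final count.
// use_rmw == true : one atomic read-modify-write per bump (exact count).
// use_rmw == false: separate relaxed load and relaxed store, as in the
//                   question (well-defined, but updates can be lost).
long run_counter(int threads, int iters, bool use_rmw) {
    assert(threads > 0 && iters >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters, use_rmw] {
            for (int i = 0; i < iters; ++i) {
                if (use_rmw) {
                    counter.fetch_add(1, std::memory_order_relaxed);
                } else {
                    // Another thread's store may land between this load
                    // and this store and then be overwritten.
                    counter.store(counter.load(std::memory_order_relaxed) + 1,
                                  std::memory_order_relaxed);
                }
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load(std::memory_order_relaxed);
}
```

With four threads and 10000 iterations each, the fetch_add variant returns exactly 40000, while the load/store variant returns some value no larger than 40000. That matches the "mostly-right count" the question asks for, at the cost of not knowing how many increments were dropped.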