
Data races, UB, and counters in C++11

The following pattern is commonplace in lots of software that wants to tell its user how many times it has done various things:

int num_times_done_it; // global

void doit() {
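  // unsynchronised read-modify-write of a shared global counter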
  ++num_times_done_it;
  // do something
}

void report_stats() {
  printf("called doit %i times\n", num_times_done_it);
  // and probably some other stuff too
}

Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it may be a data race and hence the entire program's behaviour would be undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there's another data race between the thread modifying num_times_done_it and the thread reporting its value.

Often, the programmer just wants a mostly-right count of the number of times doit has been called with as little overhead as possible.

(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)

Apparent non-solutions:

  • Atomics, with any memory order I know of, fail "as little overhead as possible" here (an atomic increment can be considerably more expensive than an ordinary increment) while overdelivering on "mostly-right" (by being exactly right). (A sketch of a relaxed fetch_add follows this list for reference.)
  • I don't believe tossing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it by volatile int num_times_done_it doesn't fix anything.
  • There's the awkward solution of having a separate counter per thread and adding them all up in report_stats, but that doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and doesn't really fit Hogwild!'s usage.
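For reference, here is what the relaxed atomic variant mentioned in the first bullet might look like. This is only an illustrative sketch (the _atomic-suffixed names are not from the original question); it is exactly right and never loses an increment, but on x86 the fetch_add typically compiles to a lock-prefixed read-modify-write, which is precisely the overhead being avoided:

#include <atomic>
#include <cstdio>

std::atomic<int> num_times_done_it_atomic{0};  // hypothetical atomic counter

void doit_atomic() {
  // Exactly right, but usually emits a locked instruction on x86,
  // so it costs noticeably more than a plain ++.
  num_times_done_it_atomic.fetch_add(1, std::memory_order_relaxed);
  // do something
}

void report_stats_atomic() {
  printf("called doit %i times\n",
         num_times_done_it_atomic.load(std::memory_order_relaxed));
}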

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?

EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:

atomic<int> num_times_done_it;
void doit() {
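  // relaxed load followed by a relaxed store: an increment performed by
  // another thread between the two can be lost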
  num_times_done_it.store(1 + num_times_done_it.load(memory_order_relaxed),
                          memory_order_relaxed);
  // as before
}

However, gcc 4.8.2 generates this code on x86_64 (with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   83 c0 01                add    $0x1,%eax
   9:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

and clang 3.4 generates this code on x86_64 (again with -O3):

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax
   6:   ff c0                   inc    %eax
   8:   89 05 00 00 00 00       mov    %eax,0x0(%rip)

My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page protection flags, entirely equivalent to the one-instruction memory inc and the one-instruction memory add generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?

Count for each thread separately and sum up after the threads have joined. For intermediate results, you may also sum up in between, though the result might be off. This pattern is also faster. You might embed it into a basic helper class for your threads so that you have it everywhere if you use it often.
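A minimal sketch of that idea (the helper names my_slot and slots are hypothetical, not from the answer). Each thread registers its own slot on first use and then increments only that slot; report_stats sums the slots. Each slot is a relaxed atomic so that even an intermediate sum is free of data races by the letter of the standard, while the increment itself stays uncontended and cheap:

#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>

std::mutex slots_mutex;               // guards registration and summation only
std::deque<std::atomic<long>> slots;  // deque: push_back never invalidates references

std::atomic<long>& my_slot() {
  thread_local std::atomic<long>* slot = [] {
    std::lock_guard<std::mutex> lock(slots_mutex);
    slots.emplace_back(0);
    return &slots.back();
  }();
  return *slot;
}

void doit() {
  // Relaxed increment of a slot no other thread writes: it stays in this
  // core's cache, so the cost is close to that of a plain ++.
  my_slot().fetch_add(1, std::memory_order_relaxed);
  // do something
}

void report_stats() {
  long total = 0;
  std::lock_guard<std::mutex> lock(slots_mutex);
  for (auto& s : slots)
    total += s.load(std::memory_order_relaxed);
  // exact once the counting threads have joined; possibly slightly stale before then
  printf("called doit %ld times\n", total);
}

Using a deque keeps each slot's address stable after registration, so a slot can still be read safely even after its owning thread has exited.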

And - depending on compiler and platform, atomics aren't that expensive (see Herb Sutter's "atomic weapons" talk http://channel9.msdn.com/Shows/Going+Deep/Cpp-and-Beyond-2012-Herb-Sutter-atomic-Weapons-1-of-2 ), but in your case it'll create problems with the caches, so it's not advisable.

It seems that the memory_order_relaxed trick is the right way to do this.

This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.

I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
