

C++ multithread atomic load/store

When I read the 5th chapter of the book CplusplusConcurrencyInAction, I found the example code below: multiple threads load/store some atomic values concurrently with memory_order_relaxed, and each array saves the values of x, y and z at each round.

#include <thread>
#include <atomic>
#include <iostream>

std::atomic<int> x(0),y(0),z(0);  // 1
std::atomic<bool> go(false);  // 2

unsigned const loop_count=10;

struct read_values
{
  int x,y,z;
};

read_values values1[loop_count];
read_values values2[loop_count];
read_values values3[loop_count];
read_values values4[loop_count];
read_values values5[loop_count];

void increment(std::atomic<int>* var_to_inc,read_values* values)
{
  while(!go)
    std::this_thread::yield();
  for(unsigned i=0;i<loop_count;++i)
  {
    values[i].x=x.load(std::memory_order_relaxed);
    values[i].y=y.load(std::memory_order_relaxed);
    values[i].z=z.load(std::memory_order_relaxed);
    var_to_inc->store(i+1,std::memory_order_relaxed);  // 4
    std::this_thread::yield();
  }
}

void read_vals(read_values* values)
{
  while(!go)
    std::this_thread::yield();
  for(unsigned i=0;i<loop_count;++i)
  {
    values[i].x=x.load(std::memory_order_relaxed);
    values[i].y=y.load(std::memory_order_relaxed);
    values[i].z=z.load(std::memory_order_relaxed);
    std::this_thread::yield();
  }
}

void print(read_values* v)
{
  for(unsigned i=0;i<loop_count;++i)
  {
    if(i)
      std::cout<<",";
    std::cout<<"("<<v[i].x<<","<<v[i].y<<","<<v[i].z<<")";
  }
  std::cout<<std::endl;
}

int main()
{
  std::thread t1(increment,&x,values1);
  std::thread t2(increment,&y,values2);
  std::thread t3(increment,&z,values3);
  std::thread t4(read_vals,values4);
  std::thread t5(read_vals,values5);

  go=true;

  t5.join();
  t4.join();
  t3.join();
  t2.join();
  t1.join();

  print(values1);
  print(values2);
  print(values3);
  print(values4);
  print(values5);
}

One of the valid outputs mentioned in this chapter:

(0,0,0),(1,0,0),(2,0,0),(3,0,0),(4,0,0),(5,7,0),(6,7,8),(7,9,8),(8,9,8),(9,9,10)
(0,0,0),(0,1,0),(0,2,0),(1,3,5),(8,4,5),(8,5,5),(8,6,6),(8,7,9),(10,8,9),(10,9,10)
(0,0,0),(0,0,1),(0,0,2),(0,0,3),(0,0,4),(0,0,5),(0,0,6),(0,0,7),(0,0,8),(0,0,9)
(1,3,0),(2,3,0),(2,4,1),(3,6,4),(3,9,5),(5,10,6),(5,10,8),(5,10,10),(9,10,10),(10,10,10)
(0,0,0),(0,0,0),(0,0,0),(6,3,7),(6,5,7),(7,7,7),(7,8,7),(8,8,7),(8,8,9),(8,8,9)

The 3rd output of values1 is (2,0,0): at this point it reads x=2 and y=z=0, which means that when y=0, x already equals 2. Why does the 3rd output of values2 read x=0 and y=2? That means x is an old value, because x, y and z are increasing, so when y=2, x should be at least 2. And when I test the code on my PC, I can't reproduce a result like that.

The reason is that reading via x.load(std::memory_order_relaxed) guarantees only that you never see x decrease within the same thread (in this example code). (It also guarantees that a thread writing to x will read that same value again in the next iteration.)
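As a minimal sketch of that second guarantee (hypothetical helper, assuming x is only ever written by this one thread, as in the example), a relaxed load in one iteration must return what the same thread stored in the previous iteration:

#include <atomic>
#include <cassert>

std::atomic<int> x(0);

// x is only written by this thread, so the relaxed load at the start of each
// iteration must return the value stored in the previous iteration
// (write-read coherence within a single thread).
void increment_and_check()
{
  for(unsigned i=0;i<10;++i)
  {
    assert(x.load(std::memory_order_relaxed)==static_cast<int>(i));
    x.store(static_cast<int>(i)+1,std::memory_order_relaxed);
  }
}

int main()
{
  increment_and_check();   // the guarantee is per-thread, so one thread suffices here
}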

In general, different threads can read different values from the same variable at the same time. That is, there need not be a consistent "global state" that all threads agree on. The example output is supposed to demonstrate that: the first thread might still see y = 0 when it has already written x = 4, while the second thread might still see x = 0 when it has already written y = 2. The standard allows this because real hardware may work that way: consider the case when the threads are on different CPU cores, each with its own private L1 cache.

However, it is not possible that the second thread sees x = 5 and then later sees x = 2 - the atomic object always guarantees that there is a consistent global modification order (that is, all writes to the variable are observed to happen in the same order by all threads).
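A small self-contained sketch of that rule (hypothetical, not from the book): because x is only ever incremented and all writes to it form one total modification order, any single reader's successive relaxed loads can never observe x going backwards:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0);

// Successive loads in one thread can never see x decrease, even with
// memory_order_relaxed, because all writes to x form one total order.
void monotonic_reader()
{
  int last=0;
  for(unsigned i=0;i<1000;++i)
  {
    int now=x.load(std::memory_order_relaxed);
    assert(now>=last);   // would fire if x appeared to go backwards
    last=now;
  }
}

void incrementer()
{
  for(unsigned i=0;i<1000;++i)
    x.store(static_cast<int>(i)+1,std::memory_order_relaxed);
}

int main()
{
  std::thread r(monotonic_reader);
  std::thread w(incrementer);
  w.join();
  r.join();
}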

But when using std::memory_order_relaxed, there are no guarantees about when a thread finally does "see" those writes*, or how the observations of different threads relate to each other. You need stronger memory ordering to get those guarantees.
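For instance, a minimal release/acquire sketch (hypothetical, not part of the original listing) gives a cross-variable guarantee that memory_order_relaxed alone does not provide:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0);
std::atomic<int> y(0);

void writer()
{
  x.store(1,std::memory_order_relaxed);
  y.store(1,std::memory_order_release);   // publishes the preceding write to x
}

void reader()
{
  while(y.load(std::memory_order_acquire)!=1)   // synchronizes-with the release store
    std::this_thread::yield();
  assert(x.load(std::memory_order_relaxed)==1); // guaranteed to hold
}

int main()
{
  std::thread t1(writer);
  std::thread t2(reader);
  t1.join();
  t2.join();
}

Here the acquire load that reads 1 synchronizes with the release store, so the write to x happens before the reader's load of x; with relaxed ordering on y, that assertion could fail.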

*In fact, a valid output would be all threads reading only 0 all the time, except the writer threads reading what they wrote to their "own" variable in the previous iteration (and 0 for the others). On hardware that never flushed caches unless prompted, this might actually happen, and it would be fully compliant with the C++ standard!

And I tested the code on my PC; I can't reproduce a result like that.

The "example output" shown is highly artificial.显示的“示例输出”是高度人为的。 The C++ standard allows for this output to happen. C++ 标准允许这种输出发生。 This means you can write efficient and correct multithreaded code even on hardware with no inbuilt guarantees on cache coherency (see above).这意味着您甚至可以在没有内置缓存一致性保证的硬件上编写高效且正确的多线程代码(见上文)。 But common hardware today (x86 in particular) brings a lot of guarantees that actually make certain behavior impossible to observe (including the output in the question).但是今天的通用硬件(特别是 x86)带来了很多保证,实际上使某些行为无法观察(包括问题中的输出)。

Also, note that x, y and z are extremely likely to be adjacent (depends on the compiler), meaning they will likely all land on the same cache line. This will lead to massive performance degradation (look up "false sharing"). But since memory can only be transferred between cores at cache-line granularity, this (together with the x86 coherency guarantees) makes it essentially impossible for an x86 CPU (which you most likely performed your tests with) to read outdated values of any of the variables. Allocating these values more than 1-2 cache lines apart will likely lead to more interesting/chaotic results.
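A sketch of one way to do that (assuming a 64-byte cache line; std::hardware_destructive_interference_size from <new> could be used instead where the standard library provides it):

#include <atomic>
#include <cstddef>

// Hypothetical padding: force each atomic onto its own cache line so the
// writer threads no longer contend for the same line (avoids false sharing).
constexpr std::size_t cache_line=64;   // common cache-line size, an assumption

struct alignas(cache_line) padded_atomic
{
  std::atomic<int> value{0};
};

padded_atomic x,y,z;   // sizeof(padded_atomic)==64, so each lands on its own line

int main()
{
  x.value.store(1,std::memory_order_relaxed);   // writers of x, y, z no longer share a cache line
}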
