Race condition using OpenMP atomic capture operation for 3D histogram of particles and making an index

I have the following piece of code in my full program:

const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
  int ix,jy,kz;
  int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}    
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
  int ic=GetIndex(particles[i]);
  #pragma omp atomic update
  Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
  if(particles[i].ip==3)
  {
    int ic=GetIndex(particles[i]);
    if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
    int cnt=0;
  #pragma omp atomic capture
    cnt=Counter[ic]++;
    MIndex[Begin[ic]+cnt]=i;
  }
}

If I remove

#pragma omp parallel for

the code works properly and the output results are always the same. But with this pragma there is some undefined behaviour / race condition in the code, because each time it gives different output results. How can I fix this issue?

Update: The task is the following. I have many particles with random coordinates. I need to write into the array MIndex the indices (within the array particles) of the particles that fall in each cell (a Cartesian cube, for example 1×1×1 cm) of the coordinate system. So at the beginning of MIndex there should be the indices of the particles in the 1st cell of the coordinate system, then those in the 2nd cell, then the 3rd, and so on. The order of the indices within a given cell's region of MIndex is not important and may be arbitrary. If possible, I need to do this in parallel, perhaps using atomic operations.

There is a straightforward way: traverse all the coordinate cells in parallel and, in each cell, check the coordinates of all the particles. But for a large number of cells and particles this seems slow. Is there a faster approach? Is it possible to traverse the particles array only once in parallel and fill the MIndex array using atomic operations, something like the code above?

You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters, which would be a disaster; see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.

Keep per-thread count arrays from the initial histogram, use them in the 2nd pass

(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Counter[ic] would go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something. But as Laci points out, perhaps you want that check when calculating Length in the first place; then it would be fine.)

Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].

So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And which it will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
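
A minimal sketch of what this manually-partitioned first pass might look like. The NTHREADS constant, the per-thread 2-D rows replacing the original 1-D Length[]/Begin[], and the fixed block split of i are assumptions for illustration, not part of the original code:

#include <omp.h>

enum { NTHREADS = 8 };                 // hypothetical fixed thread count
static int Length[NTHREADS][cub3];     // per-thread rows; zero them before use
static int Begin [NTHREADS][cub3];     // filled by the prefix-sum step below

#pragma omp parallel num_threads(NTHREADS)
{
  const int t  = omp_get_thread_num();
  const int lo = (int)((long long)N *  t      / NTHREADS);   // fixed, reproducible split of i
  const int hi = (int)((long long)N * (t + 1) / NTHREADS);
  for(int i = lo; i < hi; ++i)
    if(particles[i].ip == 3)                     // count only what the 2nd pass will store
      ++Length[t][GetIndex(particles[i])];       // private row: no atomics needed
}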

Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.

// pseudo-code, fairly close to actual C
for (int ic = 0; ic < cub3; ic++) {
   // perhaps do this "vertical" sum into a temporary array
   // or prefix-sum within Length before combining across threads?
   int pos = (ic == 0) ? 0                                       // first bucket starts at 0
           : Begin[0][ic-1] + sum(Length[0..nthreads-1][ic-1]);  // end of the previous bucket

   Begin[0][ic] = pos;
   for (int t = 1; t < nthreads; t++) {
       pos += Length[t-1][ic];   // prefix-sum across threads for this cube bucket
       Begin[t][ic] = pos;
   }
}

This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 16 KiB apart from each other (cub3 = 16^3 = 4096 ints per row). (So 4k aliasing is a possible problem, as are cache conflict misses.)

Perhaps each thread could prefix-sum its own slice of Length into that slice of Begin: 1. for the cache access pattern (and locality, since it just wrote those Lengths), and 2. to get some parallelism for that work.

Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Counter[], but without introducing another per-thread array to zero.)

Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, the MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
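
A sketch of that final loop under the same assumptions as the earlier sketch (the same NTHREADS and the same block split of i as in the first pass, so each (t, ic) pair is only ever touched by one thread):

#pragma omp parallel num_threads(NTHREADS)
{
  const int t  = omp_get_thread_num();
  const int lo = (int)((long long)N *  t      / NTHREADS);   // must match the 1st pass exactly
  const int hi = (int)((long long)N * (t + 1) / NTHREADS);
  for(int i = lo; i < hi; ++i)
    if(particles[i].ip == 3)
    {
      const int ic  = GetIndex(particles[i]);
      const int pos = --Length[t][ic];        // unique slot within this thread's chunk
      MIndex[Begin[t][ic] + pos] = i;         // no atomics: nobody else uses (t, ic)
    }
}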

Don't overdo it with the number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with the number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 16 KiB (still too small to be worth parallelizing the zeroing of, BTW), it's really not that much.


(Previous answer, from before your edit made it clearer what was going on.)

Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.

After your update: you are doing a histogram into Length[], so that part can be sped up.


Atomic RMWs would be necessary for your code as-is: a performance disaster

Atomic increments of shared counters would be slower, and on x86 might hurt memory-level parallelism quite badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens and blocking later loads from starting until after it happens.

As opposed to a single thread, which can have cache misses to multiple Counter, Begin and MIndex elements outstanding when using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for MIndex[] stores.)

ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)

Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough that cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem, too.)

Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.
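
For example, with GCC/Clang something like the following could be tried if you keep the single shared Counter[]. The lookahead distance of 8 iterations is a guess to tune, and whether this helps at all is an assumption to measure; __builtin_prefetch(addr, 1) requests a prefetch for writing (prefetchw on CPUs that support it):

const int PF_DIST = 8;                       // hypothetical lookahead distance
#pragma omp parallel for
for(int i = 0; i < N; ++i)
{
  if(i + PF_DIST < N)                        // write-prefetch a counter we'll need soon
    __builtin_prefetch(&Counter[GetIndex(particles[i + PF_DIST])], 1);
  if(particles[i].ip == 3)
  {
    int ic = GetIndex(particles[i]);
    int cnt;
    #pragma omp atomic capture
    cnt = Counter[ic]++;
    MIndex[Begin[ic] + cnt] = i;
  }
}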


It wouldn't be deterministic which point went in which order, though, even with memory_order_seq_cst increments. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.


Alternative access patterns

Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. Then the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.

Filtering by p.kz ranges probably makes the most sense, since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].

But if your points aren't uniformly distributed, you'd need to know how to break things up so the work is divided approximately equally. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.

Maybe a couple of fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
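
A rough sketch of that idea, under the assumption of a uniform split of the kz range across threads and that Begin[] was already built from the full Length[] histogram as in the original code (disjoint kz ranges imply disjoint ic ranges, since kz has the largest multiplier in GetIndex):

#pragma omp parallel
{
  const int nth   = omp_get_num_threads();
  const int t     = omp_get_thread_num();
  const int kz_lo = -cuba + (int)((long long)cubn *  t      / nth);  // this thread's kz slice
  const int kz_hi = -cuba + (int)((long long)cubn * (t + 1) / nth);
  for(int i = 0; i < N; ++i)                      // every thread filters all the points
  {
    const int kz = particles[i].kz;
    if(kz < kz_lo || kz >= kz_hi || particles[i].ip != 3) continue;
    const int ic  = GetIndex(particles[i]);       // this ic range is owned by this thread alone
    const int cnt = Counter[ic]++;                // non-atomic: no other thread touches it
    MIndex[Begin[ic] + cnt] = i;
  }
}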


Re: your edit

You already loop over the whole array of points doing Length[ic]++? It seems redundant to do the same histogramming work again with Counter[ic]++, but it's not obvious how to avoid it.

The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with a different count array for each thread, and just vertically adding them at the end. That should scale perfectly with threads, since the count array is small enough for L1d cache.

BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, each array is 16*16*16 * sizeof(int) = 16 KiB, so it's just two small memsets.

(Of course, if each thread has its own separate Length array, each thread needs to zero it.) That's small enough that you could even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points is all in the same bucket. Combining the count arrays at the end is a job for #pragma omp simd, or just normal auto-vectorization with gcc -O3 -march=native, since it's integer work.
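
For instance, the vertical add at the end could look something like this, reusing the hypothetical NTHREADS and per-thread Length rows from the sketch further up (the #pragma omp simd is optional; plain auto-vectorization should handle it too):

for(int t = 1; t < NTHREADS; ++t)
{
  #pragma omp simd
  for(int ic = 0; ic < cub3; ++ic)
    Length[0][ic] += Length[t][ic];     // "vertical" add into row 0
}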

For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down with --Length[i], and the other counting up from 0 with Counter[i]++. With different threads looking at different points, this could give you a factor-of-2 speedup. (Modulo contention for MIndex stores.)
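
A sketch of that two-thread split, under the assumption that the shared Length[]/Begin[] were computed counting only the ip==3 particles, so the upward and downward counters meet exactly and hand out disjoint slots:

#pragma omp parallel num_threads(2)
{
  const int t  = omp_get_thread_num();
  const int lo = (t == 0) ? 0     : N / 2;
  const int hi = (t == 0) ? N / 2 : N;
  for(int i = lo; i < hi; ++i)
    if(particles[i].ip == 3)
    {
      const int ic  = GetIndex(particles[i]);
      const int cnt = (t == 0) ? Counter[ic]++   // thread 0 counts up from 0
                               : --Length[ic];   // thread 1 counts down from Length[ic]-1
      MIndex[Begin[ic] + cnt] = i;               // disjoint slots, no atomics needed
    }
}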

To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top.

You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i; Different iterations can write into the same location here, unless you have mathematical proof that this is never the case from the structure of MIndex. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speedup is probably going to be abysmal.

EDIT: the second line, however, is not of the right form for an atomic operation, so you would have to make it critical, which is going to make performance even worse.

Also, @Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this cannot be parallelized.
