
OpenMP atomic substantially slower than critical for array

The examples that I have seen for OpenMP's omp atomic generally involve updating a scalar, and usually report that it is faster than omp critical. In my application I wish to update elements of an allocated array, with some overlap between the elements that different threads will update, and I find that atomic is substantially slower than critical. Does it make a difference that it is an array, and am I using it correctly?

#include <stdlib.h>
#include <assert.h>
#include <omp.h>

#define N_EACH 10000000
#define N_OVERLAP 100000

#if !defined(OMP_CRITICAL) && !defined(OMP_ATOMIC)
#error Must define OMP_CRITICAL or OMP_ATOMIC
#endif
#if defined(OMP_CRITICAL) && defined(OMP_ATOMIC)
#error Must define only one of either OMP_CRITICAL or OMP_ATOMIC
#endif

int main(void) {

  int const n = omp_get_max_threads() * N_EACH -
                (omp_get_max_threads() - 1) * N_OVERLAP;
  int *const a = (int *)calloc(n, sizeof(int));

#pragma omp parallel
  {
    int const thread_idx = omp_get_thread_num();
    int i;
#ifdef OMP_CRITICAL
#pragma omp critical
#endif /* OMP_CRITICAL */
    for (i = 0; i < N_EACH; i++) {
#ifdef OMP_ATOMIC
#pragma omp atomic update
#endif /* OMP_ATOMIC */
      a[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
    }
  }

/* Check result is correct */
#ifndef NDEBUG
  {
    int *const b = (int *)calloc(n, sizeof(int));
    int thread_idx;
    int i;
    for (thread_idx = 0; thread_idx < omp_get_max_threads(); thread_idx++) {
      for (i = 0; i < N_EACH; i++) {
        b[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
      }
    }
    for (i = 0; i < n; i++) {
      assert(a[i] == b[i]);
    }
    free(b);
  }
#endif /* NDEBUG */

  free(a);
}

Note that in this simplified example we can determine in advance which elements will overlap, so it would be more efficient to apply atomic / critical only when updating those, but in my real application this is not possible.
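For the simplified case only, the idea of synchronizing just the overlapping elements could look roughly like the sketch below. The helper name update_split and its parameters are hypothetical and not part of the original program. It assumes that 2 * n_overlap <= n_each, so that only neighbouring threads share elements, and that the array is sized for omp_get_max_threads() threads as in the code above. The serial fallbacks exist only so that the sketch also builds without -fopenmp.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: lets the sketch build without -fopenmp). */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Hypothetical helper: each thread updates n_each elements, sharing
   n_overlap elements with each neighbouring thread. Only the shared
   elements are updated atomically. Returns the number of threads used. */
static int update_split(int *a, int n_each, int n_overlap) {
  int used = 1;
  assert(2 * n_overlap <= n_each); /* only neighbours may overlap */
#pragma omp parallel
  {
    int const t = omp_get_thread_num();
    int const nt = omp_get_num_threads();
    int *const base = a + (size_t)t * (n_each - n_overlap);
    /* [0, lo) is shared with thread t-1, [hi, n_each) with thread t+1. */
    int const lo = (t > 0) ? n_overlap : 0;
    int const hi = (t < nt - 1) ? n_each - n_overlap : n_each;
    int i;
    if (t == 0) used = nt; /* written by one thread only */
    for (i = 0; i < lo; i++) {
#pragma omp atomic update
      base[i] += i;
    }
    for (i = lo; i < hi; i++) /* private range: no synchronization */
      base[i] += i;
    for (i = hi; i < n_each; i++) {
#pragma omp atomic update
      base[i] += i;
    }
  }
  return used;
}
```

Calling update_split(a, N_EACH, N_OVERLAP) on a zero-initialized array would take the place of the parallel region above. Since the real application cannot predict the overlap, this only illustrates the remark and is not a fix for it.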

When I compile this using:

  • gcc -O2 atomic_vs_critical.c -DOMP_CRITICAL -DNDEBUG -fopenmp -o critical
  • gcc -O2 atomic_vs_critical.c -DOMP_ATOMIC -DNDEBUG -fopenmp -o atomic

and run with time ./critical, I get: real 0m0.110s user 0m0.086s sys 0m0.058s

and with time ./atomic, I get: real 0m0.205s user 0m0.742s sys 0m0.032s

So it uses about half the wallclock time with the critical section (and I get the same when I repeat it).

There is another post that claims critical is slower than atomic, but that uses a scalar, and when I run the provided code the atomic result is actually slightly faster than the critical one.

Your comparison is not fair: the #pragma omp critical is placed before the for loop, so each thread enters the critical section once and then runs a plain loop that the compiler can vectorize, whereas #pragma omp atomic update is inside the loop, so every element is updated with a separate atomic operation, which prevents vectorization. This difference in vectorization causes the surprising runtimes. For a fair comparison, place both inside the loop:

for (i = 0; i < N_EACH; i++) {
#ifdef OMP_CRITICAL
#pragma omp critical
#endif /* OMP_CRITICAL */
#ifdef OMP_ATOMIC
#pragma omp atomic update
#endif /* OMP_ATOMIC */
   a[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
}
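To measure both variants from a single binary, a small timing helper along the following lines could be used. This is only a sketch: the name timed_update and the use_atomic flag are hypothetical, and the fallbacks are there only so that it also builds without -fopenmp. It keeps both constructs inside the loop, as suggested above, and measures wall-clock time with omp_get_wtime.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: lets the sketch build without -fopenmp). */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_max_threads(void) { return 1; }
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* Hypothetical helper: performs the same overlapping update as the
   question, with the synchronization construct inside the loop in both
   cases. Returns the elapsed wall-clock time in seconds. The array must
   be sized for omp_get_max_threads() threads. */
static double timed_update(int *a, int n_each, int n_overlap, int use_atomic) {
  double const t0 = omp_get_wtime();
#pragma omp parallel
  {
    int *const base =
        a + (size_t)omp_get_thread_num() * (n_each - n_overlap);
    int i;
    if (use_atomic) {
      for (i = 0; i < n_each; i++) {
#pragma omp atomic update
        base[i] += i;
      }
    } else {
      for (i = 0; i < n_each; i++) {
#pragma omp critical
        base[i] += i;
      }
    }
  }
  return omp_get_wtime() - t0;
}
```

Running timed_update once with use_atomic set to 1 and once with 0, on two separately zero-initialized arrays, gives two directly comparable timings. With the critical section inside the loop every single update takes the same lock, so it is probably wise to start with a much smaller n_each than N_EACH.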
