
OpenMP and C++ parallel for loop: why does my code slow down when using OpenMP?

I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.

#include <iostream>
#include <vector>
#include <ctime>
#include <omp.h>

using namespace std;

int main(){
  srand(time(NULL)); // Seed random number generator

  vector<int> v;       // Vector to hold random numbers in interval [0,9]
  vector<int> d(10,0); // Vector to hold counts of each integer, initialized to 0

  for(int i=0;i<1e9;++i)
    v.push_back(rand()%10); // Push back random numbers [0,9]

  clock_t c=clock();

  #pragma omp parallel for
  for(int i=0;i<v.size();++i)
    d[v[i]]+=1; // Count number stored at v[i]

  cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;

  for(vector<int>::iterator i=d.begin();i!=d.end();++i)
    cout<<*i<<endl;

  return 0;
}

The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each integer there are (i.e., how many ones are found in v, how many twos, etc.).

Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?

My problem is when I try to make the counting loop parallel. Without the #pragma omp statement, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.

Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone please explain my error, or point me toward some insightful literature with appropriate keywords to help my search?

Your code exhibits:

  • race conditions due to unsynchronised access to a shared variable
  • false (and true) sharing cache problems
  • wrong measurement of run time

Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but more than one thread). Compare the outputs from different runs.

False sharing occurs when two threads write to memory locations that are close enough together to end up on the same cache line. This causes the cache line to bounce constantly from core to core (or from CPU to CPU in multi-socket systems) and generates an excess of cache coherency messages. With 32 bytes per cache line, 8 elements of the vector fit in one cache line. With 64 bytes per cache line, the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and somewhat slower (but not as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) processors. True sharing occurs at those elements that are accessed by two or more threads at the same time.

You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to the elements of d (faster or slower, depending on your OpenMP runtime), or accumulate local values and then do a final synchronised reduction (fastest). The first option is implemented like this:

#pragma omp parallel for
for(int i=0;i<v.size();++i)
  #pragma omp atomic
  d[v[i]]+=1; // Count number stored at v[i]

The second option is implemented like this:

omp_lock_t locks[10];
for (int i = 0; i < 10; i++)
  omp_init_lock(&locks[i]);

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
  int vv = v[i];
  omp_set_lock(&locks[vv]);
  d[vv]+=1;//Count number stored at v[i]
  omp_unset_lock(&locks[vv]);
}

for (int i = 0; i < 10; i++)
  omp_destroy_lock(&locks[i]);

(include omp.h to get access to the omp_* functions)

I leave it up to you to come up with an implementation of the third option; a minimal sketch of one possibility follows.
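A minimal sketch, assuming the same v and d as above (the name local is illustrative): each thread accumulates counts into its own private vector, then merges them into d once, inside a critical section.

#pragma omp parallel
{
  vector<int> local(10, 0); // Per-thread histogram; no sharing during counting

  #pragma omp for nowait    // No barrier needed before the merge
  for(int i=0;i<(int)v.size();++i)
    local[v[i]] += 1;

  #pragma omp critical      // One synchronised merge per thread, not per element
  for(int j=0;j<10;++j)
    d[j] += local[j];
}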

You are measuring elapsed time using clock(), but it measures CPU time, not wall-clock time. If you have one thread running at 100% CPU usage for 1 second, then clock() will indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() will indicate an increase in CPU time of 8 seconds (that is, 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high resolution timer API) instead.
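For example, a minimal sketch of the timing change using omp_get_wtime(), which returns wall-clock seconds as a double:

double t0 = omp_get_wtime(); // Wall-clock time before the parallel loop

// ... run the (properly synchronised) counting loop here ...

cout << "Seconds: " << omp_get_wtime() - t0 << endl;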

EDIT: Once your race condition is resolved via correct synchronization, the following paragraph applies; before that, your data races unfortunately make speed comparisons moot:

Your program is slowing down because the 10 counters updated in the parallel section are accessed randomly. OpenMP cannot update any of those elements without a lock (which you would need to provide via synchronization); as a result, the locking causes your threads to incur more overhead than you gain from counting in parallel.

A solution that makes this speed up is to instead give each OpenMP thread a local variable that counts all of the values 0-9 that the particular thread has seen, and then sum those up in the master count vector. This parallelizes easily and is much faster, because the threads don't need to lock on a shared write vector. I would expect a close-to-Nx speed-up, where N is the number of OpenMP threads, as very little locking should be required. This solution also avoids a lot of the race conditions currently in your code.

See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on thread-local storage with OpenMP.
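As an illustrative sketch, compilers that support OpenMP 4.5 array-section reductions (e.g. recent GCC) can express this per-thread accumulation and final merge in a single clause; the pointer dp below is an assumption of this sketch, not part of the original code:

int* dp = d.data(); // Array-section reductions work on raw arrays/pointers

#pragma omp parallel for reduction(+:dp[:10])
for(int i=0;i<(int)v.size();++i)
  dp[v[i]] += 1; // Each thread updates a private copy; OpenMP sums them at the end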
