How do I parallelize a for loop through a C++ std::list using OpenMP?

I would like to iterate through all elements in an std::list in parallel fashion using OpenMP. The loop should be able to alter the elements of the list. Is there a simple solution for this? It seems that OpenMP 3.0 supports parallel for loops when the iterator is a Random Access Iterator, but not otherwise. In any case, I would prefer to use OpenMP 2.0 as I don't have full control over which compilers are available to me.

If my container were a vector, I might use:

#pragma omp parallel for
for (auto it = v.begin(); it != v.end(); ++it) {
    it->process();
}

I understand that I could copy the list into a vector, do the loop, then copy everything back. However, I would like to avoid this complexity and overhead if possible.
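For reference, a minimal sketch of that copy-out / copy-back workaround (the approach I'd like to avoid) might look like this, assuming the list is called l, its element type is cheap to copy, and process() is the same member function as above:

std::vector<decltype(l)::value_type> tmp(l.begin(), l.end()); // copy the list into a vector

#pragma omp parallel for
for (int i = 0; i < (int)tmp.size(); ++i)  // signed index keeps OpenMP 2.0 happy
    tmp[i].process();                      // work on the contiguous copy

std::copy(tmp.begin(), tmp.end(), l.begin()); // copy the results back into the list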

If you decide to use OpenMP 3.0, you can use the task feature:

#pragma omp parallel
#pragma omp single
{
  for(auto it = l.begin(); it != l.end(); ++it)
     #pragma omp task firstprivate(it)
       it->process();
  #pragma omp taskwait
}

This will execute the loop in one thread, but delegate the processing of elements to others.

Without OpenMP 3.0 the easiest way would be writing all pointers to the elements of the list (or iterators) into a vector and iterating over that one. This way you wouldn't have to copy anything back and would avoid the overhead of copying the elements themselves, so it shouldn't have too much overhead:

std::vector<my_element*> elements; //my_element is whatever is in list
for(auto it = list.begin(); it != list.end(); ++it)
  elements.push_back(&(*it));

#pragma omp parallel shared(elements)
{
  #pragma omp for
  for(size_t i = 0; i < elements.size(); ++i) // or use iterators in newer OpenMP
      elements[i]->process();
}

If you want to avoid copying even the pointers, you can always create a parallelized for loop by hand. You can either have the threads access interleaved elements of the list (as proposed by KennyTM) or split the range into roughly equal contiguous parts before iterating over those. The latter seems preferable, since the threads avoid accessing list nodes currently processed by other threads (even if only the next pointer), which could lead to false sharing. This would look roughly like this:

#pragma omp parallel
{
  int thread_count = omp_get_num_threads();
  int thread_num   = omp_get_thread_num();
  size_t chunk_size= list.size() / thread_count;
  auto begin = list.begin();
  std::advance(begin, thread_num * chunk_size);
  auto end = begin;
  if(thread_num == thread_count - 1) // last thread iterates the remaining sequence
     end = list.end();
  else
     std::advance(end, chunk_size);
  #pragma omp barrier
  for(auto it = begin; it != end; ++it)
    it->process();
}

The barrier is not strictly needed; however, if process mutates the processed element (meaning it is not a const method), there might be some sort of false sharing without it when threads iterate over a sequence which is still being mutated. This way will iterate 3*n times over the sequence (where n is the number of threads), so scaling might be less than optimal for a high number of threads.

To reduce the overhead you could put the generation of the ranges outside of the #pragma omp parallel, however you will need to know how many threads will form the parallel section. So you'd probably have to manually set num_threads, or use omp_get_max_threads() and handle the case that the number of threads created is less than omp_get_max_threads() (which is only an upper bound). In that case you could assign each thread several chunks (using #pragma omp for should do that):

int max_threads = omp_get_max_threads();
std::vector<std::pair<std::list<...>::iterator, std::list<...>::iterator> > chunks;
chunks.reserve(max_threads); 
size_t chunk_size= list.size() / max_threads;
auto cur_iter = list.begin();
for(int i = 0; i < max_threads - 1; ++i)
{
   auto last_iter = cur_iter;
   std::advance(cur_iter, chunk_size);
   chunks.push_back(std::make_pair(last_iter, cur_iter));
}
chunks.push_back(std::make_pair(cur_iter, list.end()));

#pragma omp parallel shared(chunks)
{
  #pragma omp for
  for(int i = 0; i < max_threads; ++i)
    for(auto it = chunks[i].first; it != chunks[i].second; ++it)
      it->process();
}

This will take only three iterations over the list (two, if you can get the size of the list without iterating). I think that is about the best you can do for non-random-access iterators without using tasks or iterating over some out-of-place data structure (like a vector of pointers).

I doubt it is possible since you can't just jump into the middle of a list without traversing the list. Lists are not stored in contiguous memory and std::list iterators are not Random Access. They are only bidirectional.
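One quick way to confirm this is to check the iterator category with std::iterator_traits; a small standalone sketch:

#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

static_assert(std::is_same<
    std::iterator_traits<std::list<int>::iterator>::iterator_category,
    std::bidirectional_iterator_tag>::value,
    "std::list iterators are only bidirectional");

static_assert(std::is_same<
    std::iterator_traits<std::vector<int>::iterator>::iterator_category,
    std::random_access_iterator_tag>::value,
    "std::vector iterators are random access");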

http://openmp.org/forum/viewtopic.php?f=3&t=51

#pragma omp parallel
{
   for(it= list1.begin(); it!= list1.end(); it++)
   {
      #pragma omp single nowait
      {
         it->compute();
      }
   } // end for
} // end ompparallel

This can be understood as unrolled as:

{
  it = list1.begin();
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
  #pragma omp single nowait
  {
    it->compute();
  }
  it++;
...
}

Given code like this:

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    std::vector<int> l(4, 0);
    #pragma omp parallel for
    for (int i = 0; i < (int)l.size(); ++i) {
        printf("th %d = %d \n", omp_get_thread_num(), l[i] = i);
    }
    printf("\n");
    #pragma omp parallel
    {
        for (auto i = l.begin(); i != l.end(); ++i) {
            #pragma omp single nowait
            {
                printf("th %d = %d \n", omp_get_thread_num(), *i);
            }
        }
    }
    return 0;
}

With export OMP_NUM_THREADS=4, the output is as follows (note that in the second section the worker thread number can repeat):

th 2 = 2 
th 1 = 1 
th 0 = 0 
th 3 = 3 

th 2 = 0 
th 1 = 1 
th 2 = 2 
th 3 = 3

Without using OpenMP 3.0 you have the option of having all threads iterating over the list:

std::list<T>::iterator it;
#pragma omp parallel private(it)
{
   for(it = list1.begin(); it!= list1.end(); it++)
   {
      #pragma omp single nowait
      {
         it->compute();
      }
   } 
} 

In this case each thread has its own copy of the iterator (private), but only a single thread will access a specific element (single), whereas the other threads will move forward to the next items (nowait).

Or you can loop once to build a vector of pointers to be then distributed among threads:

std::vector< T*> items;

items.reserve(list.size());
//put the pointers in the vector
std::transform(list.begin(), list.end(), std::back_inserter(items), 
               [](T& n){ return &n; }
);

#pragma omp parallel for
for (int i = 0; i < items.size(); i++)
{
  items[i]->compute();
}

Depending on your specific case one or the other can be faster. Testing which one suits you better is easy.
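For example, here is a rough way to time both variants with omp_get_wtime(); this is a standalone sketch using a made-up toy element type, meant only to illustrate the measurement, not a rigorous benchmark:

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <list>
#include <vector>
#include <omp.h>

struct Item {                     // toy stand-in for T
    double x = 1.0;
    void compute() { for (int k = 0; k < 1000; ++k) x = x * 1.0000001 + 1e-9; }
};

int main() {
    std::list<Item> list1(100000);

    // Variant 1: every thread walks the list, single nowait hands out elements.
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        for (auto it = list1.begin(); it != list1.end(); ++it) {
            #pragma omp single nowait
            { it->compute(); }
        }
    }
    double t1 = omp_get_wtime();

    // Variant 2: collect pointers once, then use a plain parallel for.
    std::vector<Item*> items;
    items.reserve(list1.size());
    std::transform(list1.begin(), list1.end(), std::back_inserter(items),
                   [](Item& n) { return &n; });
    #pragma omp parallel for
    for (int i = 0; i < (int)items.size(); ++i)
        items[i]->compute();
    double t2 = omp_get_wtime();

    printf("single nowait:  %f s\n", t1 - t0);
    printf("pointer vector: %f s\n", t2 - t1);
    return 0;
}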

Here is a solution which allows inserting/removing new elements of a list in parallel.

For a list with N elements we first cut the list into nthreads lists with roughly N/nthreads elements each. In a parallel region this can be done like this:

int ithread = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int t0 = (ithread+0)*N/nthreads;
int t1 = (ithread+1)*N/nthreads;

std::list<int> l2;
#pragma omp for ordered schedule(static)
for(int i=0; i<nthreads; i++) {
    #pragma omp ordered
    {
        auto it0 = l.begin(), it1 = it0;
        std::advance(it1, t1-t0);       
        l2.splice(l2.begin(), l, it0, it1);
    }
}

Where l2 is the cut list for each thread.

Then we can act on each list in parallel. For example, we can insert -1 at every fifth position of the list like this:

auto it = l2.begin();
int pos = t0; // global position of the element *it currently refers to
for(int i=(t0+4)/5; i<(t1+4)/5; i++) {
    std::advance(it, 5*i-pos); // advance relative to the current position
    pos = 5*i;
    l2.insert(it, -1);
}

Finally, after we are done operating on the lists in parallel, we splice the lists for each thread back into one list, in order, like this:

#pragma omp for ordered schedule(static)
for(int i=0; i<nthreads; i++) {
    #pragma omp ordered
    l.splice(l.end(), l2, l2.begin(), l2.end());
}

The algorithm is essentially:

  1. Fast-forward through the list sequentially, making cut lists.
  2. Act on the cut lists in parallel, adding, modifying, or removing elements.
  3. Splice the modified cut lists back together sequentially.

Here is a working example:

#include <algorithm>
#include <iostream>
#include <list>
#include <omp.h>

int main(void) {
  std::list<int> l;
  for(int i=0; i<22; i++) {
    l.push_back(i);
  }
  for (auto it = l.begin(); it != l.end(); ++it) {
    std::cout << *it << " ";
  } std::cout << std::endl;

  int N = l.size();
  #pragma omp parallel
  {
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int t0 = (ithread+0)*N/nthreads;
    int t1 = (ithread+1)*N/nthreads;

    //cut list into nthreads lists with size=N/nthreads
    std::list<int> l2;
    #pragma omp for ordered schedule(static)
    for(int i=0; i<nthreads; i++) {
      #pragma omp ordered
      {
        auto it0 = l.begin(), it1 = it0;
        std::advance(it1, t1-t0);
        l2.splice(l2.begin(), l, it0, it1);
      }
    }
    //insert -1 at every 5th position
    auto it = l2.begin();
    int pos = t0; // global position of the element *it currently refers to
    for(int i=(t0+4)/5; i<(t1+4)/5; i++) {
      std::advance(it, 5*i-pos); // advance relative to the current position
      pos = 5*i;
      l2.insert(it, -1);
    }

    //splice lists in order back together.
    #pragma omp for ordered schedule(static)
    for(int i=0; i<nthreads; i++) {
      #pragma omp ordered
      l.splice(l.end(), l2, l2.begin(), l2.end());
    }  
  }

  for (auto it = l.begin(); it != l.end(); ++it) {
    std::cout << *it << " ";
  } std::cout << std::endl;  
}

Result

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 
-1 0 1 2 3 4 -1 5 6 7 8 9 -1 10 11 12 13 14 -1 15 16 17 18 19 -1 20 21
