

C++ OpenMP directives

I have a loop that I'm trying to parallelize, and in it I am filling a container, say an STL map. Consider the simple pseudocode below, where T1 and T2 are arbitrary types, while f and g are functions taking an integer argument and returning T1 and T2 respectively:

#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
   c.insert(std::make_pair<T1,T2>(f(i), g(i)));
}

This looks rather straightforward and seems like it should be trivially parallelizable, but it doesn't speed up as I expected. On the contrary, it leads to run-time errors in my code due to unexpected values being filled into the container, likely caused by race conditions. I've even tried putting in barriers and whatnot, but all to no avail. The only thing that makes it work is to use a critical directive, as below:

#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
#pragma omp critical
   {
      c.insert(std::make_pair<T1,T2>(f(i), g(i)));
   }
}

But this rather defeats the whole point of using OpenMP in the above example, since only one thread at a time executes the bulk of the loop (the container insert statement). What am I missing here? Short of changing the way the code is written, can somebody kindly explain?

This particular example you have is not a good candidate for parallelism unless f() and g() are extremely expensive function calls.

  1. STL containers are not thread-safe. That's why you're getting the race conditions. Accessing them therefore needs to be synchronized, which makes your insertion process inherently sequential.

  2. As the other answer mentions, there's a LOT of overhead for parallelism. So unless f() and g() are extremely expensive, your loop doesn't do enough work to offset that overhead.

Now assuming f() and g() are extremely expensive calls, your loop can be parallelized like this:

#pragma omp parallel for schedule(static) private(i) shared(c)
for(i = 0; i < N; ++i) {
    std::pair<T1,T2> p = std::make_pair<T1,T2>(f(i), g(i));

#pragma omp critical
    {
        c.insert(p);
    }
}

Running multithreaded code makes you think about thread safety and shared access to your variables. As soon as you start inserting into c from multiple threads, the collection must be prepared to handle such "simultaneous" calls and keep its data consistent; are you sure it is made this way?

Another thing is that parallelization has its own overhead, and you are not going to gain anything when you try to run a very small task on multiple threads; with the cost of splitting and synchronization you might end up with an even higher total execution time for the task.

  1. c will obviously have data races, as you guessed. STL map is not thread-safe. Calling its insert method concurrently from multiple threads has very unpredictable behavior, most often a crash.

  2. Yes, to avoid the data races you must use either (1) a mutex, such as #pragma omp critical, or (2) a concurrent data structure (a.k.a. a lock-free data structure). However, not all data structures can be lock-free on current hardware. For example, TBB provides tbb::concurrent_hash_map. If you don't need ordering of the keys, you may use it and could get some speedup, as it does not use a single conventional mutex.

  3. In the case where you can use just a hash table and the table is very large, you could take a reduction-like approach (see this link for the concept of reduction). Hash tables do not care about insertion order. In this approach, you allocate a separate hash table per thread and let each thread insert N/#threads items in parallel, which will give a speedup. Lookups can likewise be done by accessing these tables in parallel.
