
Why is my OpenMP implementation slower than a single threaded implementation?

I am learning about OpenMP concurrency and tried it out on some existing code of mine. In this code, I tried to make all the for loops parallel. However, this seems to make the program much slower: at least 10x slower than the single-threaded version.

Here is the code: http://pastebin.com/zyLzuWU2

I also wrote a pthreads version, which turns out to be faster than the single-threaded version.

Now the question is: what am I doing wrong in my OpenMP implementation that is causing this slowdown?

Thanks!

Edit: the single-threaded version is just the same code without all the #pragmas.

One problem I see with your code is that you are using OpenMP across loops that are very small (8 or 64 iterations, for example). This will not be efficient, because the overhead of starting and synchronizing the threads outweighs the work done in each loop. If you want to use OpenMP for the n-queens problem, look at OpenMP 3.0 tasks and thread parallelism for branch-and-bound problems.
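As a rough illustration of the task-based approach, here is a minimal sketch that counts n-queens solutions with one OpenMP task per first-row column. It is not the asker's code: the names solve and count_queens are made up for this example, the bitmask solver is a common textbook formulation, and it assumes 1 <= n < 32. Each task runs a whole sequential subtree search, so the scheduling overhead is paid n times rather than once per tiny inner loop.

```c
/* Sequential bitmask search: counts the ways to complete a partial
   placement. cols, d1 and d2 mark the columns attacked in this row. */
static long solve(int n, int row, unsigned cols, unsigned d1, unsigned d2)
{
    if (row == n)
        return 1;

    long count = 0;
    unsigned full = (1u << n) - 1u;              /* assumes n < 32 */
    unsigned avail = full & ~(cols | d1 | d2);   /* free columns */

    while (avail) {
        unsigned bit = avail & -avail;           /* lowest free column */
        avail -= bit;
        count += solve(n, row + 1, cols | bit,
                       (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

/* Hypothetical entry point: one task per column of the first row. */
long count_queens(int n)
{
    long total = 0;

    #pragma omp parallel
    #pragma omp single
    {
        for (int c = 0; c < n; c++) {
            unsigned bit = 1u << c;

            /* Coarse-grained task: an entire subtree of the search. */
            #pragma omp task firstprivate(bit) shared(total)
            {
                long sub = solve(n, 1, bit, bit << 1, bit >> 1);

                /* One cheap atomic update per task. */
                #pragma omp atomic
                total += sub;
            }
        }
    }   /* implicit barrier: all tasks have finished here */

    return total;
}
```

Compiled with OpenMP enabled (for example -fopenmp with GCC), the subtrees are searched in parallel. Without that flag the pragmas are ignored and the same code runs sequentially, which makes it easy to compare timings.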

I think your code is much too complex to be reviewed here. One error that I saw immediately is that it is not even correct. Wherever you use an omp parallel for to compute a sum, you must add reduction(+: yourcountervariable) so that the results of the different threads are combined correctly. Otherwise one thread may overwrite the result of the others.
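A minimal sketch of the reduction clause on a counting loop follows. The function name and the loop body are invented for illustration and are not taken from the pastebin code.

```c
/* Hypothetical helper: counts the multiples of 3 in [0, n). */
long count_multiples_of_three(long n)
{
    long count = 0;

    /* reduction(+:count) gives every thread its own private copy of
       count, initialised to 0, and adds the copies together when the
       loop ends. Without the clause, concurrent count++ is a data race. */
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; i++) {
        if (i % 3 == 0)
            count++;
    }
    return count;
}
```

For example, count_multiples_of_three(10) returns 4 (for 0, 3, 6 and 9), whatever the number of threads.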

At least two reasons:

  1. You're only doing 8 iterations of a very simple loop. Your runtime will be completely dominated by the overhead involved in setting up all the threads.

  2. In some places, the critical section will cause contention; all the threads will be trying to enter the critical section continuously, and will block each other.
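One common way to reduce the contention described in the second point is to let each thread work on a private variable and enter the critical section only once, to merge its result. The sketch below applies that pattern to a hypothetical argmax helper. The function and its arguments are assumptions made for illustration, not part of the original code, and it expects n >= 1.

```c
#include <stddef.h>

/* Hypothetical helper: index of the first largest element of a[0..n-1]. */
size_t argmax(const double *a, size_t n)
{
    size_t best = 0;

    #pragma omp parallel
    {
        /* Declared inside the region, so each thread has its own copy. */
        size_t local_best = 0;

        #pragma omp for nowait
        for (long i = 0; i < (long)n; i++) {
            if (a[i] > a[local_best])
                local_best = (size_t)i;
        }

        /* Entered once per thread, not once per iteration. */
        #pragma omp critical
        {
            if (a[local_best] > a[best] ||
                (a[local_best] == a[best] && local_best < best))
                best = local_best;
        }
    }
    return best;
}
```

With this structure the number of lock acquisitions equals the number of threads, not the number of loop iterations, so the threads rarely wait on each other.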

