Performance issue while parallelizing using OpenMP

I was trying to parallelize a code, but doing so only made the performance worse. I wrote a Fortran program that runs several Monte Carlo integrations and then computes their mean.

      implicit none  
      integer, parameter :: n=100
      integer, parameter :: m=1000000
      real, parameter :: pi=3.141592654

      real MC,integ,x,y
      integer i,j,init,inside
!     Number of threads is read from standard input
      read*,init
      call OMP_SET_NUM_THREADS(init)
      call random_seed()
!     The reduction variable must start from a defined value
      integ=0.0
!$OMP PARALLEL DO PRIVATE(J,X,Y,INSIDE,MC) 
!$OMP& REDUCTION(+:INTEG)
      do i=1,n
         inside=0
         do j=1,m
           call random_number(x)
           call random_number(y)

           x=x*pi
           y=y*2.0

           if(y.le.x*sin(x))then
             inside=inside+1
           endif

         enddo

         MC=inside*2*pi/m
         integ=integ+MC/n
      enddo
!$OMP END PARALLEL DO

      print*, integ
      end

As I increase the number of threads, the run time increases drastically. I have looked for solutions to such problems, and in most cases shared variables turn out to be the cause, but I cannot see how that applies to my case.

I am running it on a 16-core processor using the Intel Fortran compiler.

EDIT: The program is shown after adding implicit none, declaring all variables, and adding the private clause.

You should not use RANDOM_NUMBER for high performance computing, and definitely not in parallel threads. There are no guarantees about the quality of the standard random number generator or about its thread safety. See Can Random Number Generator of Fortran 90 be trusted for Monte Carlo Integration?

Some compilers use a fast algorithm that cannot be called in parallel. Some compilers have a slow method that can be called in parallel. Some are both fast and callable in parallel. Some generate poor quality random sequences, some better ones.
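To see which of these categories a given compiler falls into, it can help to time RANDOM_NUMBER itself at different thread counts. The following is only a rough sketch in free-form Fortran: the program name and the call count are made up for illustration, and if your compiler's RANDOM_NUMBER is not thread safe the numbers it returns in parallel are not trustworthy, so look only at the timings.

```fortran
program time_random_number
  use omp_lib
  implicit none
  integer, parameter :: calls = 10000000   ! arbitrary illustrative workload
  integer :: i, nthreads
  real :: x
  double precision :: total, t0, t1

  nthreads = 1
  do while (nthreads <= omp_get_max_threads())
    total = 0.0d0
    t0 = omp_get_wtime()
    ! Every thread calls the standard generator; x is private, total is reduced
    !$omp parallel do private(x) reduction(+:total) num_threads(nthreads)
    do i = 1, calls
      call random_number(x)
      total = total + x
    end do
    !$omp end parallel do
    t1 = omp_get_wtime()
    print *, 'threads:', nthreads, ' seconds:', t1 - t0, ' mean:', total / calls
    nthreads = nthreads * 2
  end do
end program time_random_number
```

If the time grows with the thread count, the generator is most likely serializing access to a shared state, which would match the slowdown described in the question.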

You should use a parallel PRNG library. There are many. See here for recommendations for Intel: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/283349. I use a library based on http://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP, in my own slightly improved version: https://bitbucket.org/LadaF/elmm/src/e732cb9bee3352877d09ae7f6b6722157a819f2c/src/simplevtk.f90?at=master&fileviewer=file-view-default. But be careful: I don't care about the quality of the sequence in my applications, only about speed.
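As a minimal sketch of the idea, and not the code from any of the libraries linked above, each thread can keep its own private generator state so that no thread touches shared data while drawing numbers. The generator here is a plain 64-bit xorshift, and the function name next_uniform and the seeding constants are assumptions chosen for illustration; its statistical quality has not been vetted, so prefer a real library for production work.

```fortran
program mc_private_rng
  use, intrinsic :: iso_fortran_env, only: int64
  use omp_lib
  implicit none
  integer, parameter :: n = 100, m = 1000000
  real, parameter :: pi = 3.141592654
  integer(int64) :: state
  integer :: i, j, inside
  real :: x, y, integ

  integ = 0.0
  !$omp parallel private(state, i, j, inside, x, y) reduction(+:integ)
  ! Hypothetical seeding: a distinct nonzero state for every thread
  state = 88172645463325252_int64 &
        + 2654435761_int64 * int(omp_get_thread_num() + 1, int64)
  !$omp do
  do i = 1, n
    inside = 0
    do j = 1, m
      x = pi * next_uniform(state)
      y = 2.0 * next_uniform(state)
      if (y <= x * sin(x)) inside = inside + 1
    end do
    integ = integ + (inside * 2 * pi / m) / n
  end do
  !$omp end do
  !$omp end parallel
  print *, integ

contains

  ! One xorshift64 step on the caller's private state, mapped to [0,1)
  function next_uniform(s) result(u)
    integer(int64), intent(inout) :: s
    real :: u
    s = ieor(s, ishft(s, 13))
    s = ieor(s, ishft(s, -7))
    s = ieor(s, ishft(s, 17))
    ! Keep the top 24 bits so the value fits a single precision mantissa
    u = real(ishft(s, -40)) * 2.0**(-24)
  end function next_uniform
end program mc_private_rng
```

Because the state is private and passed explicitly, the inner loop needs no synchronization, and the printed estimate should come out close to pi. The result depends on the number of threads, since each thread starts from a different seed.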


Regarding the old version:

You have a race condition there.

With

      inside=inside+1

several threads can compete to read and write the variable at the same time. You will have to synchronize the access somehow. If you make it a reduction variable, you will have problems with

      integ=integ+MC/n

If you make it private, then inside=inside+1 will only count locally.
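For a plain shared counter, one common way to synchronize the access is the reduction clause, which gives every thread its own copy and sums the copies at the end of the loop. The sketch below is a standalone toy example in free-form Fortran, not the code from the question; the names hits and n are made up for illustration.

```fortran
program race_free_count
  implicit none
  integer, parameter :: n = 3000000
  integer :: i, hits

  hits = 0
  ! Each thread increments its own copy of hits; the copies are
  ! added together when the loop finishes, so no update is lost
  !$omp parallel do reduction(+:hits)
  do i = 1, n
    if (mod(i, 3) == 0) hits = hits + 1
  end do
  !$omp end parallel do

  print *, hits   ! prints 1000000 for any number of threads
end program race_free_count
```

Without the reduction clause, hits would be shared, and concurrent increments could overwrite each other and give a smaller, varying total.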

MC also appears to be in a race condition, because several threads will be writing to it. It is not clear at all what MC does or why it is there, because you are not using its value anywhere. Are you sure the code you show is complete? If not, please see How to make a Minimal, Complete, and Verifiable example.

See With OpenMP parallelized nested loops run slow and many other examples of how a race condition can make a program slow.
