
Is there a way to speed-up openMP

I am solving a problem that compares execution times between a serial, an MPI and an OpenMP code. The problem is that the OpenMP version is slower than the MPI one. Is there a way to evolve the OpenMP code below so that it becomes faster than MPI?

for(i=0;i<loop;i++)
  {
    #pragma omp parallel for private(k,dx,dy,dz,d,a) schedule(dynamic)
      for(j=0;j<N;j++)
      {
        for(k=0;k<N;k++)
        {
          if(j!=k)
          {
            dx=C[k*3+0]-C[j*3+0];
            dy=C[k*3+1]-C[j*3+1];
            dz=C[k*3+2]-C[j*3+2];

            d=sqrt(pow(dx,2)+pow(dy,2)+pow(dz,2));

            F[j*3+0]-=G*M[j]*M[k]/pow(d,3)*dx;
            F[j*3+1]-=G*M[j]*M[k]/pow(d,3)*dy;
            F[j*3+2]-=G*M[j]*M[k]/pow(d,3)*dz;
          }
        }
      }
      #pragma omp for schedule(dynamic)
        for(j=0;j<N;j++)
        {
          for(k=0;k<3;k++)
          {
            a=F[j*3+k]/M[j];
            V[j*3+k]=V[j*3+k]+a*Dt[i];
            C[j*3+k]=C[j*3+k]+V[j*3+k]*Dt[i];
          }
        }
  }

What this code does: the outer loop counts the time steps the process goes through and is also used to index the Dt table at the end. The second loop walks over each mass that moves, and the third computes the forces pushing it from the other masses in the system. The nested pair of loops after that computes the new velocities and positions. With this in mind, I cannot move the parallelism to the outer loop, because every i iteration needs a freshly updated C table. So, is there anything that can be changed so that this code runs faster?

For more info about the problem:

  • loop : takes values between 10,000 and 1,000,000,000 (provided by the user)
  • N : takes values between 2 and 10 (provided by the user)
  • C : takes random values between min and max (provided by the user)
  • F and V : initial values 0.00
  • G : 6.673e-11

Allocation of the tables

M=malloc(N*sizeof(int));
C=malloc(N*3*sizeof(float));
F=malloc(N*3*sizeof(float));
V=malloc(N*3*sizeof(float));
Dt=malloc(loop*sizeof(float));

Table values

for(i=0;i<N;i++)
{
  M[i]=rand()%(high-low+1)+low;
}

for(i=0;i<N*3;i++)
{
  C[i]=rand()%(max-min+1)+min;

  F[i]=0.0;
  V[i]=0.0;
}

for(i=0;i<loop;i++)
{
  Dt[i]=(float)rand()/RAND_MAX;
}

You may start by replacing schedule(dynamic) with schedule(static) . There is absolutely no need for dynamic scheduling here, since the amount of work done by each iteration is constant. schedule(dynamic) defaults to a chunk size of 1 and dynamically assigns each iteration to some thread, with the associated huge overhead.

Dynamic scheduling is useful when each iteration involves a varying amount of work, in which case static scheduling may lead to load imbalance and idling threads. A canonical case is colouring a fractal set. Even then, it is often more reasonable to dispatch work items in chunks of more than one iteration in order to minimise the dispatch overhead.
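
For reference, here is a minimal sketch of the two clauses (the loop bodies are placeholders, not the force computation from the question):

void schedule_demo(int n, double *work)
{
    int j;

    /* Constant work per iteration: static scheduling splits the
       iteration space once, up front, with almost no dispatch overhead. */
    #pragma omp parallel for schedule(static)
    for (j = 0; j < n; j++)
        work[j] *= 2.0;

    /* Varying work per iteration: dynamic scheduling balances the load,
       but hand out chunks (here 16 iterations) rather than single
       iterations to keep the dispatch overhead low. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (j = 0; j < n; j++)
        work[j] *= 2.0;
}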

The second loop is not running in parallel at all, since what you have there is an orphaned OpenMP for construct instead of a combined parallel for one. You also need to make k and a private.
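
The direct fix, assuming that loop stays on its own outside any parallel region, is the combined construct with the scalars privatised (the restructured version further below avoids the repeated fork/join altogether):

#pragma omp parallel for private(k,a) schedule(static)
for(j=0;j<N;j++)
{
  for(k=0;k<3;k++)
  {
    a=F[j*3+k]/M[j];
    V[j*3+k]=V[j*3+k]+a*Dt[i];
    C[j*3+k]=C[j*3+k]+V[j*3+k]*Dt[i];
  }
}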


Now that we know that N is really small and loop takes such large values, there are some things that can be done to improve the performance.

First, there is no such thing as "calling omp parallel once to start the parallelism". There are parallel regions, which execute in parallel whenever the flow of control passes through them. A parallel region is the block of code following the OpenMP parallel construct. OpenMP worksharing constructs such as for only execute in parallel when inside the dynamic scope of a parallel region. The second loop is therefore not parallel, since it is only a for construct and is not nested, lexically or dynamically, in a parallel region.

To make the terminology clear, lexical nesting of an OpenMP construct inside a parallel region means:

#pragma omp parallel
{
   #pragma omp for
   for (...) {}
}

and dynamic nesting means:

foo() {
  #pragma omp for
  for (...) {}
}

#pragma omp parallel
{
  foo();
}

Just calling foo() from outside a parallel region will not make the loop run in parallel.

There are shorthand combined constructs, such as parallel for , for when the only code in the body of a parallel region is a worksharing construct such as for .

Second, parallel regions are not free. OpenMP follows the fork/join model of parallel computation, where the program executes sequentially until the flow of execution encounters a parallel region. When that happens, a team of worker threads is forked and the program starts to execute in parallel. At the end of the parallel region, the worker threads are joined back into the main thread and the program continues to execute sequentially. Forking and joining have their price in terms of execution time. Although practically all modern OpenMP runtimes use thread pools, so that only the very first parallel region activation is really slow (due to the time it takes the OS to spawn all the worker threads), the fork/join overhead is still not negligible. It is therefore meaningless to use OpenMP unless there is enough work to distribute between the threads so that this overhead can be amortised.

Here is an illustration of the problem: four iterations, each taking one time unit, computed sequentially and in parallel with two threads. The overhead of both the fork and the join is two time units:

|    sequential                 parallel
|  +------------+     +-------------------------+
|  |    it.0    |     |          fork           |
|  |    it.1    |     |        overhead         |
|  |    it.2    |     |    it.0    |    it.2    |
|  |    it.3    |     |    it.1    |    it.3    |
|  +------------+     |          join           |
|                     |        overhead         |
|                     +-------------------------+
v  time

Although dividing the iterations between two threads makes the computation itself twice as fast (two time units instead of four), the fork and join overhead (2 + 2 units) makes the parallel version slower overall: six time units versus four.

The same, but now with ten iterations:

|    sequential                 parallel
|  +------------+     +-------------------------+
|  |    it.0    |     |          fork           |
|  |    it.1    |     |        overhead         |
|  |    it.2    |     |    it.0    |    it.5    |
|  |    it.3    |     |    it.1    |    it.6    |
|  |    it.4    |     |    it.2    |    it.7    |
|  |    it.5    |     |    it.3    |    it.8    |
|  |    it.6    |     |    it.4    |    it.9    |
|  |    it.7    |     |          join           |
|  |    it.8    |     |        overhead         |
|  |    it.9    |     +-------------------------+
|  +------------+
|
v  time

Clearly, the parallel version is now faster (9 time units versus 10) and gets faster still the more iterations there are: with n such iterations the speedup is n / (n/2 + 4), which approaches 2x asymptotically from below. Note that the problem in the first case is not that there are only four iterations, but that those iterations take only one time unit each. It is perfectly fine to use OpenMP for a problem with a small number of iterations but a large amount of computational time per iteration.

The problem in the first case is greatly exacerbated when the parallel region sits inside an outer loop with many iterations, which is exactly your case. The canonical solution is to move the outer loop inside the parallel region. This way there will be a single fork and a single join, and the overhead will not be replicated on every iteration. With your code, something like this:

#pragma omp parallel private(i,k,dx,dy,dz,d,a)
for(i=0;i<loop;i++)
{
   #pragma omp for schedule(static)
   for(j=0;j<N;j++)
   {
      for(k=0;k<N;k++)
      {
         if(j!=k)
         {
            dx=C[k*3+0]-C[j*3+0];
            dy=C[k*3+1]-C[j*3+1];
            dz=C[k*3+2]-C[j*3+2];

            d=sqrt(pow(dx,2)+pow(dy,2)+pow(dz,2));

            F[j*3+0]-=G*M[j]*M[k]/pow(d,3)*dx;
            F[j*3+1]-=G*M[j]*M[k]/pow(d,3)*dy;
            F[j*3+2]-=G*M[j]*M[k]/pow(d,3)*dz;
         }
      }
   }

   #pragma omp for schedule(static)
   for(j=0;j<N;j++)
   {
      for(k=0;k<3;k++)
      {
         a=F[j*3+k]/M[j];
         V[j*3+k]=V[j*3+k]+a*Dt[i];
         C[j*3+k]=C[j*3+k]+V[j*3+k]*Dt[i];
      }
   }
}

You have to be very careful now, because the entire loop is inside the parallel region and each thread executes all of its iterations, i.e. there is no distribution of the i iterations. There is no worksharing directive applied to the i -loop, and therefore i must explicitly be given the private treatment. A better coding style would declare all private variables inside the parallel region, in which case there would be no need for a private clause at all, but this is not done here for demonstration reasons.

Because the i -loop iterations are not independent of one another, you have to make sure that all threads execute them in lock-step. This is usually achieved with barrier synchronisation, which in the code above comes from the implicit barriers at the end of the for constructs. The same applies to the different stages inside each iteration: here, the second worksharing construct does not start before the previous one has finished, thanks to the implicit barrier at the end of the latter.
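
For illustration, here is a sketch of that better style: the scalars are declared inside the region, hence automatically private, and the implicit barriers are marked. The simulate wrapper and its parameter list are assumed here just to give the variables from the question a home:

#include <math.h>

void simulate(int loop, int N, float *C, float *F, float *V,
              int *M, float *Dt, double G)
{
   #pragma omp parallel
   for (int i = 0; i < loop; i++)     /* every thread runs all i iterations */
   {
      #pragma omp for schedule(static)
      for (int j = 0; j < N; j++)
      {
         for (int k = 0; k < N; k++)
         {
            if (j != k)
            {
               /* block-scope variables are automatically private */
               double dx = C[k*3+0] - C[j*3+0];
               double dy = C[k*3+1] - C[j*3+1];
               double dz = C[k*3+2] - C[j*3+2];
               double d  = sqrt(dx*dx + dy*dy + dz*dz);
               double f  = G * M[j] * M[k] / (d*d*d);

               F[j*3+0] -= f * dx;
               F[j*3+1] -= f * dy;
               F[j*3+2] -= f * dz;
            }
         }
      } /* implicit barrier: all forces are ready before the update starts */

      #pragma omp for schedule(static)
      for (int j = 0; j < N; j++)
      {
         for (int k = 0; k < 3; k++)
         {
            double a = F[j*3+k] / M[j];
            V[j*3+k] += a * Dt[i];
            C[j*3+k] += V[j*3+k] * Dt[i];
         }
      } /* implicit barrier: everyone sees the new C before iteration i+1 */
   }
}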

The very first thing to do is to set up a time measurement and check whether this snippet really is your hotspot. (I think you already did this, since you compared OpenMP to MPI.)
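
A minimal sketch of such a measurement using omp_get_wtime(), the wall-clock timer from the OpenMP runtime:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double t0 = omp_get_wtime();   /* wall-clock time before the section */

    /* ... the i/j/k loops under test go here ... */

    double t1 = omp_get_wtime();   /* wall-clock time after the section */
    printf("elapsed: %f s\n", t1 - t0);
    return 0;
}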

Here are my thoughts:

  • Make all local variables local.
  • Check whether the dynamic scheduling really helps you.
  • Reduce superfluous calculations: pow(sqrt(x),3) = pow(x,1.5) .
  • If N is not too large, I suggest that you precompute M_once[j+N*k]=M[j]*M[k] once. If j==k you can set M_once to zero and thus save the time for the branching (the if statement); a sketch follows after the code below.

Here is the code:

for(int i=0;i<loop;i++) {
    #pragma omp parallel for
      for(int j=0;j<N;j++)
      {
        for(int k=0;k<N;k++)
        {
          if(j!=k)
          {
            const double dx=C[k*3+0]-C[j*3+0];
            const double dy=C[k*3+1]-C[j*3+1];
            const double dz=C[k*3+2]-C[j*3+2];

            const double d=pow(dx,2)+pow(dy,2)+pow(dz,2);
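            /* d here holds the squared distance, so pow(d,1.5) below is |r|^3 */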

            const double factor = G*M[j]*M[k]/pow(d,1.5); 

            F[j*3+0] -= factor * dx;
            F[j*3+1] -= factor * dy;
            F[j*3+2] -= factor * dz;
          }
        }
      }
      #pragma omp parallel for   /* combined construct: the orphaned 'for' alone would not run in parallel */
        for(int j=0;j<N;j++)
        {
          for(int k=0;k<3;k++)
          {
            const double a = F[j*3+k] / M[j];
            V[j*3+k] = V[j*3+k] + a * Dt[i];
            C[j*3+k] += V[j*3+k] * Dt[i];
          }
        }
  }
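
As a sketch of the last suggestion (M_once and build_pair_products are assumed names, not part of the original code): the pair products can be built once before the time loop. The diagonal is zeroed as proposed, though note that removing the j!=k branch entirely would also need a guard, because d is zero when j==k.

#include <stdlib.h>

/* Build the table of pair products G*M[j]*M[k] once up front;
   with N at most 10 this is at most 100 entries. */
double *build_pair_products(int N, const int *M, double G)
{
    double *M_once = malloc((size_t)N * (size_t)N * sizeof *M_once);
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            M_once[j + N*k] = (j == k) ? 0.0 : G * (double)M[j] * M[k];
    return M_once;
}

Inside the force loop, G*M[j]*M[k] is then replaced by M_once[j+N*k] .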
