
openMP for loop increment statement handling

for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            XY[2*nind] = i;
            XY[2*nind + 1] = j;
            nind++;
        }
    }
}

Here x = 512 and z = 512, nind = 0 initially, and XY has size 2*x*z.

I want to optimize these loops with openMP, but the 'nind' variable is tightly bound to the serial execution of the loop. I have no clue how to handle it, because I am also checking a condition: some iterations skip the increment and others perform it, so each write position depends on all previous iterations. With openMP, whichever thread comes first would increment nind first. Is there any way to unbind it? (By 'binding' I mean it can only be implemented serially.)

A typical cache-friendly solution in that case is to collect the (i,j) pairs in private arrays, then concatenate those private arrays at the end, and finally sort the result if needed:

#pragma omp parallel
{
  // Per-thread buffer for the pairs found by this thread
  // (note: this VLA is large; it could also be heap-allocated)
  uint myXY[2*z*x];
  uint mynind = 0;

  #pragma omp for collapse(2) schedule(dynamic,N)
  for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
      if (inFunc(p, index)) {
        myXY[2*mynind] = i;
        myXY[2*mynind + 1] = j;
        mynind++;
      }
    }
  }

  // Append this thread's pairs to the shared array, one thread at a time
  #pragma omp critical(concat_arrays)
  {
    memcpy(&XY[2*nind], myXY, 2*mynind*sizeof(uint));
    nind += mynind;
  }
}

// Sort the pairs if needed
qsort(XY, nind, 2*sizeof(uint), compar);

// Order pairs by i first, then by j (qsort requires const void * parameters)
int compar(const void *a, const void *b)
{
   const uint *p1 = a, *p2 = b;

   if (p1[0] < p2[0])
     return -1;
   else if (p1[0] > p2[0])
     return 1;
   else if (p1[1] < p2[1])
     return -1;
   else if (p1[1] > p2[1])
     return 1;

   return 0;
}

You should experiment with different values of N in the schedule(dynamic,N) clause in order to achieve the best trade-off between overhead (for small values of N) and load imbalance (for large values of N). The comparison function compar could probably be written in a more optimal way; one possibility is sketched below.
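For example, a branch-light variant could pack each pair into a single 64-bit key, so one comparison orders by i first and then by j. This is only a sketch, assuming uint is 32 bits; compar_packed is a made-up name:

#include <stdint.h>

// Pack (i,j) into one 64-bit key: i in the high half, j in the low half
int compar_packed(const void *a, const void *b)
{
    const uint *p1 = a, *p2 = b;
    uint64_t k1 = ((uint64_t)p1[0] << 32) | p1[1];
    uint64_t k2 = ((uint64_t)p2[0] << 32) | p2[1];
    return (k1 > k2) - (k1 < k2);   // -1, 0, or 1 without branches
}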

The assumption here is that the overhead from merging and sorting the array is small. Whether that will be the case depends on many factors.

Here is a variation on Hristo Iliev's good answer.

The important parameter to act on here is the index of the pairs rather than the pairs themselves.

We can fill private arrays of the pair indices in parallel for each thread. The arrays for each thread will be sorted (irrespective of the scheduling).

The following function merges two sorted arrays:

// Merge two sorted arrays a (length na) and b (length nb) into c
void merge(uint *a, uint *b, uint *c, int na, int nb) {
    int i=0, j=0, k=0;
    while(i<na && j<nb) c[k++] = a[i] < b[j] ? a[i++] : b[j++];
    while(i<na) c[k++] = a[i++];
    while(j<nb) c[k++] = b[j++];
}

Here is the remaining code:

uint nind = 0;
uint *P = NULL;          // merged, sorted indices (grows as threads finish)
#pragma omp parallel
{
    uint myP[x*z];       // per-thread buffer of indices (a large VLA)
    uint mynind = 0;
    #pragma omp for schedule(dynamic) nowait
    for(uint k = 0 ; k < x*z; k++) {
        if (inFunc(p, index)) myP[mynind++] = k;
    }
    // Merge this thread's sorted indices into the global sorted array
    #pragma omp critical
    {
        uint *t = (uint*)malloc(sizeof *P * (nind+mynind));
        merge(P, myP, t, nind, mynind);
        free(P);
        P = t;
        nind += mynind;
    }
}

Then, given an index k in P, the pair is (k/z, k%z).
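For instance, expanding the indices back into explicit pairs could look like this (a small sketch; XY and its allocation are assumed to exist as in the question):

// Expand the sorted index array P back into (i,j) pairs
for (uint m = 0; m < nind; m++) {
    XY[2*m]     = P[m] / z;   // i
    XY[2*m + 1] = P[m] % z;   // j
}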

The merging can be improved. Right now it is O(omp_get_num_threads()), but it could be done in O(log2(omp_get_num_threads())). I did not bother with this; a rough sketch of the idea follows.
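A possible (untested) sketch of the logarithmic version: it assumes that instead of merging inside the critical section, each thread stored its sorted result and its count in hypothetical shared arrays parts[t] and counts[t], with nthreads threads. The results are then merged pairwise in log2(nthreads) rounds:

// Pairwise (tree) merge of per-thread sorted index arrays
for (int step = 1; step < nthreads; step *= 2) {
    #pragma omp parallel for
    for (int t = 0; t < nthreads; t += 2*step) {
        if (t + step < nthreads) {
            uint *tmp = malloc(sizeof(uint) * (counts[t] + counts[t+step]));
            merge(parts[t], parts[t+step], tmp, counts[t], counts[t+step]);
            free(parts[t]);
            free(parts[t+step]);
            parts[t] = tmp;
            counts[t] += counts[t+step];
            parts[t+step] = NULL;
            counts[t+step] = 0;
        }
    }
}
// parts[0] now holds all counts[0] indices in sorted order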


Hristo Iliev pointed out that dynamic scheduling does not guarantee that the iterations per thread increase monotonically. I think in practice they do, but it is not guaranteed in principle.

If you want to be 100% sure that the iterations increase monotonically, you can implement dynamic scheduling by hand, for example as sketched below.
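A minimal, untested sketch of that idea: a shared counter hands out fixed-size chunks, so every chunk a given thread takes starts later than its previous one (CHUNK is a made-up tuning parameter):

// Hand-rolled dynamic scheduling: each thread grabs monotonically
// increasing chunks from a shared counter
#define CHUNK 4096
uint next = 0;
#pragma omp parallel
{
    for (;;) {
        uint start;
        #pragma omp atomic capture
        { start = next; next += CHUNK; }
        if (start >= x*z) break;
        uint end = start + CHUNK < x*z ? start + CHUNK : x*z;
        for (uint k = start; k < end; k++) {
            if (inFunc(p, index)) {
                // record k in the thread-private array, as before
            }
        }
    }
}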

The code you provide looks like you are trying to fill the XY data in sequential order. In this case OMP multithreading is probably not the tool for the job, as threads should (in the best case) avoid communication as much as possible. You could introduce an atomic counter (see the sketch below), but then again, it is probably going to be faster just doing it sequentially.
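For reference, the atomic-counter variant might look like the sketch below. Note that it keeps XY densely packed but the pairs are no longer in (i,j) order, and every hit serializes on nind:

uint nind = 0;
#pragma omp parallel for collapse(2)
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            uint mine;
            // atomically reserve one slot in XY
            #pragma omp atomic capture
            mine = nind++;
            XY[2*mine]     = i;
            XY[2*mine + 1] = j;
        }
    }
}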

Also, what do you want to achieve by optimizing it? x and z are not too big, so I doubt that you will get a substantial speed increase even if you reformulate your problem in a parallel fashion.

If you do want parallel execution, map your indexes to the array, e.g. (not tested, but should work):

#pragma omp parallel for shared(XY)
for (uint i = 0; i < x; i++) {
    for (uint j = 0; j < z; j++) {
        if (inFunc(p, index)) {
            uint idx = 2 * (i * z + j);   // each (i,j) owns its own slot in XY
            XY[idx] = i;
            XY[idx + 1] = j;
        }
    }
}

However, you will then have gaps in your array XY, which may or may not be a problem for you. If it is, one way to compact the array afterwards is sketched below.
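A hedged sketch of such a compaction pass: it assumes XY was pre-filled with a sentinel such as UINT_MAX before the parallel loop, so that untouched slots can be recognized (valid i values are always smaller than x, hence never equal to the sentinel):

#include <limits.h>

// Serial compaction: keep only the slots that were actually written
uint nind = 0;
for (uint k = 0; k < x*z; k++) {
    if (XY[2*k] != UINT_MAX) {
        XY[2*nind]     = XY[2*k];
        XY[2*nind + 1] = XY[2*k + 1];
        nind++;
    }
}
// XY[0 .. 2*nind-1] now holds the pairs densely, in (i,j) order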
