Can a thread-local copy of select elements of a shared 2D array be created in a parallel region? (shared, private, barrier: OpenMP)

I have a 2-D grid of n x n elements. In one iteration, I calculate the value of each element by averaging the values of its four neighbors. That is:

    // Interior points only, so that every neighbor index stays in bounds.
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            grid[i][j] = (grid[i-1][j] + grid[i][j-1] + grid[i+1][j] + grid[i][j+1]) / 4.0;

I need to run the above nested loop for iter iterations. What I need is the following:

  1. I need the threads to calculate this average, wait until all the threads have finished calculating, and THEN update the grid in one go.
  2. The loop with iter iterations will run sequentially, but during every iteration, the value of grid[i][j] for every i and j should be calculated in parallel.

In order to do that, I have the following ideas and questions:

  1. Maybe make grid shared and give each thread a private copy of only the 4 elements of the grid that are needed for calculating grid[i][j]. (Basically, grid is shared by all threads, but every thread also has a local copy of the 4 iteration-specific elements.) Is this possible?
  2. Would a barrier in fact be needed so that all the threads finish before the next iteration starts?

I'm very new to the OpenMP way of thinking and I'm utterly lost in this simple problem. I'd be grateful if somebody could help resolve my confusion.

  1. In practice, you'd want to have (much) fewer threads than grid points, so each thread will be calculating a whole bunch of points (for example, one row). There is a certain overhead associated with starting OpenMP (or any other kind of) threads, and your program will be memory-bound rather than CPU-bound anyway. So starting a thread per grid point would defeat the whole purpose of parallelizing the computation. Hence, your idea #1 is not recommended (I am not quite sure I understood it correctly, though; maybe this is not what you were proposing).

  2. I would recommend (as others also pointed out in the comments on the question) that you allocate twice the memory needed to store the grid values and use two pointers that are swapped between iterations: one points to the memory holding the previous iteration's values, which is read-only, and the other points to the memory for the new iteration's values, which is write-only. Note that you only swap the pointers; you do not actually copy the memory. After the iterations are done, you can copy the final result into the desired location.

  3. Yes, you need to synchronize threads between iterations; however, in OpenMP this is usually done implicitly, simply by opening a parallel region within the iteration loop (there is an implicit barrier at the end of a parallel region):

        for (int iter = 0; iter < niter; ++iter) {
            #pragma omp parallel
            {
                // get range of points for current thread
                // loop over thread's points and apply the stencil
            }
        }

    or, using a parallel for construct:

        const int np = n*n;
        for (int iter = 0; iter < niter; ++iter) {
            #pragma omp parallel for
            for (int ip = 0; ip < np; ++ip) {
                const int i = ip / n;
                const int j = ip % n;
                // apply the stencil to [i,j]
            }
        }

    The second version will automatically distribute the work evenly among the available threads, which is most likely what you want. In the first version you have to do it manually.
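
Putting points 2 and 3 together, a minimal sketch might look like the following. It assumes the grid is stored row-major as n*n doubles and that the boundary rows and columns stay fixed, so only interior points are updated. The function name jacobi_sweeps, its signature, and its return convention are illustrative assumptions, not something from the question. If OpenMP is not enabled at compile time (for example, without -fopenmp on GCC or Clang), the pragma is ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (name and signature are assumptions for illustration).
 * Runs niter averaging sweeps over the interior of an n x n row-major grid,
 * updating it in place. Boundary values are left unchanged.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int jacobi_sweeps(double *grid, int n, int niter)
{
    assert(grid != NULL && n >= 3 && niter >= 0);

    const size_t stride = (size_t)n;
    const size_t count = stride * stride;

    double *scratch = malloc(count * sizeof *scratch);
    if (scratch == NULL)
        return -1;

    /* Copy once so that the fixed boundary values exist in both buffers. */
    memcpy(scratch, grid, count * sizeof *scratch);

    double *prev = grid;    /* only read during a sweep    */
    double *next = scratch; /* only written during a sweep */

    for (int iter = 0; iter < niter; ++iter) {
        /* Rows are divided among the threads. The implicit barrier at the
         * end of the parallel for guarantees that every point has been
         * written before the pointers are swapped. */
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i) {
            for (int j = 1; j < n - 1; ++j) {
                const size_t c = (size_t)i * stride + (size_t)j;
                next[c] = (prev[c - stride] + prev[c - 1] +
                           prev[c + stride] + prev[c + 1]) / 4.0;
            }
        }

        /* Swap the pointers only; no grid data is copied here. */
        double *tmp = prev;
        prev = next;
        next = tmp;
    }

    /* After the last swap, prev points at the newest values. */
    if (prev != grid)
        memcpy(grid, prev, count * sizeof *grid);

    free(scratch);
    return 0;
}
```

Calling jacobi_sweeps(grid, n, iter) then covers both requirements from the question: each sweep reads only the previous iteration's values, and the whole grid is effectively updated "in one go" when the pointers are swapped. No per-thread private copies of the four neighbors are needed, because nothing writes to prev during a sweep.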
