
Why is parallel execution slower than sequential for this code?

#pragma omp parallel
{
 for (i=1; i<1024; i++)
  #pragma omp for
  for (j=1; j<1024; j++)
   A[i][j] = 2*A[i-1][j];
}

I'm using 12 threads to execute this code. Any suggestions on what I should do to speed it up?

Assuming A's element type is smaller than 64 bytes, trying to parallelize the inner loop this way will most likely cause false sharing of cache lines.

Say A is an aligned array of 4-byte ints; then A[i][0] through A[i][15] sit in the same cache line. This means all 12 threads attempt to read that line simultaneously, each for the part it needs. Reading alone would merely leave the line shared between the cores, but each thread also tries to write its part back, leading each core to attempt to take exclusive ownership of the line in order to modify it.

CPU caches are usually based on MESI-like coherence protocols, so a store attempt issues a read-for-ownership that invalidates the line in every core except the requester. Issuing 12 such requests in parallel (or rather 6, if you have 6 cores with 2 hardware threads each) results in a race: the first core to win the line may very well have it snooped away before it even has a chance to modify it (although that's not likely). The result is quite messy, and it may take a while before the line travels to each core in turn, gets modified, and is then snooped out by another core. The same contention recurs for each of the next consecutive groups of 16 elements (again, assuming int).

What you might do is:

  • Make sure that each individual thread works on its own cache line, by adding an inner loop that runs over the required number of elements per line, and parallelizing the outer loop that strides over the array in steps of that many elements.
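A minimal sketch of this first suggestion (the array name, function name, and double element type are assumptions for illustration; 8 doubles fill one 64-byte cache line). Each parallel iteration handles exactly one cache line, so no two threads ever write to the same line:

```c
#define N 1024
#define PER_LINE 8                     /* 8-byte doubles per 64-byte cache line */

static double A1[N][N];

void double_rows_linewise(void)
{
    /* seed the first row so the recurrence has something to read */
    for (int j = 0; j < N; j++)
        A1[0][j] = 1.0;

    for (int i = 1; i < N; i++) {      /* rows stay sequential: row i reads row i-1 */
        #pragma omp parallel for
        for (int jb = 0; jb < N; jb += PER_LINE)   /* one cache line per chunk */
            for (int j = jb; j < jb + PER_LINE; j++)
                A1[i][j] = 2 * A1[i-1][j];
    }
}
```

Note the parallel region is re-entered for every row, so fork/join overhead is paid 1023 times, which is part of why this variant still underperforms.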

This, however, prevents you from reaching the full potential of the CPU, as you lose the spatial locality and the streaming property of your code. Instead, you may:

  • Parallelize the outer loop so that each thread works over a few lines, thereby allowing it to own an entire consecutive stream of memory. However, since you need ordering between the lines, you may have to do a little tweaking here (e.g. a transpose).
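One way to sketch the "tweaking" (names and double element type again assumed): instead of transposing, split the columns into contiguous blocks, one block per parallel chunk. Each thread then walks the rows in order within its own block, so the A[i-1][j] dependency never crosses threads and no cache line is shared:

```c
#define N 1024
#define BLOCK 128                      /* columns per chunk; a multiple of the cache line */

static double B[N][N];

void double_rows_blockwise(void)
{
    for (int j = 0; j < N; j++)        /* seed the first row */
        B[0][j] = 1.0;

    #pragma omp parallel for schedule(static)
    for (int jb = 0; jb < N; jb += BLOCK)      /* each chunk owns columns [jb, jb+BLOCK) */
        for (int i = 1; i < N; i++)            /* rows in dependency order */
            for (int j = jb; j < jb + BLOCK; j++)
                B[i][j] = 2 * B[i-1][j];
}
```

Here the parallel region is entered once, and each thread streams through its own column block row by row, preserving the sequential access pattern within the block.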

There's still a downside here: if a thread encounters too many streams, it might lose track of them. A third approach is, therefore:

  • Tile the array: break it into sets of, say, 48 lines, distribute them between the threads so that each runs over a few full lines (the transpose trick still applies here, by the way), and then continue to the next group.
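A rough sketch of the tiled variant, interpreting "lines" as cache lines after the transpose trick (all names and the double element type are assumptions): the columns are cut into tiles of 48, i.e. six cache lines of doubles, and the tiles are handed out to threads, each of which processes its tile's rows in order before taking the next tile:

```c
#define N 1024
#define TILE 48                        /* columns per tile: six 64-byte lines of doubles */

static double C[N][N];

void double_rows_tiled(void)
{
    for (int j = 0; j < N; j++)        /* seed the first row */
        C[0][j] = 1.0;

    #pragma omp parallel for schedule(dynamic)
    for (int jt = 0; jt < N; jt += TILE)           /* one tile of columns per iteration */
        for (int i = 1; i < N; i++)                /* rows in dependency order */
            for (int j = jt; j < jt + TILE && j < N; j++)   /* last tile is partial */
                C[i][j] = 2 * C[i-1][j];
}
```

Since 1024 is not a multiple of 48, the inner bound is clamped for the final partial tile; `schedule(dynamic)` lets threads grab the next tile as they finish, which keeps the number of live streams per thread small.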

1) How many cores do you have? You can't get any more parallel speedup than that, and as others said, probably a lot less.

2) It looks like the inner index j should start at 0, not 1: only the row index i carries a dependency (A[i][j] reads A[i-1][j]), so as written, column 0 is never updated.

3) That inner loop is crying out for pointers and unrolling, as in:

double* pa = &A[i][0];
double* pa1 = &A[i-1][0];
for (j = 0; j < 1024; j += 8){
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
    *pa++ = 2 * *pa1++;
}

or...

double* pa = &A[i][0];
double* paEnd = &A[i][1024];
double* pa1 = &A[i-1][0];
for (; pa < paEnd; pa += 8, pa1 += 8){
    pa[0] = 2 * pa1[0];
    pa[1] = 2 * pa1[1];
    pa[2] = 2 * pa1[2];
    pa[3] = 2 * pa1[3];
    pa[4] = 2 * pa1[4];
    pa[5] = 2 * pa1[5];
    pa[6] = 2 * pa1[6];
    pa[7] = 2 * pa1[7];
}

whichever is faster.
