
Tiled Matrix Multiplication using AVX

I have coded the following C function for multiplying two NxN matrices, using tiling/blocking and AVX vectors to speed up the calculation. Right now, though, I'm getting a segmentation fault when I try to combine AVX intrinsics with tiling. Any idea why that happens?

Also, is there a better memory access pattern for matrix B? Maybe transposing it first, or even swapping the k and j loops? Because right now I'm traversing it column-wise, which is probably not very efficient with regard to spatial locality and cache lines.

    void mmult(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
    {
      int i, j, k, i0, j0, k0;
      // double sum;
      __m256d sum;
      for (i0 = 0; i0 < SIZE_M; i0 += BLOCKSIZE) {
        for (k0 = 0; k0 < SIZE_N; k0 += BLOCKSIZE) {
          for (j0 = 0; j0 < SIZE_K; j0 += BLOCKSIZE) {
            for (i = i0; i < MIN(i0 + BLOCKSIZE, SIZE_M); i++) {
              for (j = j0; j < MIN(j0 + BLOCKSIZE, SIZE_K); j++) {
                // sum = C[i][j];
                sum = _mm256_load_pd(&C[i][j]);
                for (k = k0; k < MIN(k0 + BLOCKSIZE, SIZE_N); k++) {
                  // sum += A[i][k] * B[k][j];
                  sum = _mm256_add_pd(sum, _mm256_mul_pd(_mm256_load_pd(&A[i][k]), _mm256_broadcast_sd(&B[k][j])));
                }
                // C[i][j] = sum;
                _mm256_store_pd(&C[i][j], sum);
              }
            }
          }
        }
      }
    }

_mm256_load_pd is an alignment-required load, but you're only stepping by k++, not k += 4, in the inner-most loop that loads a 32-byte vector of 4 doubles. So it faults because 3 of every 4 loads are misaligned.

You don't want to be doing overlapping loads; your real bug is the indexing. If your input pointers are 32-byte aligned, you should be able to keep using _mm256_load_pd instead of _mm256_loadu_pd. So using _mm256_load_pd successfully caught your bug, instead of working but giving numerically wrong results.


Your strategy for vectorizing four row*column dot products (to produce a C[i][j+0..3] vector) should load 4 contiguous doubles from 4 different columns (B[k][j+0..3] via a vector load from &B[k][j]), and broadcast 1 double from A[i][k]. Remember you want 4 dot products in parallel.
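A minimal sketch of that layout (not the question's exact code, and with tiling omitted to keep it short): j, not k, advances by 4, so each __m256d holds C[i][j+0..3]; B supplies 4 contiguous doubles per step and A supplies one broadcast element. The fixed SIZE and the unaligned loads are assumptions for illustration; with guaranteed 32-byte-aligned rows the aligned variants work as well.

```c
#include <immintrin.h>

#define SIZE 8   /* illustrative; assumed to be a multiple of 4 */

/* target attribute so this compiles even without -mavx on GCC/Clang */
__attribute__((target("avx")))
void mmult_vec(double A[SIZE][SIZE], double B[SIZE][SIZE], double C[SIZE][SIZE])
{
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j += 4) {            /* 4 dot products at once */
            __m256d sum = _mm256_loadu_pd(&C[i][j]);   /* C[i][j+0..3] */
            for (int k = 0; k < SIZE; k++) {
                __m256d a = _mm256_broadcast_sd(&A[i][k]); /* 1 element of A */
                __m256d b = _mm256_loadu_pd(&B[k][j]);     /* 4 contiguous from B */
                sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
            }
            _mm256_storeu_pd(&C[i][j], sum);           /* store C[i][j+0..3] */
        }
    }
}
```

Note that each iteration of k still touches a new row of B, but now reads 4 contiguous doubles from it, so whole cache lines get used instead of one element per line.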

Another strategy might involve a horizontal sum at the end, down to a scalar C[i][j] += horizontal_add(__m256d), but I think that would require transposing one input first so both the row and the column vectors are in contiguous memory for one dot product. But then you need shuffles for a horizontal sum at the end of each inner loop.
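A sketch of what that could look like (hsum256 and dot_avx are illustrative names, not from the question): once one input is transposed, a row of A and a row of the transposed B are both contiguous, one accumulator collects 4 partial products per lane, and the shuffles at the end reduce it to a scalar.

```c
#include <immintrin.h>

__attribute__((target("avx")))
static double hsum256(__m256d v)
{
    __m128d lo   = _mm256_castpd256_pd128(v);      /* lanes 0,1 */
    __m128d hi   = _mm256_extractf128_pd(v, 1);    /* lanes 2,3 */
    lo = _mm_add_pd(lo, hi);                       /* (v0+v2, v1+v3) */
    __m128d high = _mm_unpackhi_pd(lo, lo);        /* upper half in both lanes */
    return _mm_cvtsd_f64(_mm_add_sd(lo, high));    /* v0+v1+v2+v3 */
}

/* Dot product of two contiguous length-n arrays; n assumed a multiple of 4.
   a = row of A, bt = row of the transposed B. */
__attribute__((target("avx")))
double dot_avx(const double *a, const double *bt, int n)
{
    __m256d acc = _mm256_setzero_pd();
    for (int k = 0; k < n; k += 4)
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(a + k),
                                               _mm256_loadu_pd(bt + k)));
    return hsum256(acc);   /* the per-element shuffles mentioned above */
}
```

The horizontal sum costs a few shuffle/add uops once per C element, which is why the broadcast strategy (no reduction needed) is usually preferred when 4 independent dot products are available.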

You probably also want to use at least 2 sum variables so you can read a whole cache line at once, hide FMA latency in the inner loop, and hopefully bottleneck on throughput. Or better, do 4 or 8 vectors in parallel. So you produce C[i][j+0..15] as sum0, sum1, sum2, sum3. (Or use an array of __m256d; compilers will typically fully unroll a loop of 8 and optimize the array into registers.)
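As a rough sketch of that unrolling (N fixed at 16 here purely for illustration, assumed a multiple of 16), four independent accumulators cover C[i][j+0..15], and each broadcast of A[i][k] is reused four times:

```c
#include <immintrin.h>

#define N 16   /* illustrative; assumed to be a multiple of 16 */

__attribute__((target("avx")))
void mmult_4acc(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 16) {   /* 16 doubles = 2 cache lines of C */
            __m256d s0 = _mm256_loadu_pd(&C[i][j + 0]);
            __m256d s1 = _mm256_loadu_pd(&C[i][j + 4]);
            __m256d s2 = _mm256_loadu_pd(&C[i][j + 8]);
            __m256d s3 = _mm256_loadu_pd(&C[i][j + 12]);
            for (int k = 0; k < N; k++) {
                __m256d a = _mm256_broadcast_sd(&A[i][k]);  /* reused 4 times */
                s0 = _mm256_add_pd(s0, _mm256_mul_pd(a, _mm256_loadu_pd(&B[k][j + 0])));
                s1 = _mm256_add_pd(s1, _mm256_mul_pd(a, _mm256_loadu_pd(&B[k][j + 4])));
                s2 = _mm256_add_pd(s2, _mm256_mul_pd(a, _mm256_loadu_pd(&B[k][j + 8])));
                s3 = _mm256_add_pd(s3, _mm256_mul_pd(a, _mm256_loadu_pd(&B[k][j + 12])));
            }
            _mm256_storeu_pd(&C[i][j + 0],  s0);
            _mm256_storeu_pd(&C[i][j + 4],  s1);
            _mm256_storeu_pd(&C[i][j + 8],  s2);
            _mm256_storeu_pd(&C[i][j + 12], s3);
        }
    }
}
```

The four dependency chains are independent, so the out-of-order core can keep multiple multiplies/adds (or FMAs) in flight instead of stalling on one accumulator's latency.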


I think you only need 5 nested loops to block over rows and columns, although apparently 6 nested loops are also a valid option: see loop tiling/blocking for large dense matrix multiplication, which has a 5-nested loop in the question but a 6-nested loop in an answer. (Just scalar, though, not vectorized.)

There might be other bugs here besides the row*column dot-product strategy; I'm not sure.


If you're using AVX, you might want to use FMA as well, unless you need to run on Sandybridge/Ivybridge or AMD Bulldozer. (Piledriver and later have FMA3.)
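With FMA3 the multiply+add pair in the inner loop collapses into one _mm256_fmadd_pd; a sketch under the same illustrative layout as above (fixed size, unaligned loads, no tiling), which needs an FMA-capable CPU at runtime:

```c
#include <immintrin.h>

#define SZ 8   /* illustrative; assumed to be a multiple of 4 */

__attribute__((target("avx,fma")))   /* needs Haswell / Piledriver or later */
void mmult_fma(double A[SZ][SZ], double B[SZ][SZ], double C[SZ][SZ])
{
    for (int i = 0; i < SZ; i++) {
        for (int j = 0; j < SZ; j += 4) {
            __m256d sum = _mm256_loadu_pd(&C[i][j]);
            for (int k = 0; k < SZ; k++)
                /* sum += A[i][k] * B[k][j+0..3]: one fused uop, one rounding */
                sum = _mm256_fmadd_pd(_mm256_broadcast_sd(&A[i][k]),
                                      _mm256_loadu_pd(&B[k][j]), sum);
            _mm256_storeu_pd(&C[i][j], sum);
        }
    }
}
```

Besides throughput, the single rounding step of the fused multiply-add can make results differ slightly from the separate mul+add version, which is expected.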

Other matmul strategies include adding into the destination inside the inner loop, so you're loading C and A inside the inner loop, with a load from B hoisted. (Or B and A swapped, I forget.) "What Every Programmer Should Know About Memory" has a vectorized cache-blocked example that works this way in an appendix, for SSE2 __m128d vectors: https://www.akkadia.org/drepper/cpumemory.pdf
