
AVX intrinsics for tiled matrix multiplication [on hold]

I was trying to use AVX512 intrinsics to vectorize my (tiled) matrix-multiplication loop. I used __m256d variables to hold intermediate results and then store them into my result matrix. However, somehow this triggers memory corruption. I have no idea why, as the non-AVX version works fine. Another weird thing is that the tile size now somehow affects the result.

The matrix structs are attached in the following code section. The function takes two matrix pointers, m1 and m2, and an integer tileSize. Thanks to @harold's feedback, I have now replaced the _mm256_load_pd for matrix m1 with a broadcast, but the memory corruption problem still persists. I have also attached the memory-corruption output below.


__m256d rResult, rm1, rm2, rmult;


    for (int bi = 0; bi < result->row; bi += tileSize) {
         for (int bj = 0; bj < result->col; bj += tileSize) {
             for (int bk = 0; bk < m1->col; bk += tileSize) {
                 for (int i = 0; i < tileSize; i++ ) {
                     for (int j = 0; j < tileSize; j+=4) {
                         rResult = _mm256_setzero_pd();
                         for (int k = 0; k < tileSize; k++) {

                            //  result->val[bi+i][bj+j] += m1.val[bi+i][bk+k]*m2.val[bk+k][bj+j];


                             rm1 = _mm256_broadcast_pd((__m128d const *) &m1->val[bi+i][bk+k]);
                             rm2 = _mm256_load_pd(&m2->val[bk+k][bj+j]);
                             rmult = _mm256_mul_pd(rm1,rm2);
                             rResult = _mm256_add_pd(rResult,rmult);
                             _mm256_store_pd(&result->val[bi+i][bj+j],rResult);
                         }
                     }  
                 }
             }
         }
     }
return result;
*** Error in `./matrix': free(): invalid next size (fast): 0x0000000001880910 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81609)[0x2b04a26d0609]
./matrix[0x4016cc]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b04a2671495]
./matrix[0x400e29]
======= Memory map: ========
00400000-0040c000 r-xp 00000000 00:2c 6981358608                         /home/matrix
0060b000-0060c000 r--p 0000b000 00:2c 6981358608                         /home/matrix
0060c000-0060d000 rw-p 0000c000 00:2c 6981358608                         /home/matrix
01880000-018a1000 rw-p 00000000 00:00 0                                  [heap]
2b04a1f13000-2b04a1f35000 r-xp 00000000 00:16 12900                      /usr/lib64/ld-2.17.so
2b04a1f35000-2b04a1f3a000 rw-p 00000000 00:00 0
2b04a1f4e000-2b04a1f52000 rw-p 00000000 00:00 0
2b04a2134000-2b04a2135000 r--p 00021000 00:16 12900                      /usr/lib64/ld-2.17.so
2b04a2135000-2b04a2136000 rw-p 00022000 00:16 12900                      /usr/lib64/ld-2.17.so
2b04a2136000-2b04a2137000 rw-p 00000000 00:00 0
2b04a2137000-2b04a2238000 r-xp 00000000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2238000-2b04a2437000 ---p 00101000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2437000-2b04a2438000 r--p 00100000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2438000-2b04a2439000 rw-p 00101000 00:16 13188                      /usr/lib64/libm-2.17.so
2b04a2439000-2b04a244e000 r-xp 00000000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a244e000-2b04a264d000 ---p 00015000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264d000-2b04a264e000 r--p 00014000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264e000-2b04a264f000 rw-p 00015000 00:16 12867                      /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264f000-2b04a2811000 r-xp 00000000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2811000-2b04a2a11000 ---p 001c2000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2a11000-2b04a2a15000 r--p 001c2000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2a15000-2b04a2a17000 rw-p 001c6000 00:16 13172                      /usr/lib64/libc-2.17.so
2b04a2a17000-2b04a2a1c000 rw-p 00000000 00:00 0
2b04a2a1c000-2b04a2a1e000 r-xp 00000000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a2a1e000-2b04a2c1e000 ---p 00002000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a2c1e000-2b04a2c1f000 r--p 00002000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a2c1f000-2b04a2c20000 rw-p 00003000 00:16 13184                      /usr/lib64/libdl-2.17.so
2b04a4000000-2b04a4021000 rw-p 00000000 00:00 0
2b04a4021000-2b04a8000000 ---p 00000000 00:00 0
7ffc8448e000-7ffc844b1000 rw-p 00000000 00:00 0                          [stack]
7ffc845ed000-7ffc845ef000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Aborted

That code loads a small row vector from m1 and a small row vector from m2 and multiplies them, which is not how matrix multiplication works; I assume it's a direct vectorization of the identical scalar loop. You can use a broadcast-load from m1: that way the product with the row vector from m2 yields a row vector of the result, which is convenient. (The other way around, broadcasting from m2, you get a column vector of the result, which is tricky to store, unless of course you use a column-major matrix layout.)
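
A minimal sketch of that broadcast approach (assuming row-major storage and a tileSize that is a multiple of 4). Note that _mm256_broadcast_sd broadcasts a single double to all four lanes, which is what a matmul kernel wants here; the question's _mm256_broadcast_pd duplicates a 128-bit pair instead:

    for (int k = 0; k < tileSize; k++) {
        /* broadcast one element of m1 across the whole vector */
        __m256d a = _mm256_broadcast_sd(&m1->val[bi+i][bk+k]);
        /* 4 consecutive doubles from one row of m2 */
        __m256d b = _mm256_loadu_pd(&m2->val[bk+k][bj+j]);
        rResult = _mm256_add_pd(rResult, _mm256_mul_pd(a, b));
    }
    /* store once, after the k loop, rather than on every iteration */
    _mm256_storeu_pd(&result->val[bi+i][bj+j], rResult);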

Never resetting rResult is also wrong, and it takes extra care when combined with tiling, because tiling means individual results are set aside and then picked up again later. It's convenient to implement C += A*B, because then you don't have to distinguish between the second time a result is worked on (loading rResult back out of the result matrix) and the first time: instead of zeroing the accumulator on the first visit, with C += A*B the first visit is also just a load out of the result.
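
A sketch of that C += A*B formulation, assuming the result matrix is zero-initialized before the first pass over the tiles:

    /* Load the current partial result instead of zeroing, so the first
       and later visits to a tile are handled identically. */
    __m256d acc = _mm256_loadu_pd(&result->val[bi+i][bj+j]);
    for (int k = 0; k < tileSize; k++) {
        __m256d a = _mm256_broadcast_sd(&m1->val[bi+i][bk+k]);
        __m256d b = _mm256_loadu_pd(&m2->val[bk+k][bj+j]);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(a, b));
    }
    _mm256_storeu_pd(&result->val[bi+i][bj+j], acc);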

There are also some performance bugs:

  • Using one accumulator. This limits the inner loop to running once every 4 cycles (Skylake) in the long run, because of the loop-carried dependency through the addition (or FMA). There should be 2 FMAs per cycle, but this way there is one FMA every 4 cycles: 1/8th speed.
  • Using a 2:1 load-to-FMA ratio (assuming the mul+add is contracted). It needs to be 1:1 or better to avoid being bottlenecked by load throughput; a 2:1 ratio is limited to half speed.

The solution to both is multiplying a small column vector from m1 with a small row vector from m2 in the inner loop, summing into a small matrix of accumulators rather than just one. For example, if you use a 3x16 region (3x4 vectors, with a vector length of 4; the vectors correspond to loads from m2, while from m1 you do broadcast-loads), there are 12 accumulators and therefore 12 independent dependency chains: enough to hide the high latency-throughput product of FMA (2 per cycle, but 4 cycles of latency on Skylake, so you need at least 8 independent chains; at least 10 on Haswell). It also means there are 7 loads and 12 FMAs in the inner loop, even better than 1:1, and it can even support turbo frequencies without overclocking the cache.
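
A sketch of such a micro-kernel under those assumptions (hypothetical kernel_3x16 helper; flat row-major arrays with explicit leading dimensions, FMA available, and C holding the running sum in the C += A*B style):

    #include <immintrin.h>

    /* 3 rows x 4 vectors of 4 doubles = 12 independent accumulators.
       Per k iteration: 3 broadcast-loads from A, 4 vector loads from B
       (7 loads) and 12 FMAs. */
    static void kernel_3x16(const double *A, const double *B, double *C,
                            int lda, int ldb, int ldc, int K)
    {
        __m256d c[3][4];
        for (int r = 0; r < 3; r++)
            for (int v = 0; v < 4; v++)
                c[r][v] = _mm256_loadu_pd(&C[r * ldc + v * 4]);

        for (int k = 0; k < K; k++) {
            __m256d b0 = _mm256_loadu_pd(&B[k * ldb +  0]);
            __m256d b1 = _mm256_loadu_pd(&B[k * ldb +  4]);
            __m256d b2 = _mm256_loadu_pd(&B[k * ldb +  8]);
            __m256d b3 = _mm256_loadu_pd(&B[k * ldb + 12]);
            for (int r = 0; r < 3; r++) {
                __m256d a = _mm256_broadcast_sd(&A[r * lda + k]);
                c[r][0] = _mm256_fmadd_pd(a, b0, c[r][0]);
                c[r][1] = _mm256_fmadd_pd(a, b1, c[r][1]);
                c[r][2] = _mm256_fmadd_pd(a, b2, c[r][2]);
                c[r][3] = _mm256_fmadd_pd(a, b3, c[r][3]);
            }
        }

        for (int r = 0; r < 3; r++)
            for (int v = 0; v < 4; v++)
                _mm256_storeu_pd(&C[r * ldc + v * 4], c[r][v]);
    }

Compile with -mfma (or -march=native); a compiler should fully unroll the short r and v loops, leaving the 12 accumulators in registers.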

I would also like to note that making the tile size the same in every dimension is not necessarily best. Maybe it is, but probably not: the dimensions do behave a little differently.

A more advanced performance issue:

  • Not re-packing tiles. This means tiles span more pages than necessary, which hurts the effectiveness of the TLB. You can easily get into a situation where your tiles fit in the cache but are spread over too many pages to fit in the TLB. TLB thrashing is not good.

Using asymmetric tile sizes you can arrange for either the m1 tiles or the m2 tiles to be TLB-friendly, but not both at the same time.
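
A minimal sketch of re-packing (hypothetical pack_tile helper, C11 aligned_alloc): copying a tile into one contiguous, aligned buffer means it spans the minimum number of pages, and the micro-kernel then reads it sequentially.

    #include <stdlib.h>
    #include <string.h>

    /* Copy a rows x cols tile into a contiguous 64-byte-aligned buffer.
       `src` points at the tile's top-left element; `stride` is the source
       matrix's row length in doubles. Caller frees the buffer. */
    static double *pack_tile(const double *src, int stride, int rows, int cols)
    {
        size_t bytes = sizeof(double) * rows * cols;
        bytes = (bytes + 63) & ~(size_t)63;  /* aligned_alloc wants a multiple of 64 */
        double *buf = aligned_alloc(64, bytes);
        for (int r = 0; r < rows; r++)
            memcpy(&buf[r * cols], &src[(size_t)r * stride], sizeof(double) * cols);
        return buf;
    }

In a real implementation you would allocate the packing buffers once and reuse them across tiles rather than calling aligned_alloc per tile.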

If you care about performance, you normally want one contiguous chunk of memory, not an array of pointers to rows.
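
For example (a hypothetical layout, since the question's matrix struct isn't shown): element (i, j) is found by index arithmetic instead of chasing a row pointer, so the whole matrix is one allocation.

    typedef struct {
        int row, col;
        double *val;   /* row * col doubles in one contiguous allocation */
    } matrix;

    static inline double mat_get(const matrix *m, int i, int j)
    {
        return m->val[(size_t)i * m->col + j];
    }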

Anyway, you're probably reading off the end of a row if your tile size isn't a multiple of 4 doubles per vector. Or, if your row or column counts aren't a multiple of the tile size, you need to stop after the last full tile and write cleanup code for the end.

e.g. bi < result->row - (tileSize-1) for the outer loops

If your tile size isn't a multiple of 4, then you'd also need j < tileSize-3 for the vectorized loop. But hopefully you are doing power-of-2 loop tiling / cache blocking. You'd still want a size - 3 boundary for the vector part of a partial tile, then probably scalar cleanup for the last few elements. (Or, if you can use an unaligned final vector that ends exactly at the end of a row, that can work, maybe with masked loads/stores. But that's trickier for matmul than for algorithms that make just a single pass.)
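
A sketch of that split for the vectorized j dimension, reusing the question's names (the clamp handles a partial tile at the matrix edge):

    /* Width of this tile, clamped so a partial edge tile never runs past
       the end of a row. */
    int jLimit = tileSize;
    if (bj + jLimit > result->col)
        jLimit = result->col - bj;

    int j = 0;
    for (; j + 4 <= jLimit; j += 4) {
        /* ...4-wide vector body as sketched earlier... */
    }
    for (; j < jLimit; j++)            /* scalar cleanup, last 1-3 columns */
        for (int k = 0; k < tileSize; k++)
            result->val[bi+i][bj+j] += m1->val[bi+i][bk+k] * m2->val[bk+k][bj+j];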
