
SSE matrix-matrix multiplication

I'm having trouble doing matrix-matrix multiplication with SSE in C.

Here is what I got so far:

#define N 1000

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {
  int i, j, k;
  __m128i vA, vB, vR;

  for(i = 0; i < N; ++i) {
    for(j = 0; j < N; ++j) {
        vR = _mm_setzero_si128();
        for(k = 0; k < N; k += 4) {
            //result[i][j] += mat1[i][k] * mat2[k][j];
            vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
            vB = _mm_loadu_si128((__m128i*)&mat2[k][j]); //how well does the k += 4 work here? Should it be unrolled?
            vR = _mm_add_epi32(vR, _mm_mul_epi32(vA, vB));
        }
        vR = _mm_hadd_epi32(vR, vR);
        vR = _mm_hadd_epi32(vR, vR);
        result[i][j] += _mm_extract_epi32(vR, 0);
    }
  }
}

I can't seem to make it give the correct results. Am I missing something? And searching doesn't seem to help much - every result is either only doing 4x4 matrices, mat-vec, or some special magic that's not very readable and hard to understand...

You're right, your vB is the problem. You're loading 4 consecutive integers, but mat2[k+0..3][j] aren't contiguous. You're actually getting mat2[k][j+0..3].


I forget what works well for matmul. Sometimes it works well to produce 4 results in parallel, instead of doing a horizontal sum for every result.

Transposing one of your input matrices works, and costs O(N^2). It's worth it because it means the O(N^3) matmul can use sequential accesses, and your current loop structure becomes SIMD-friendly.
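For instance, a minimal sketch of that approach (assuming the same N as above, N a multiple of 4, and SSE4.1 for _mm_mullo_epi32 / _mm_extract_epi32) could look like this:

#include <smmintrin.h>  // SSE4.1

void matmulSSE_transpose(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    static int mat2T[N][N];                 // O(N^2) transposed copy of mat2
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            mat2T[j][i] = mat2[i][j];

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            __m128i vSum = _mm_setzero_si128();
            for (int k = 0; k < N; k += 4) {            // both loads are now sequential
                __m128i vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                __m128i vB = _mm_loadu_si128((__m128i*)&mat2T[j][k]);
                vSum = _mm_add_epi32(vSum, _mm_mullo_epi32(vA, vB));
            }
            vSum = _mm_hadd_epi32(vSum, vSum);          // horizontal sum, once per result
            vSum = _mm_hadd_epi32(vSum, vSum);
            result[i][j] = _mm_extract_epi32(vSum, 0);  // overwrites; use += to accumulate
        }
    }
}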

There are even better ways, such as transposing small blocks right before use, so they're still hot in L1 cache when you read them again. Or looping over a destination row and adding in one result, instead of accumulating a full result for a single or small set of row*column dot products. Cache blocking, aka loop tiling, is one key to good matmul performance. See also What Every Programmer Should Know About Memory?, which has a cache-blocked SIMD FP matmul example in an appendix without a transpose.
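As a rough illustration of loop tiling (not the code from that appendix), here is a sketch built around the broadcast-style i/k/j loop shown later in this answer. It assumes result is zero-initialized, SSE4.1 is available, and N is a multiple of both BLOCK and 4:

#include <smmintrin.h>  // SSE4.1

#define BLOCK 40   // 40 divides N = 1000 and is a multiple of 4; tune for your caches

void matmulSSE_tiled(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCK)
      for (int kk = 0; kk < N; kk += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
          // work on BLOCK x BLOCK tiles so the mat2 and result rows stay hot in cache
          for (int i = ii; i < ii + BLOCK; ++i)
            for (int k = kk; k < kk + BLOCK; ++k) {
              __m128i vA = _mm_set1_epi32(mat1[i][k]);            // broadcast one element
              for (int j = jj; j < jj + BLOCK; j += 4) {
                __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                __m128i vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                _mm_storeu_si128((__m128i*)&result[i][j], vR);
              }
            }
}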

Much has been written about optimizing matrix multiplies, with SIMD and with cache-blocking. I suggest you google it up. Most of it is probably talking about FP, but it all applies to integer as well.

(Except that SSE/AVX only has FMA for FP, not for 32-bit integers, and the 8 and 16-bit input PMADD instructions do horizontal adds of pairs.)
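For illustration only, a hypothetical 16-bit helper showing what PMADDWD does; it doesn't apply directly to the OP's 32-bit matrices:

#include <emmintrin.h>  // SSE2

// PMADDWD multiplies eight 16-bit pairs and adds adjacent products,
// yielding four 32-bit partial dot-product sums in one instruction.
static inline __m128i dot8_i16_partial(const short *a, const short *b) {
    __m128i va = _mm_loadu_si128((const __m128i*)a);  // 8 x int16
    __m128i vb = _mm_loadu_si128((const __m128i*)b);  // 8 x int16
    return _mm_madd_epi16(va, vb);   // {a0*b0+a1*b1, a2*b2+a3*b3, a4*b4+a5*b5, a6*b6+a7*b7}
}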


Actually I think you can produce 4 results in parallel here, if one input has been transposed already:

void matmulSSE(int mat1[N][N], int mat2[N][N], int result[N][N]) {

  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; j+=4) {   // vectorize over this loop
        __m128i vR = _mm_setzero_si128();
        for(int k = 0; k < N; k++) {   // not this loop
            //result[i][j] += mat1[i][k] * mat2[k][j];
            __m128i vA = _mm_set1_epi32(mat1[i][k]);  // load+broadcast is much cheaper than MOVD + 3 inserts (or especially 4x insert, which your new code is doing)
            __m128i vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);  // mat2[k][j+0..3]
            vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
        }
        _mm_storeu_si128((__m128i*)&result[i][j], vR);
    }
  }
}

A broadcast-load (or separate load+broadcast without AVX) is still much cheaper than a gather.

Your current code does the gather with 4 inserts, instead of breaking the dependency chain on the previous iteration's value by using a MOVD for the first element, so that's even worse. But even the best gather of 4 scattered elements is pretty bad compared to a load + PUNPCKLDQ. Not to mention that it makes your code need SSE4.1.

Although it needs SSE4.1 anyway for _mm_mullo_epi32 instead of the widening PMULDQ (_mm_mul_epi32).
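If you need to stay SSE2-only, one common workaround (a sketch, not part of the original answer) is to synthesize the low-32-bit multiply from two widening PMULUDQ multiplies plus shuffles:

#include <emmintrin.h>  // SSE2

// SSE2 emulation of _mm_mullo_epi32: multiply the even and odd 32-bit lanes
// with PMULUDQ (64-bit products), then interleave the low 32 bits of each.
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                        // a0*b0, a2*b2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));        // a1*b1, a3*b3
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0,0,2,0)),
                              _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0,0,2,0)));
}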

Note that integer multiply throughput is generally worse than FP multiply, especially on Haswell and later. FP FMA units only have 24-bit wide multipliers per 32-bit element (for FP mantissas), so using those for 32x32 => 32-bit integer requires splitting into two uops.

The first method is correct if you transpose the second matrix

This first version was posted by the OP as an edit to the question, where it doesn't belong. Moved it to a community-wiki answer just for posterity.

That first version is absolute garbage for performance, the worst possible way to vectorize: doing an hsum down to scalar inside the inner-most loop, and doing a manual gather with insert_epi32, not even a 4x4 transpose.
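For reference, a sketch of that load + PUNPCKLDQ idea: a 4x4 integer transpose that turns four contiguous row loads into four column vectors (SSE2 only; assumes all four loads are in bounds):

#include <emmintrin.h>  // SSE2

// Four row loads plus unpacks produce the four columns of a 4x4 block,
// which is far cheaper than gathering each column with per-element inserts.
static inline void transpose4x4_epi32(const int *row0, const int *row1,
                                      const int *row2, const int *row3,
                                      __m128i cols[4]) {
    __m128i r0 = _mm_loadu_si128((const __m128i*)row0);
    __m128i r1 = _mm_loadu_si128((const __m128i*)row1);
    __m128i r2 = _mm_loadu_si128((const __m128i*)row2);
    __m128i r3 = _mm_loadu_si128((const __m128i*)row3);
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);   // r0[0] r1[0] r0[1] r1[1]
    __m128i t1 = _mm_unpacklo_epi32(r2, r3);   // r2[0] r3[0] r2[1] r3[1]
    __m128i t2 = _mm_unpackhi_epi32(r0, r1);   // r0[2] r1[2] r0[3] r1[3]
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);   // r2[2] r3[2] r2[3] r3[3]
    cols[0] = _mm_unpacklo_epi64(t0, t1);      // element 0 of each row
    cols[1] = _mm_unpackhi_epi64(t0, t1);      // element 1 of each row
    cols[2] = _mm_unpacklo_epi64(t2, t3);      // element 2 of each row
    cols[3] = _mm_unpackhi_epi64(t2, t3);      // element 3 of each row
}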


Update: Woho! I finally figured it out. Besides the errors in my logic (thanks for the help, Peter Cordes) there was also the issue of _mm_mul_epi32() not working as I thought it did - I should've been using _mm_mullo_epi32() instead!

I know this is not the most effective code, but it was made to get it to work properly first - now I can move on to optimizing it.

(Note: don't use this, it's very, very slow)

// editor's note: this is the most naive and horrible way to vectorize
void matmulSSE_inefficient(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR, vSum;

    for(i = 0; i < N; ++i) {
        for(j = 0; j < N; ++j) {
            vR = _mm_setzero_si128();
            for(k = 0; k < N; k += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vA = _mm_loadu_si128((__m128i*)&mat1[i][k]);
                          // less braindead would be to start vB with movd, avoiding a false dep and one shuffle uop.
                          // vB = _mm_cvtsi32_si128(mat2[k][j]);   // but this manual gather is still very bad
                vB = _mm_insert_epi32(vB, mat2[k][j], 0);     // false dependency on old vB
                vB = _mm_insert_epi32(vB, mat2[k + 1][j], 1);  // bad spatial locality
                vB = _mm_insert_epi32(vB, mat2[k + 2][j], 2);  // striding down a column
                vB = _mm_insert_epi32(vB, mat2[k + 3][j], 3);
                vR = _mm_mullo_epi32(vA, vB);
                vR = _mm_hadd_epi32(vR, vR);                // very slow inside the inner loop
                vR = _mm_hadd_epi32(vR, vR);
                result[i][j] += _mm_extract_epi32(vR, 0);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);
                //printf("\n");
            }
        }
    }
}

End of the extremely inefficient code originally written by the OP


Update 2: converted the OP's example to an ikj loop-order version. This required an extra load of vR and moving the store into the inner loop, but setting vA could be moved up a loop level. It turned out faster.

// this is significantly better but doesn't do any cache-blocking
void matmulSSE_v2(int mat1[N][N], int mat2[N][N], int result[N][N]) {
    int i, j, k;
    __m128i vA, vB, vR;

    for(i = 0; i < N; ++i) {
        for(k = 0; k < N; ++k) {
            vA = _mm_set1_epi32(mat1[i][k]);
            for(j = 0; j < N; j += 4) {
                //result[i][j] += mat1[i][k] * mat2[k][j];
                vB = _mm_loadu_si128((__m128i*)&mat2[k][j]);
                vR = _mm_loadu_si128((__m128i*)&result[i][j]);
                vR = _mm_add_epi32(vR, _mm_mullo_epi32(vA, vB));
                _mm_storeu_si128((__m128i*)&result[i][j], vR);

                //DEBUG
                //printf("vA: %d, %d, %d, %d\n", vA.m128i_i32[0], vA.m128i_i32[1], vA.m128i_i32[2], vA.m128i_i32[3]);
                //printf("vB: %d, %d, %d, %d\n", vB.m128i_i32[0], vB.m128i_i32[1], vB.m128i_i32[2], vB.m128i_i32[3]);
                //printf("vR: %d, %d, %d, %d\n", vR.m128i_i32[0], vR.m128i_i32[1], vR.m128i_i32[2], vR.m128i_i32[3]);

                //printf("\n");
            }
        }
    }
}

These assume N is a multiple of 4, the vector width.

If that's not the case, it's often easier to pad your array storage to a multiple of the vector width anyway, so there's padding at the end of each row and you can just use the simple j < N; j += 4 loop condition. You'll want to keep track of the real N size separately from the storage layout, with a row stride that's a multiple of 4 or 8.
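A minimal sketch of that padding idea (the names here are illustrative, not from the answer):

#include <stdlib.h>

// Allocate an n x n matrix whose row stride is rounded up to a multiple of 4 ints,
// so full-vector loads/stores at the end of each row stay inside the allocation
// even when n isn't a multiple of 4.
int *alloc_padded_matrix(int n, int *stride_out) {
    int stride = (n + 3) & ~3;                         // round up to a multiple of 4
    *stride_out = stride;
    return calloc((size_t)n * stride, sizeof(int));    // zeroed; rows are stride ints apart
}
// Element (i,j) lives at m[i*stride + j]; the vector loop keeps the simple
//   for (int j = 0; j < n; j += 4)
// condition because the row padding keeps the final vector access in bounds.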

Otherwise you want a loop condition like j < N-3; j += 4, and a scalar cleanup for the end of a row.

Or masking, or keeping the last full vector in a register so you can _mm_alignr_epi8 with a maybe-overlapping final vector that ends at the end of the row, and maybe do a vector store. This is easier with AVX, or especially AVX-512 masking.
