Speed up matrix-matrix multiplication using SSE vector instructions

I am having trouble vectorizing some C code using SSE vector instructions. The code I have to vectorize is:

#define N 1000
void matrix_mul(int mat1[N][N], int mat2[N][N], int result[N][N])
{
   int i, j, k;
   for (i = 0; i < N; ++i)
      for (j = 0; j < N; ++j)
         for (k = 0; k < N; ++k)
            result[i][k] += mat1[i][j] * mat2[j][k];
}

Here is what I have so far:

void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
   int i, j, k; int* l;
   __m128i v1, v2, v3;
   v3 = _mm_setzero_si128();
   for (i = 0; i < N; ++i)
   {
       for (j = 0; j < N; j += 4)
       {
           for (k = 0; k < N; k += 4)
           {
               v1 = _mm_set1_epi32(mat1[i][j]);
               v2 = _mm_loadu_si128((__m128i*)&mat2[j][k]);
               v3 = _mm_add_epi32(v3, _mm_mul_epi32(v1, v2));
               _mm_storeu_si128((__m128i*)&result[i][k], v3);
               v3 = _mm_setzero_si128();
           }
       }
   }
}

After execution I get the wrong result. I believe the reason is the load from memory into v2. I loop through mat1 in row-major order, so I need to load mat2[0][0], mat2[1][0], mat2[2][0], mat2[3][0], ... but what is actually loaded is mat2[0][0], mat2[0][1], mat2[0][2], mat2[0][3], ... because mat2 is stored in memory in row-major order. I tried to fix this problem but without any improvement. Can anyone help me, please?

Below is a fixed version of your implementation:

#include <smmintrin.h> /* SSE4.1, needed for _mm_mullo_epi32 */

void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
   int i, j, k;
   __m128i v1, v2, v3, v4;
   for (i = 0; i < N; ++i)
   {
       for (j = 0; j < N; ++j) // 'j' must be incremented by 1
       {
           // read mat1 here because it does not use the 'k' index
           v1 = _mm_set1_epi32(mat1[i][j]);
           for (k = 0; k < N; k += 4) // assumes N is a multiple of 4
           {
               v2 = _mm_loadu_si128((const __m128i*)&mat2[j][k]);

               // read what's in the result array first, as we need to add it to our calculation
               v3 = _mm_loadu_si128((const __m128i*)&result[i][k]);

               // use _mm_mullo_epi32 here instead of _mm_mul_epi32 and add it to the previous result
               v4 = _mm_add_epi32(v3, _mm_mullo_epi32(v1, v2));

               // store the result
               _mm_storeu_si128((__m128i*)&result[i][k], v4);
           }
       }
   }
}

In short, _mm_mullo_epi32 (requires SSE4.1) produces 4 x int32 results, as opposed to _mm_mul_epi32, which produces 2 x int64 results. If you cannot use SSE4.1, have a look at the answer here for an alternative SSE2 solution.
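For readers who cannot rely on SSE4.1, one commonly used SSE2 workaround builds the four low 32-bit products from two _mm_mul_epu32 calls. The sketch below is not taken from the linked answer: the names mullo_epi32_sse2 and row_axpy are hypothetical, and a scalar fallback is assumed when the compiler does not define __SSE2__.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>

/* Hypothetical helper: low 32 bits of each of the four 32-bit products,
   using SSE2 only. The low half of a product is the same for signed and
   unsigned inputs, so the unsigned multiply _mm_mul_epu32 can be used. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                 /* lanes 0 and 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4)); /* lanes 1 and 3 */
    /* Gather the low dwords of each 64-bit product and interleave them. */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}
#endif

/* Hypothetical kernel: dst[k] += s * src[k] for k in [0, n).
   This corresponds to the inner 'k' loop of the fixed implementation,
   with dst = result[i], src = mat2[j] and s = mat1[i][j]. */
static void row_axpy(int *dst, const int *src, int s, size_t n)
{
    size_t k = 0;
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    __m128i vs = _mm_set1_epi32(s);
    for (; k + 4 <= n; k += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + k));
        __m128i r = _mm_loadu_si128((const __m128i *)(dst + k));
        r = _mm_add_epi32(r, mullo_epi32_sse2(vs, v));
        _mm_storeu_si128((__m128i *)(dst + k), r);
    }
#endif
    for (; k < n; ++k)      /* scalar tail, or the whole loop without SSE2 */
        dst[k] += s * src[k];
}
```

Calling row_axpy(result[i], mat2[j], mat1[i][j], N) for every i and j would reproduce the fixed loop nest, and the scalar tail also covers sizes that are not a multiple of 4.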

Full descriptions from the Intel Intrinsics Guide:

_mm_mullo_epi32: Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.

_mm_mul_epi32: Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.
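To make the difference between the two descriptions concrete, here is a plain scalar model of each intrinsic. It is only an illustration of the semantics quoted above, not Intel's implementation; the function names model_mullo_epi32 and model_mul_epi32 are made up for this sketch.

```c
#include <stdint.h>

/* Scalar model of _mm_mullo_epi32: four 32-bit lanes in, four 32-bit
   lanes out, each holding the low 32 bits of the 64-bit product. */
static void model_mullo_epi32(const int32_t a[4], const int32_t b[4],
                              int32_t dst[4])
{
    for (int i = 0; i < 4; ++i) {
        int64_t wide = (int64_t)a[i] * (int64_t)b[i];
        dst[i] = (int32_t)(uint32_t)wide;   /* keep only the low 32 bits */
    }
}

/* Scalar model of _mm_mul_epi32: only lanes 0 and 2 (the low half of
   each 64-bit element) are read, and two full 64-bit products result. */
static void model_mul_epi32(const int32_t a[4], const int32_t b[4],
                            int64_t dst[2])
{
    dst[0] = (int64_t)a[0] * (int64_t)b[0];
    dst[1] = (int64_t)a[2] * (int64_t)b[2];
}
```

With a = {100000, 2, 3, 4} and b = {100000, 5, 6, 7}, the mullo model gives {1410065408, 10, 18, 28} (the first lane has wrapped around), while the mul model gives {10000000000, 18} and never looks at lanes 1 and 3. That is why the original code, which stored the _mm_mul_epi32 output as four ints, produced wrong values.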

I changed your code around a bit to make the addressing explicit [it helps in this case].

#define N 100

This is a stub for the vector unit's multiply & accumulate operation; you should be able to replace NV with whatever width your vector unit has, and put the relevant opcodes in here.

#define NV 8
int Vmacc(int *A, int *B) {
    int i = 0;
    int x = 0;
    for (i = 0; i < NV; i++) {
        x += *A++ * *B++;
    }
    return x;
}

This multiply has two notable variations from the norm:
1. It caches the column vector into a contiguous one.
2. It attempts to push slices of the multiply-accumulate into a vector-like function.
Even without using the vector unit, this takes half the time of the naive version, just because of better cache/prefetch utilization.

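/* A, B and C are flat row-major n x n matrices; requires n <= N. */
void mm2(int *A, int *B, int n, int *C) {
    int c, r;
    int cache[N];
    for (c = 0; c < n; c++) {
        /* cache column c of B: */
        for (r = 0; r < n; r++) {
            cache[r] = B[c + r*n];
        }
        for (r = 0; r < n; r++) {
            int k = 0;
            int x = 0;
            int *Av = A + r*n;
            /* full NV-wide slices go through the vector stub */
            for (k = 0; k+NV-1 < n; k += NV) {
                x += Vmacc(Av+k, cache+k);
            }
            /* leftover elements when n is not a multiple of NV */
            while (k < n) {
                x += Av[k] * cache[k];
                k++;
            }
            C[r*n + c] = x;
        }
    }
}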
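A rewrite like this is easy to get subtly wrong, so it may be worth checking the column-cached version against a plain triple loop on a small input. The following is a minimal sketch of such a check, not part of the original answer: mm_reference and mm_colcache are hypothetical names, and the column buffer is allocated with malloc here (an assumption of this sketch) so that n is not limited by the compile-time N.

```c
#include <assert.h>
#include <stdlib.h>

/* Reference: plain triple loop on flat row-major n x n matrices. */
static void mm_reference(const int *A, const int *B, int n, int *C)
{
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            int x = 0;
            for (int k = 0; k < n; ++k)
                x += A[r*n + k] * B[k*n + c];
            C[r*n + c] = x;
        }
}

/* Column-cached variant: copy column c of B into a contiguous buffer,
   then take dot products with each row of A. Returns 0 on success,
   -1 if the buffer cannot be allocated. */
static int mm_colcache(const int *A, const int *B, int n, int *C)
{
    assert(n > 0);
    int *col = malloc((size_t)n * sizeof *col);
    if (col == NULL)
        return -1;
    for (int c = 0; c < n; ++c) {
        for (int r = 0; r < n; ++r)
            col[r] = B[r*n + c];          /* strided read, done once per column */
        for (int r = 0; r < n; ++r) {
            const int *row = A + r*n;
            int x = 0;
            for (int k = 0; k < n; ++k)   /* both operands are now contiguous */
                x += row[k] * col[k];
            C[r*n + c] = x;
        }
    }
    free(col);
    return 0;
}
```

Running both functions on the same A and B and comparing the two outputs element by element gives a quick regression check before swapping a real vector kernel into the inner loop.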
