
Speed up matrix multiplication with SSE2

I want to know how to speed up matrix multiplication with SSE2.

Here is my code:

int mat_mult_simd(double *a, double *b, double *c, int n)
{
   __m128d c1, c2, a1, a2, b1;

   /* Each iteration computes a 2x2 tile of c: rows 2*j and 2*j+1,
      columns 2*i and 2*i+1. c must be zero-initialized, because the
      products are accumulated into the loaded values. */
   for(int i=0; i<n/2; i++){
      for(int j=0; j<n/2; j++){
          c1 = _mm_load_pd(c+(2*j*n)+(i*2));
          c2 = _mm_load_pd(c+n+(2*j*n)+(i*2));
          for(int k=0; k<n; k++){
             a1 = _mm_load1_pd(a+k+(2*j*n));    /* broadcast a[2j][k]   */
             a2 = _mm_load1_pd(a+n+k+(2*j*n));  /* broadcast a[2j+1][k] */
             b1 = _mm_load_pd(b+(k*n)+(i*2));   /* b[k][2i], b[k][2i+1] */
             c1 = _mm_add_pd(c1, _mm_mul_pd(a1,b1));
             c2 = _mm_add_pd(c2, _mm_mul_pd(a2,b1));
          }
          _mm_store_pd(c+(2*j*n)+(i*2), c1);
          _mm_store_pd(c+n+(2*j*n)+(i*2), c2);
      }
   }
   return 0;
}

The parameters mean:

'a' = matrix a (MAT_SIZE*MAT_SIZE doubles)

'b' = matrix b (MAT_SIZE*MAT_SIZE doubles)

'c' = matrix c (MAT_SIZE*MAT_SIZE doubles)

'n' = MAT_SIZE, a constant (always even and >= 2)

This code is about 4x faster than:

int mat_mult_default(double *a, double *b, double *c, int n)
{
   double t;
   for(int i=0; i<n; i++){
      for(int j=0; j<n; j++){
         t = 0.0;
         for(int k=0; k<n; k++)
            t += a[i*n+k] * b[k*n+j];
         c[i*n+j] = t;
      }
   }
   return 0;
}

But I want to speed it up further. I usually test with MAT_SIZE of 1000x1000 or 2000x2000. How can I make it faster? Is there a better way to do the indexing? I really want to know. Thanks.

You can do a few things. The obvious one is splitting the work into several threads (one per core). You can use OpenMP (easiest), Intel TBB, or another multithreading library. This will provide a significant improvement on a multi-core machine.
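As a minimal sketch of the OpenMP route: the rows of c are independent, so a single pragma on the outer loop of the scalar version splits them across threads. The function name mat_mult_omp is mine, not from the question; compile with -fopenmp (GCC/Clang) or /openmp (MSVC), and without that flag the pragma is simply ignored and the code runs serially.

```c
#include <assert.h>

/* Scalar multiply with the outer loop split across threads.
   Each thread gets its own range of rows i, so no two threads
   ever write the same element of c. */
void mat_mult_omp(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double t = 0.0;
            for (int k = 0; k < n; k++)
                t += a[i*n + k] * b[k*n + j];
            c[i*n + j] = t;
        }
    }
}
```

The SIMD version can be parallelized the same way by putting the pragma on its outer i loop, since each (i, j) pair writes a disjoint 2x2 tile of c.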

Another thing is to look at the disassembly (via your favorite debugger) and see how the compiler handles all the multiplications you use for the indexes; some of them can be eliminated.
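To illustrate eliminating index multiplications: the row bases i*n and the column stride k*n can be hoisted out of the inner loops and replaced with pointers that advance additively. A sketch of the scalar version rewritten this way (the name mat_mult_ptr is mine; a good optimizer often does this strength reduction itself, which is what checking the disassembly would confirm):

```c
#include <assert.h>

/* Same scalar multiply, but a + i*n and c + i*n are computed once
   per row, and column j of b is walked with a pointer that steps
   by n instead of recomputing k*n + j every iteration. */
void mat_mult_ptr(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        const double *ai = a + i * n;   /* row i of a */
        double *ci = c + i * n;         /* row i of c */
        for (int j = 0; j < n; j++) {
            double t = 0.0;
            const double *bk = b + j;   /* b[0][j] */
            for (int k = 0; k < n; k++) {
                t += ai[k] * *bk;
                bk += n;                /* next row, same column */
            }
            ci[j] = t;
        }
    }
}
```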

Your code computes two results per loop iteration; try doing four or eight to get better locality. For example, a1 and a2 could be processed together with their neighbors, which are already in the L1 cache, and you can actually fetch them with a single load operation.
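To make the single-load idea concrete, here is a sketch (my own variant, not the questioner's code) that unrolls the k loop by two: one unaligned load fetches a[i][k] and a[i][k+1] together, and each element is then broadcast with an unpack shuffle instead of two separate _mm_load1_pd calls. It relies on n being even, as the question guarantees.

```c
#include <emmintrin.h>
#include <assert.h>

/* Two columns of c per j iteration, two k steps per inner iteration.
   One _mm_loadu_pd fetches a[i][k] and a[i][k+1]; the unpacks
   broadcast each half across the register. */
void mat_mult_sse_unroll(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 2) {
            __m128d acc = _mm_setzero_pd();
            for (int k = 0; k < n; k += 2) {
                __m128d apair = _mm_loadu_pd(a + i*n + k);      /* {a[i][k], a[i][k+1]} */
                __m128d a0 = _mm_unpacklo_pd(apair, apair);     /* {a[i][k],   a[i][k]}   */
                __m128d a1 = _mm_unpackhi_pd(apair, apair);     /* {a[i][k+1], a[i][k+1]} */
                __m128d b0 = _mm_loadu_pd(b + k*n     + j);
                __m128d b1 = _mm_loadu_pd(b + (k+1)*n + j);
                acc = _mm_add_pd(acc, _mm_mul_pd(a0, b0));
                acc = _mm_add_pd(acc, _mm_mul_pd(a1, b1));
            }
            _mm_storeu_pd(c + i*n + j, acc);
        }
    }
}
```

The same trick combines with the question's 2x2 tiling: load the a pair once and reuse both broadcasts against two rows of b.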

Make sure the various arrays are SSE-aligned (16 bytes) and change your code to use aligned reads/writes.
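For example, the matrices can come from _mm_malloc (available via the intrinsics headers; C11 aligned_alloc or posix_memalign work too). The helper name alloc_matrix_16 is my own. Because n is even, every row is a multiple of 16 bytes long, so an aligned base keeps every row aligned and the aligned _mm_load_pd/_mm_store_pd forms stay safe:

```c
#include <emmintrin.h>  /* also pulls in _mm_malloc/_mm_free on GCC, Clang, MSVC */
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Allocate an n x n matrix of doubles on a 16-byte boundary.
   Free the result with _mm_free, not free. */
double *alloc_matrix_16(int n)
{
    return (double *)_mm_malloc((size_t)n * (size_t)n * sizeof(double), 16);
}
```

Note that _mm_load_pd and _mm_store_pd fault on an address that is not 16-byte aligned, which is why plain malloc'd buffers are not guaranteed to work with them.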

I'd leave multithreading to the end, as finding bugs in multithreaded code is harder.

Just use the right library, like the Intel Math Kernel Library or a similar highly optimized linear algebra package (OpenBLAS, AMD Core Math Library, ATLAS, ...). They are considerably faster than hand-written code, they often have processor-specific optimizations for instruction sets and cache sizes, and their authors are professionals in this field. Unless you plan to publish a paper on your own optimization, go with a library.

In the latest issue of the German computer magazine c't, they claim the compiler is smart enough to use SSE or AVX by itself: just write the right loops and the auto-vectorizer will produce the best results. This is true for the latest Intel compiler; Microsoft's compiler is not as capable. In some cases, with the right compiler flags, Intel's compiler even detects that you are writing a matrix multiplication and replaces it with the right library call. Otherwise, check the documentation; it is not that hard to learn to use such a package.
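A loop shape that auto-vectorizers generally handle well is one where the inner loop writes c contiguously with no cross-lane reduction, which the k-j loop order below provides. This is my own sketch (function name mat_mult_auto is hypothetical); compile with optimization enabled, e.g. -O3 on GCC/Clang, and inspect the vectorization report (-fopt-info-vec on GCC, -Rpass=loop-vectorize on Clang) to confirm the loop was vectorized:

```c
#include <assert.h>

/* restrict promises the compiler that a, b, and c do not alias,
   which is usually required before it will vectorize the loop.
   The inner j loop reads b and updates c with unit stride. */
void mat_mult_auto(const double *restrict a, const double *restrict b,
                   double *restrict c, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            c[i*n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            double aik = a[i*n + k];
            for (int j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j];
        }
    }
}
```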
