
Vectorizing addition part of matrix multiplication using intrinsics?

I'm trying to vectorize matrix multiplication using blocking and vector intrinsics. It seems to me that the addition part in the vector multiplication cannot be vectorized. Could you please see whether my code can be improved to vectorize further?

    double dd[4], bb[4];
    __m256d op_a, op_b, op_d;
    for(i = 0; i < num_blocks; i++){
        for(j = 0; j < num_blocks; j++){
            for(k = 0; k < num_blocks; k++){
                for(ii = 0; ii < block_size ; ii++){
                    for(kk = 0; kk < block_size; kk++){
                        for(jj = 0; jj < block_size ; jj+=4){

                            aoffset=n*(i*block_size+ii)+j*block_size +jj ;
                            boffset=n*(j*block_size+jj)+k*block_size +kk;
                            coffset=n*(i*block_size+ii)+ k*block_size + kk;

                            bb[0]=b[n*(j*block_size+jj)+k*block_size +kk];
                            bb[1]=b[n*(j*block_size+jj+1)+k*block_size +kk];
                            bb[2]=b[n*(j*block_size+jj+2)+k*block_size +kk];
                            bb[3]=b[n*(j*block_size+jj+3)+k*block_size +kk];

                            op_a = _mm256_loadu_pd (a+aoffset);
                            op_b= _mm256_loadu_pd (bb);
                            op_d = _mm256_mul_pd(op_a, op_b);
                            _mm256_storeu_pd (dd, op_d);
                            c[coffset]+=(dd[0]+dd[1]+dd[2]+dd[3]);

                        }
                    }
                }
            }
        }
    }

Thanks.

You can use this version of the matrix multiplication (c[i,j] = a[i,k]*b[k,j]) algorithm (scalar version):

for(int i = 0; i < i_size; ++i)
{
    for(int j = 0; j < j_size; ++j)
         c[i][j] = 0;

    for(int k = 0; k < k_size; ++k)
    {
         double aa = a[i][k];
         for(int j = 0; j < j_size; ++j)
             c[i][j] += aa*b[k][j];
    }
}

And the vectorized version:

for(int i = 0; i < i_size; ++i)
{
    for(int j = 0; j < j_size; j += 4)
         _mm256_store_pd(c[i] + j, _mm256_setzero_pd());

    for(int k = 0; k < k_size; ++k)
    {
         __m256d aa = _mm256_set1_pd(a[i][k]);
         for(int j = 0; j < j_size; j += 4)
         {
             _mm256_store_pd(c[i] + j, _mm256_add_pd(_mm256_load_pd(c[i] + j), _mm256_mul_pd(aa, _mm256_load_pd(b[k] + j))));
         }
    }
}
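Note that the vectorized version uses the aligned intrinsics (`_mm256_load_pd` / `_mm256_store_pd`), so each row of `b` and `c` must start at a 32-byte boundary and `j_size` must be a multiple of 4; otherwise the `loadu`/`storeu` forms are needed. A minimal sketch of allocating suitably aligned row storage with C11 `aligned_alloc` (the helper name `alloc_row` is mine):

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate one row of j_size doubles at a 32-byte boundary so that
   _mm256_load_pd / _mm256_store_pd may be used on it.
   aligned_alloc requires the size to be a multiple of the alignment,
   so round j_size * sizeof(double) up to a multiple of 32. */
double *alloc_row(size_t j_size) {
    size_t bytes = j_size * sizeof(double);
    bytes = (bytes + 31) & ~(size_t)31;
    return aligned_alloc(32, bytes);
}
```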

"Horizontal add" is a more recent addition to the SSE instruction set (the haddpd instruction arrived with SSE3), so you can't use the accelerated version if compatibility with many different processors is your goal.

However, you definitely can vectorize the additions. Note that the inner loop only affects a single coffset. You should move the coffset calculation outward (the compiler will do this automatically, but the code is more readable if you do), and also keep four partial sums in a vector accumulator in the innermost loop, performing the horizontal add only once per coffset. This is an improvement even if a vector horizontal add is used, and for the scalar horizontal add it's a big one.

Something like:

for(kk = 0; kk < block_size; kk++){
    op_e = _mm256_setzero_pd();   /* vector accumulator: four partial sums */

    for(jj = 0; jj < block_size; jj += 4){
        aoffset = n*(i*block_size+ii) + j*block_size + jj;

        bb[0] = b[n*(j*block_size+jj  ) + k*block_size + kk];
        bb[1] = b[n*(j*block_size+jj+1) + k*block_size + kk];
        bb[2] = b[n*(j*block_size+jj+2) + k*block_size + kk];
        bb[3] = b[n*(j*block_size+jj+3) + k*block_size + kk];

        op_a = _mm256_loadu_pd(a + aoffset);
        op_b = _mm256_loadu_pd(bb);
        op_d = _mm256_mul_pd(op_a, op_b);
        op_e = _mm256_add_pd(op_e, op_d);
    }
    _mm256_storeu_pd(dd, op_e);
    coffset = n*(i*block_size+ii) + k*block_size + kk;
    c[coffset] += dd[0] + dd[1] + dd[2] + dd[3];  /* += : c accumulates across j blocks */
}

You can also speed this up by doing a transpose of b beforehand, instead of gathering the vector inside the innermost loop.
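A sketch of that idea, assuming b is an n×n row-major array (the helper name `transpose` and the buffer `bt` are mine):

```c
#include <stddef.h>

/* Copy row-major b (n x n) into bt with bt[col*n + row] = b[row*n + col].
   After this, the four gathered elements bb[0..3] above sit contiguously,
   so the gather becomes a single unaligned load:
     op_b = _mm256_loadu_pd(bt + n*(k*block_size+kk) + j*block_size + jj);
*/
void transpose(const double *b, double *bt, size_t n) {
    for (size_t r = 0; r < n; ++r)
        for (size_t c = 0; c < n; ++c)
            bt[c*n + r] = b[r*n + c];
}
```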
