
Multiply rows of the matrix by a vector (low-level optimization)?

I'm optimizing a function and I want to get rid of a slow for loop. I'm looking for a faster way to multiply each row of a matrix element-wise by a vector.

I'm not looking for a 'classical' multiplication.

E.g. I have a matrix with 1024 columns and 20 rows, and a vector of length 1024. As a result, I want a 20 x 1024 matrix in which each row has been multiplied element-wise by the vector.

What I am doing now: I iterate over the matrix rows in a for loop and use MKL's v?Mul to perform an element-by-element multiplication of the current matrix row and the vector. Any ideas how to improve this?

This question is a duplicate of Multiply rows of matrix by vector?, but for C++ with possible low-level optimizations and MKL, not for R.

Using the Eigen matrix library, what you are doing is essentially multiplying by a diagonal matrix. If you have a matrix with arbitrarily many rows and 20 columns, you can write the following (hardly worth wrapping in a function):

void multRows(Eigen::Matrix<double, Eigen::Dynamic, 20>& mat,
              const Eigen::Matrix<double,20,1>& vect)
{
    mat = mat * vect.asDiagonal();
}

Eigen does generate AVX2 code if it is enabled by the compiler. You may want to experiment with whether storing mat row-major or column-major is more efficient in your use case.

Addendum (due to the edited question): if you have (much) more than 20 columns, you should just use dynamically sized matrices altogether:

void multRows(Eigen::MatrixXd& mat, const Eigen::VectorXd& vect)
{
    mat = mat * vect.asDiagonal();
}

Most recent processors support AVX, which provides 256-bit registers, each holding four doubles. Thus, one option for this optimization is to use AVX directly. I implemented it using the x86intrin.h header that ships with GCC, and I also used OpenMP to make the solution multi-threaded.

//gcc -Wall  -fopenmp -O2 -march=native -o "MatrixVectorMultiplication" "MatrixVectorMultiplication.c" 
//gcc 7.2, Skylake Corei7-6700 HQ
//The performance improvement is significant (5232 cycles on my machine), but MKL was not available to test against
#include <stdio.h>
#include <x86intrin.h>
double A[20][1024] __attribute__(( aligned(32))) = {{1.0, 2.0, 3.0, 3.5, 1.0, 2.0, 3.0, 3.5}, {4.0, 5.0, 6.0, 6.5,4.0, 5.0, 6.0, 6.5},{7.0, 8.0, 9.0, 9.5, 4.0, 5.0, 6.0, 6.5 }};//The 32 is for 256-bit registers of AVX
double B[1024]  __attribute__(( aligned(32))) = {2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0 }; //the vector
double C[20][1024] __attribute__(( aligned(32)));//the results are stored here

int main()
{
    int i,j;
    __m256d vec_C1, vec_C2, vec_C3, vec_C4;

    //begin_rdtsc
    //get the start time here
    #pragma omp parallel for
    for(i=0; i<20;i++){
        for(j=0; j<1024; j+=16){

            vec_C1 = _mm256_mul_pd(_mm256_load_pd(&A[i][j]), _mm256_load_pd(&B[j]));
            _mm256_store_pd(&C[i][j], vec_C1);

            vec_C2 = _mm256_mul_pd(_mm256_load_pd(&A[i][j+4]), _mm256_load_pd(&B[j+4]));
            _mm256_store_pd(&C[i][j+4], vec_C2);

            vec_C3 = _mm256_mul_pd(_mm256_load_pd(&A[i][j+8]), _mm256_load_pd(&B[j+8]));
            _mm256_store_pd(&C[i][j+8], vec_C3);

            vec_C4 = _mm256_mul_pd(_mm256_load_pd(&A[i][j+12]), _mm256_load_pd(&B[j+12]));
            _mm256_store_pd(&C[i][j+12], vec_C4);

        }
    }
    //end_rdtsc
    //calculate the elapsed time

    //print the results
    for(i=0; i<20;i++){
        for(j=0; j<1024; j++){
            //printf(" %lf", C[i][j]);
        }
        //printf("\n");
    }

    return 0;
}
