Using Intel oneAPI MKL to perform sparse matrix with dense vector multiplication

Question

I am developing a program, making heavy use of Armadillo library. I have the 10.8.2 version, linked against Intel oneAPI MKL 2022.0.2. At some point, I need to perform many sparse matrix times dense vector multiplications, both of which are defined using Armadillo structures. I have found this point to be a probable bottleneck, and was being curious if replacing the Armadillo multiplication with "bare bones" sparse CBLAS routines from MKL ( mkl_sparse_d_mv ) would speed things up. But in order to do so, I need to convert from Armadillo's SpMat to something that MKL understands. As per Armadillo docs, sparse matrices are stored in CSC format, so I have tried mkl_sparse_d_create_csc . My attempt at this is below:

#include <iostream>
#include <armadillo>
#include "mkl.h"

int main()
{

    arma::umat locations = {{0, 0, 1, 3, 2},{0, 1, 0, 2, 3}};
    // arma::vec vals = {0.5, 2.5, 2.5, 4.5, 4.5};
    arma::vec vals = {0.5, 2.5, 3.5, 4.5, 5.5};
    arma::sp_mat X(locations, vals);
    std::cout << "X = \n" << arma::mat(X) << std::endl;
    arma::vec v = {1,1,1,1};
    arma::vec v2;
    v2.resize(4);

    std::cout << "v = \n" << v << std::endl;
    std::cout << "X * v = \n" << X * v << std::endl;

    MKL_INT *cols_beg = static_cast<MKL_INT *>(mkl_malloc(X.n_cols * sizeof(MKL_INT), 64));
    MKL_INT *cols_end = static_cast<MKL_INT *>(mkl_malloc(X.n_cols * sizeof(MKL_INT), 64));
    MKL_INT *row_idx = static_cast<MKL_INT *>(mkl_malloc(X.n_nonzero * sizeof(MKL_INT), 64));
    double *values = static_cast<double *>(mkl_malloc(X.n_nonzero * sizeof(double), 64));

    for (MKL_INT i = 0; i < X.n_cols; i++)
    {
        cols_beg[i] = static_cast<MKL_INT>(X.col_ptrs[i]);
        cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());
        std::cout << cols_beg[i] << " --- " << cols_end[i] << std::endl;
    }
    std::cout << std::endl;
    for (MKL_INT i = 0; i < X.n_nonzero; i++)
    {
        row_idx[i] = static_cast<MKL_INT>(X.row_indices[i]);
        values[i] = X.values[i];
        std::cout << row_idx[i] << " --- " << values[i] << std::endl;
    }
    std::cout << std::endl;
    sparse_matrix_t X_mkl = NULL;
    sparse_status_t res = mkl_sparse_d_create_csc(&X_mkl, SPARSE_INDEX_BASE_ZERO,
                                                  X.n_rows, X.n_cols, cols_beg, cols_end, row_idx, values);

    if(res == SPARSE_STATUS_SUCCESS) std::cout << "Constructed mkl representation of X" << std::endl;
    matrix_descr dsc;
    dsc.type = SPARSE_MATRIX_TYPE_GENERAL;

    sparse_status_t stat =  mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0, X_mkl, dsc, v.memptr(), 0.0, v2.memptr());
    std::cout << "Multiplication status = " << stat << std::endl;
    if(stat == SPARSE_STATUS_SUCCESS) 
    {
        std::cout << "Calculated X*v via mkl" << std::endl;
        std::cout << v2;
    }

    mkl_free(cols_beg);
    mkl_free(cols_end);
    mkl_free(row_idx);
    mkl_free(values);
    mkl_sparse_destroy(X_mkl);

    return 0;
}

I am compiling this code with (with the help of Link Line Advisor) icpc -g testing.cpp -o intel_testing.out -DARMA_ALLOW_FAKE_GCC -O3 -xhost -Wall -Wextra -L${MKLROOT}/lib/intel64 -liomp5 -lpthread -lm -DMKL_ILP64 -qmkl=parallel -larmadillo on Pop._OS 21.10. It compiles and runs without any problems: The output is as follows:

X = 
   0.5000   2.5000        0        0
   3.5000        0        0        0
        0        0        0   5.5000
        0        0   4.5000        0

v = 
   1.0000
   1.0000
   1.0000
   1.0000

X * v = 
   3.0000
   3.5000
   5.5000
   4.5000

0 --- 1
2 --- 2
3 --- 3
4 --- 4

0 --- 0.5
1 --- 3.5
0 --- 2.5
3 --- 4.5
2 --- 5.5

Constructed mkl representation of X
Multiplication status = 0
Calculated X*v via mkl
   0.5000
        0
        0
        0

As we can see, the result of Armadillo's multiplication is correct, whereas the one from MKL is wrong. My question is this: Am I making a mistake somewhere? Or is there something wrong with MKL? . I suspect the former of course, but after spending considerable amount of time, cannot find anything. Any help would be much appreciated!

EDIT

As CJR and Vidyalatha_Intel suggested, I have changed col_end to

 cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());

The result is now

X = 
   0.5000   2.5000        0        0
   3.5000        0        0        0
        0        0        0   5.5000
        0        0   4.5000        0

v = 
   1.0000
   1.0000
   1.0000
   1.0000

X * v = 
   3.0000
   3.5000
   5.5000
   4.5000

0 --- 2
2 --- 3
3 --- 4
4 --- 5

0 --- 0.5
1 --- 3.5
0 --- 2.5
3 --- 4.5
2 --- 5.5

Constructed mkl representation of X
Multiplication status = 0
Calculated X*v via mkl
   4.0000
   2.5000
        0
        0

col_end is indeed 2,3,4,5 as suggested, but the result is still wrong.

Answer 1

Yes, the cols_end array is incorrect as pointed out by CJR. They should be indexed as 2,3,4,5. Please see the documentation regarding the parameter to the function mkl_sparse_d_create_csc

cols_end:

This array contains col indices, such that cols_end[i] - ind - 1 is the last index of col i in the arrays values and row_indx. ind takes 0 for zero-based indexing and 1 for one-based indexing.

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/inspector-executor-sparse-blas-routines/matrix-manipulation-routines/mkl-sparse-create-csc.html

Change this line

cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());

to

cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());

Now recompile and run the code. I've tested it and it is showing the correct results. Image with results and compilation command

Using Intel oneAPI MKL to perform sparse matrix with dense vector multiplication

Question

EDIT

1 answers

solution1
1 ACCPTED 2022-03-14 10:20:03

Using Intel oneAPI MKL to perform sparse matrix with dense vector multiplication

Question

EDIT

1 answers

solution1 1 ACCPTED 2022-03-14 10:20:03

solution1
1 ACCPTED 2022-03-14 10:20:03