
Optimize matrix multiplication in a for loop with RcppArmadillo

The aim is to implement a fast version of orthogonal projective non-negative matrix factorization (OPNMF) in R. I am translating the MATLAB code available here.

I implemented a vanilla R version, but it is much slower (about 5.5x) than the MATLAB implementation on my data (~225000 x 150) for a 20-factor solution.
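For reference, the core of that R version is the same multiplicative update used in the C++ code below. A minimal sketch of it (the function name r_opnmf and the exact loop bookkeeping are mine for illustration, not the precise code I benchmarked):

# Minimal sketch of the plain-R OPNMF update loop (illustrative, not the benchmarked code)
# X: data matrix (~225000 x 150), W0: non-negative initialisation
r_opnmf <- function(X, W0, tol = 1e-5, maxiter = 10000, eps = 1e-16) {
  W <- W0
  for (i in seq_len(maxiter)) {
    Wold <- W
    XXW  <- X %*% crossprod(X, W)                 # X X' W
    W    <- W * XXW / (W %*% crossprod(W, XXW))   # multiplicative update
    W[W < eps] <- eps                             # keep entries strictly positive
    W <- W / norm(W, "2")                         # normalise by the spectral norm
    diffW <- norm(Wold - W, "F") / norm(Wold, "F")
    if (diffW < tol) break
  }
  list(W = W, iter = i, diffW = diffW)
}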

So I thought using C++ might speed things up, but its speed is similar to the R version. I think this can be optimized, but I am not sure how, as I am a newbie to C++. Here is a thread that discusses a similar problem.

Here is my RcppArmadillo implementation.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
Rcpp::List arma_opnmf(const arma::mat & X, const arma::mat & W0, double tol=0.00001, int maxiter=10000, double eps=1e-16) {
  arma::mat W = W0;
  arma::mat Wold = W;
  arma::mat XXW = X * (X.t()*W);
  double diffW = 9999999999.9;
  
  Rcout << "The value of maxiter : " << maxiter << "\n";
  Rcout << "The value of tol : " << tol << "\n";
  
  int i;  // declared outside the loop so the iteration count can be returned
  for (i = 0; i < maxiter; i++) {
    // Multiplicative OPNMF update: W <- W % (X X' W) / (W (W' X X' W))
    XXW = X * (X.t()*W);
    W = W % XXW / (W  * (W.t() * XXW));
    //W = W % (X*(X.t()*W)) / (W*((W.t()*X)*(X.t()*W)));
    
    // Clamp tiny entries to eps (keeps W strictly positive), then normalise W
    arma::uvec idx = find(W < eps);
    W.elem(idx).fill(eps);
    W = W / norm(W,2);
    // Relative change of W in the Frobenius norm as the convergence criterion
    diffW = norm(Wold-W, "fro") / norm(Wold, "fro");
    if(diffW < tol) {
      break;
    } else {
      Wold = W;
    }
    
    if(i % 10 == 0) {
      Rcpp::checkUserInterrupt();
    }
    
  }
  return Rcpp::List::create(Rcpp::Named("W")=W,
                            Rcpp::Named("iter")=i,
                            Rcpp::Named("diffW")=diffW);
}
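I compile and call it from R via Rcpp::sourceCpp(). A small usage sketch (the file name opnmf.cpp and the random data and initialisation are placeholders for illustration, not my actual data):

library(Rcpp)
Rcpp::sourceCpp("opnmf.cpp")   # file containing the arma_opnmf() code above

set.seed(1)
X  <- matrix(runif(1000 * 50), 1000, 50)           # small stand-in for the real ~225000 x 150 data
W0 <- matrix(runif(nrow(X) * 20), nrow(X), 20)     # 20-factor non-negative initialisation
res <- arma_opnmf(X, W0, tol = 1e-5, maxiter = 10000)
str(res)   # list with W, iter and diffW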

This suggested issue confirms that MATLAB is quite fast, so is there no hope when using R/C++?

The tests were run on Windows 10 and Ubuntu 16.04 with R version 4.0.0.

EDIT

Following the interesting comments in the answer below, I am posting additional details. I ran tests on a Windows 10 machine with R 3.5.3 (as that is what Microsoft R Open provides); the comparison shows that RcppArmadillo with Microsoft R Open is fastest.

R

   user  system elapsed 
 213.76    7.36  221.42 

R with RcppArmadillo

   user  system elapsed 
 179.88    3.44  183.43 

Microsoft R Open

   user  system elapsed 
 167.33    9.96   45.94 

Microsoft R Open with RcppArmadillo

    user  system elapsed 
  85.47    4.66   23.56 

Are you aware that this code is "ultimately" executed by a pair of libraries called LAPACK and BLAS?

Are you aware that Matlab ships with a highly optimised one? Are you aware that on all systems that R runs on you can change which LAPACK/BLAS is being used?
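For example, a quick way to check which BLAS/LAPACK a given R session is actually using (a minimal sketch; the exact output differs per system, and the library paths only appear in sessionInfo() on R >= 3.4):

sessionInfo()   # the "Matrix products", BLAS and LAPACK lines show the libraries in use
La_version()    # version of the LAPACK library R is linked against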

The difference matters greatly. Just this morning a friend posted this tweet contrasting the same R code running on the same Windows computer, but in two different R environments. The six-times-faster one "simply" uses a parallel LAPACK/BLAS implementation.

Here, you haven't even told us which operating system you are on. You can get OpenBLAS (which uses parallelism) for all OSs that R runs on. You can even get the Intel MKL (which IIRC is what Matlab uses too) fairly easily on some OSs. For Ubuntu/Debian I published a script on GitHub that does it in one step.

Lastly, many years ago I "inherited" a fast program running in Matlab on a (then-large-ish) Windows computer. I rewrote the Matlab part (carefully and slowly, it is real effort) in C++ using RcppArmadillo, leading to a few factors of improvement -- and because we could run that (now open-source) code in parallel from R on the same computer, another few factors. Together it was orders of magnitude, turning a day-long simulation into something that ran in a few minutes. So "yes, you can".

Edit: As you have access to Ubuntu, you can switch from basic LAPACK/BLAS to OpenBLAS via a single command, though I am no longer that familiar with Ubuntu 16.04 (as I run 20.04 myself).

Edit 2: Picking up the comparison from Josef's tweet, the Docker r-base container I also maintain (as part of the Rocker Project) can use OpenBLAS. So once we add it, e.g. via apt-get install libopenblas-dev, the timing of a simple repeated matrix crossproduct moves from

root@0eb44b1fcc06:/# Rscript -e 'v <- matrix(1:1e6,1e3); system.time(replicate(10, crossprod(v,v)))'
   user  system elapsed 
  9.289   0.084   9.373 
root@0eb44b1fcc06:/# 

to

root@67bd334f53d4:/# Rscript -e 'v <- matrix(1:1e6,1e3); system.time(replicate(10, crossprod(v,v)))'
   user  system elapsed 
  2.259   2.370   0.447 
root@67bd334f53d4:/#

which is substantial.
