
R performance on Ryzen+Ubuntu: openBLAS/MKL, Rcpp and other improvements?

I'm trying to get the most out of a Ryzen 9 3950X 16-core machine on Ubuntu 20.04, running Microsoft R Open 3.5.2 with Intel MKL, with the MKL_DEBUG_CPU_TYPE=5 workaround applied via Sys.setenv() before the Rcpp code is compiled and run.

These are the main operations I'd like to optimize for:

  1. Fast multivariate random normal (for which I use the Armadillo version):
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]
// Draw n samples from N(mu, sigma) via the Cholesky factor of sigma
arma::mat mvrnormArma(int n, arma::vec mu, arma::mat sigma) {
  int ncols = sigma.n_cols;
  arma::mat Y = arma::randn(n, ncols);  // n x ncols standard normals
  return arma::repmat(mu, 1, n).t() + Y * arma::chol(sigma);
}
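For quick sanity checks against the Rcpp version, the same Cholesky-based sampler can be written in a few lines of base R (a sketch; `%*%` and `chol()` go through the same linked BLAS/LAPACK, so it exercises the same backend):

```r
# Base-R counterpart of mvrnormArma: n draws from N(mu, sigma).
mvrnorm_base <- function(n, mu, sigma) {
  ncols <- ncol(sigma)
  Y <- matrix(rnorm(n * ncols), nrow = n, ncol = ncols)  # standard normals
  sweep(Y %*% chol(sigma), 2, mu, "+")                   # scale, then shift by mu
}
```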

  2. Fast SVD (I found that base::svd performs better than any Rcpp implementation I've tried so far, including arma::svd with the "dc" divide-and-conquer method, probably due to the different U, S, V dimensions returned).

  3. Fast matrix multiplication for various results (I found code written in C, rewrote all of it in base R, and am seeing vast improvements thanks to multicore BLAS versus the previous single-core performance. Can base R matrix operations be improved further?)
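On the base-R side, one thing that helps regardless of backend is choosing the BLAS-friendly primitive: `crossprod(A)` computes t(A) %*% A in a single symmetric rank-k BLAS call (dsyrk) rather than an explicit transpose followed by dgemm. A minimal comparison sketch:

```r
A <- matrix(rnorm(2000 * 2000), 2000)
system.time(B1 <- t(A) %*% A)    # explicit transpose + dgemm
system.time(B2 <- crossprod(A))  # single dsyrk call, no copy of t(A)
all.equal(B1, B2)                # same result either way
```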


I've tried various setups with R 4.0.2 and OpenBLAS (through the ropenblas package), played with various Intel MKL releases, and read about AMD's BLIS and libflame (which I don't even know how to test with R).

Overall, this setup outperforms a laptop with an i7-8750H running Microsoft R Open 3.5.1 (with working MKL) by around 2x. Given 16 cores versus 6 (plus faster RAM), I was expecting at least a 3-3.5x improvement (based on, e.g., Cinebench and similar benchmarks).

How can this setup be further improved?

My main issues/questions:

First, I've noticed that the current setup, when run with 1 worker, uses around 1000-1200% CPU according to top. Through experimentation, I've found that spawning two parallel workers uses most of the CPU, around 85-95%, and delivers the best performance. Three workers, for example, use the full 100%, but bottleneck somewhere and drastically reduce performance for some reason.

I'm guessing this is a limitation coming either from R/MKL or from how the Rcpp code is compiled, since 10-12 cores seems oddly specific. Can this be improved with some compiler hints for the Rcpp code?

Secondly, I'm sure I'm not using the optimal BLAS/LAPACK libraries for the job. My guess is that a properly compiled R 4.0.2 should be significantly faster than Microsoft R Open 3.5.2, but I have no idea what I'm missing, whether AVX/AVX2 instructions are actually being used, and what else I should try on this machine.
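A quick way to verify what the interpreter is actually linked against (base R, no packages; sessionInfo() has printed the BLAS/LAPACK library paths since R 3.4):

```r
sessionInfo()    # shows the BLAS/LAPACK shared objects in use
La_version()     # version string of the linked LAPACK
extSoftVersion() # versions of other native libraries R was built with
```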

Lastly, I've seen zero guides on using AMD BLIS/libflame with R. If this is trivial, I'd appreciate any hints on what to look into.
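One low-effort way to try BLIS with R on Ubuntu, without rebuilding anything: Debian/Ubuntu ship BLIS as a drop-in BLAS via the alternatives system, and R loads whatever libblas.so.3 resolves to at startup. A sketch, assuming the distro package (e.g. libblis3-openmp; names may differ by release) is available:

```shell
# Install a BLIS build and point the system-wide BLAS symlink at it;
# R picks it up on next start, no recompilation needed.
sudo apt-get install libblis3-openmp
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
# LAPACK is selected separately and stays whatever is configured:
sudo update-alternatives --config liblapack.so.3-x86_64-linux-gnu
```

Note this route only swaps BLAS; as far as I can tell, libflame is not packaged as a LAPACK alternative, so LAPACK calls still go to whichever LAPACK the alternatives system points at.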

Until other (hopefully much better) answers pop up, I'll post my latest findings from guesswork here. Hopefully someone with a similar machine will find this useful. I'll try to expand the answer if any additional improvements come up.


  1. Guides for cleanly compiling R. They seem outdated, but hopefully not much is missing:

    Speed up RcppArmadillo: How to link to OpenBlas in an R package

    OpenBLAS and IntelMKL examples + Rstudio

  2. OpenBLAS performs terribly on my Ryzen + Ubuntu configuration: OpenBLAS 0.3.10, compiled with Zen 2 hints, uses all the CPU cores, but badly. top reports 3200% usage for the R instance, yet total CPU utilisation never rises above 20-30%. The result is at least 3x slower than with Intel MKL.

  3. Intel MKL. Versions up to 2019 work with the MKL_DEBUG_CPU_TYPE workaround. I can confirm that intel-mkl-64bit-2019.5-075 works.

    For later versions, starting with 2020.0-088, a different workaround is needed. In my benchmarks the performance did not improve, though this may change with future MKL releases.

  4. The 10-12 thread hard cap per instance appears to be controlled by several environment variables. I found the following list in an old guide. These may change in later versions, but they seem to work with 2019.5-075:

export MKL_NUM_THREADS=2
export OMP_NESTED="TRUE"
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1"
export OMP_NUM_THREADS=1
export MKL_DYNAMIC="TRUE"
export OMP_DYNAMIC="FALSE"

Playing around with various configurations, I found that, for the specific benchmark I tested, reducing the number of threads and spawning more workers increased performance drastically (around 3-4 fold). Even though the reported CPU usage was similar across configurations, 2 workers using 16 threads each (totalling ~70% CPU utilisation) were much slower than 16 workers using 2 threads each (at similar CPU utilisation). Results may vary between tasks, so these seem to be the go-to parameters to tune for every longer task.
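The worker-versus-thread split above can be reproduced with base R's parallel package; a sketch (4 workers here for brevity, 16 on the actual machine; each fresh worker sets MKL_NUM_THREADS before its first BLAS call, which is when MKL reads it):

```r
library(parallel)

cl <- makeCluster(4)  # one fresh R process per worker
# Cap BLAS threads per worker; effective because the variable is set
# before the worker touches any BLAS routine.
clusterEvalQ(cl, Sys.setenv(MKL_NUM_THREADS = "2"))
res <- parLapply(cl, 1:8, function(i) {
  A <- matrix(rnorm(500 * 500), 500)
  sum(crossprod(A))   # BLAS-heavy stand-in for the real task
})
stopCluster(cl)
```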

  5. AMD BLIS. I'm testing this as an alternative to MKL; still only experimenting, but performance seems on par with Intel MKL with all the fixes applied. I checked via perf that BLIS was actually being used: for my benchmarks the calls were bli_dgemmsup_rd_haswell_asm_6x8m, bli_daxpyv_zen_int10 and others. I'm not yet sure whether my settings for compiling BLIS were optimal. The takeaway could be that both MKL and BLIS are pushing the maximum from the CPU, given my specific benchmarks... or at least that both libraries are similarly well optimized.

An important downside of sticking with AMD BLIS: I noticed this only after months of usage, but there seem to be some unresolved issues with BLIS, or with the LAPACK packed into AMD's libraries, that I can't pin down. I've seen random, non-reproducible matrix multiplication errors (essentially hitting this problem), which are solved by switching back to the MKL libraries. I can't say whether the problem is in my way of building R or in the libraries themselves, so consider this a warning.
