Why is MKL in parallel not faster than serial in R 3.6?

Question

I am trying to use Intel's MKL with R and adjust the number of threads using the MKL_NUM_THREADS variable.

It loads correctly, and I can see it using 3200% CPU in htop. However, it isn't actually faster than using only one thread.

I've been adapting Dirk Eddelbuettel's guide for CentOS, but I may have missed a flag or config setting somewhere.

Here is a simplified version of how I am testing how the number of threads relates to job time. I do get the expected results when using OpenBLAS.

require(callr)
#> Loading required package: callr
f <- function(i)  r(function() crossprod(matrix(1:1e9, ncol=1000))[1], 
      env=c(rcmd_safe_env(),
            R_LD_LIBRARY_PATH=MKL_R_LD_LIBRARY_PATH, 
            MKL_NUM_THREADS=as.character(i), 
            OMP_NUM_THREADS="1")
)

system.time(f(1))
#>    user  system elapsed 
#>  14.675   2.945  17.789
system.time(f(4))
#>    user  system elapsed 
#>  54.528   2.920  19.598
system.time(f(8))
#>    user  system elapsed 
#> 115.628   3.181  20.364
system.time(f(32)) 
#>    user  system elapsed 
#> 787.188   7.249  36.388

Created on 2020-05-13 by the reprex package (v0.3.0)

EDIT 5/18

Per the suggestion to try MKL_VERBOSE=1, I now see the following on stdout, which shows it is properly calling MKL (here the BLAS routine DSYRK):

MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7fff436222c0,0x7f71024ef040,1000000,0x7fff436222d0,0x7f7101d4d040,1000) 10.64s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:1

For f(8), it shows NThr:8:

MKL_VERBOSE Intel(R) MKL 2020.0 Product build 20191122 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.50GHz lp64 intel_thread
MKL_VERBOSE DSYRK(U,T,1000,1000000,0x7ffe6b39ab40,0x7f4bb52eb040,1000000,0x7ffe6b39ab50,0x7f4bb4b49040,1000) 11.98s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:8

I am still not getting the expected performance increase from extra cores.

EDIT 2

I am able to get the expected results using Microsoft's distribution of MKL, but not with Intel's official distribution as in the walkthrough. It appears that MS is using a GNU threading library; could the problem be in the threading library and not in BLAS/LAPACK itself?

Answer 1

Only seeing this now: did you check the obvious one, i.e. whether R on CentOS actually picks up the MKL?

As I recall, R on CentOS is built in a more, ahem, "restricted" mode with the shipped-with-R reference BLAS. If and when that is the case, you simply cannot switch to another one, as we have done within Debian and Ubuntu for 20+ years, because that requires a different initial choice when R is compiled.
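One way to check which BLAS the R binary actually picks up is to ask R itself. The following is a minimal sketch, assuming Rscript is on the PATH and R is 3.4.0 or later (where sessionInfo() reports the BLAS and LAPACK library paths). The function name check_r_blas is made up for illustration. Run it with the same environment (e.g. the same R_LD_LIBRARY_PATH) that the benchmark uses, otherwise it may report a different library.

```shell
# Print the BLAS/LAPACK shared libraries that the R session reports.
# check_r_blas is a hypothetical helper name, not part of R or MKL.
check_r_blas() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found on PATH" >&2
    return 1
  fi
  # sessionInfo() carries $BLAS and $LAPACK paths in R >= 3.4.0.
  Rscript -e 'si <- sessionInfo(); cat("BLAS:  ", si$BLAS, "\n"); cat("LAPACK:", si$LAPACK, "\n")'
}

check_r_blas || true
```

If the reported paths point at R's own libRblas.so rather than an MKL library, the MKL is not being used for that session.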

Edit: Per subsequent discussions (see comments below) we all re-realized that it is important to have the threading libraries / models aligned. The MKL is an Intel product and defaults to using their threading library; on Linux, the GNU compiler is closer to the system and has its own. That latter one needs to be selected. In my writeup / script for the MKL on .deb systems I use

echo "MKL_THREADING_LAYER=GNU" >> /etc/environment

to set this "system-wide" on the machine; one can also add it just to the R environment files.
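For the R-only variant, a sketch along these lines may work. It assumes the per-user file is ~/.Renviron (or whatever R_ENVIRON_USER points to) and that R reads it before the MKL is first called. The function name set_mkl_threading_layer is hypothetical.

```shell
# Append MKL_THREADING_LAYER=GNU to an R environment file, once.
# set_mkl_threading_layer is a made-up helper name for illustration.
set_mkl_threading_layer() {
  # Default target: R_ENVIRON_USER if set, else ~/.Renviron.
  renviron="${1:-${R_ENVIRON_USER:-$HOME/.Renviron}}"
  line="MKL_THREADING_LAYER=GNU"
  # Only append when the exact line is not already present.
  grep -qxF "$line" "$renviron" 2>/dev/null || echo "$line" >> "$renviron"
}

# Usage (writes to the per-user file):
# set_mkl_threading_layer
```

Unlike the /etc/environment approach, this needs no root access and affects only R sessions started by that user.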

Answer 2

I am not sure exactly how R calls MKL, but if the crossprod function calls MKL's gemm underneath, then we should see very good scalability with such inputs. What are the input problem sizes? MKL supports a verbose mode, which can show a lot of useful runtime information while dgemm is running. Could you try exporting MKL_VERBOSE=1 in the environment and check the log? Though I am not sure whether R will suppress the output.
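The verbose-mode check can be sketched as follows, assuming Rscript is on the PATH and linked against MKL. The helper name run_mkl_verbose and the small 2000 x 1000 test matrix are choices made here for illustration, not taken from the question.

```shell
# Run a small crossprod under R with MKL verbose logging enabled
# and keep only the MKL_VERBOSE lines.
# run_mkl_verbose is a hypothetical helper name.
run_mkl_verbose() {
  threads="$1"
  MKL_VERBOSE=1 MKL_NUM_THREADS="$threads" \
    Rscript -e 'invisible(crossprod(matrix(rnorm(2e6), ncol = 1000)))' 2>&1 |
    grep '^MKL_VERBOSE' || echo "no MKL_VERBOSE output for $threads thread(s)"
}

# Compare the logged routine, timing and NThr value across thread counts.
if command -v Rscript >/dev/null 2>&1; then
  for n in 1 4 8; do run_mkl_verbose "$n"; done
fi
```

If no MKL_VERBOSE lines appear, R is most likely not calling the MKL at all. If they appear with the requested NThr but similar timings, the routine is being called with threads that do not help.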

2 answers

solution1
4 ACCEPTED 2020-05-16 04:20:29

solution2
0 2020-05-16 03:54:24