Matrix-vector multiplication - R vs. Matlab

Question

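I have observed that multiplying a large matrix by a vector is much slower in R (version 3.6.1) than in Matlab (version 2019b), even though both languages rely on the (same?) BLAS library. See the minimal example below:

In Matlab: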

n=900; 
p=900; 
A=reshape(1:(n*p),[n,p]); 
x=ones(p,1); 
tic()
for id = 1:1000
  x = A*x; 
end
toc()

In R:

n=900
p=900
A=matrix(c(1:(n*p)),nrow=n,ncol=p)
x=rep(1,ncol(A))
t0 <- Sys.time()
for(iter in 1:1000){
  x = A%*%x
}
t1 <- Sys.time()
print(t1-t0)

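On the same computer, the run time is about 0.05 s in Matlab versus 3.5 s in R. Any idea what causes this difference?

Thanks.

[Edit]: I added a similar computation in C (using the CBLAS library, compiled with gcc cblas_dgemv.c -lblas -o cblas_dgemv, where cblas_dgemv.c is the source file below). I get a run time of roughly 0.08 s, which is quite close to the run time obtained with Matlab (0.05 s). I am still trying to find the reason for such a large run time in R (3.5 s).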

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include "cblas.h"

int main(int argc, char **argv)
{
  int m=900,adr;
  double *A,*x,*y;
  struct timeval t0,t1;

  /* memory allocation and initialization */
  A = (double*)malloc(m*m*sizeof(double)); 
  x = (double*)malloc(m*sizeof(double));  
  y = (double*)malloc(m*sizeof(double));  
  for(adr=0;adr<m;adr++) x[adr] = 1.; 
  for(adr=0;adr<m*m;adr++) A[adr] = adr;

  /* main loop */
  gettimeofday(&t0, NULL);
  for(adr=0;adr<1000;adr++)
    cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A,m,x,1,0.,y,1);
  gettimeofday(&t1, NULL);
  printf("elapsed time = %.2e seconds\n",(double)(t1.tv_usec-t0.tv_usec)/1000000. + (double)(t1.tv_sec-t0.tv_sec));

  /* free memory */
  free(A);
  free(x);
  free(y); 

  return EXIT_SUCCESS;
}

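Note that I could not set y=x in the cblas_dgemv routine, so this C computation differs slightly from the one in the R and Matlab code above. However, the code was compiled without any optimization flag (no -O3 option), and I checked that the matrix-vector product is indeed called at each iteration of the loop (performing 10 times more iterations leads to a 10 times longer run time).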
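Since the input and output vectors of a dgemv call must not overlap, one way to reproduce the x = A*x iteration of the R and Matlab loops in C is to alternate between two buffers and swap the pointers after each product. The sketch below illustrates the idea with a plain, unoptimized column-major product standing in for cblas_dgemv, so that it compiles without a BLAS library. The names matvec_colmajor and iterate_matvec are hypothetical helpers, not part of CBLAS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* y = A*x for an m-by-n matrix stored column-major
   (naive stand-in for cblas_dgemv, not an optimized kernel). */
static void matvec_colmajor(size_t m, size_t n, const double *A,
                            const double *x, double *y)
{
    assert(x != y);                      /* input and output must not alias */
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < n; j++)       /* walk A column by column */
        for (size_t i = 0; i < m; i++)
            y[i] += A[i + j * m] * x[j];
}

/* Apply x <- A*x niter times for a square m-by-m matrix A.
   work is caller-provided scratch space of length m.
   On return, the result is stored in x. */
static void iterate_matvec(size_t m, const double *A, double *x,
                           double *work, int niter)
{
    double *in = x, *out = work;
    for (int it = 0; it < niter; it++) {
        matvec_colmajor(m, m, A, in, out);
        double *tmp = in;                /* swap roles instead of copying */
        in = out;
        out = tmp;
    }
    if (in != x)                         /* odd niter: latest result is in work */
        memcpy(x, in, m * sizeof *x);
}
```

With the 2-by-2 matrix [1 2; 3 4] and x = (1, 1), one iteration gives (3, 7) and two iterations give (17, 37). In the timed program, the call to matvec_colmajor would be replaced by cblas_dgemv with the same in and out pointers.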

Answer 1

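Here's something kind of shocking:

The precompiled R distribution that is downloaded from CRAN makes use of the reference BLAS/LAPACK implementation for linear algebra operations.

The "reference BLAS" is the non-optimized, non-accelerated BLAS, unlike OpenBLAS or Intel MKL. Matlab uses MKL, which is accelerated.

This seems to be confirmed by sessionInfo() in my R 3.6.0 on macOS: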

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

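If I'm reading this right, it means that by default R uses a slow BLAS, and if you want it to run faster, you need to do some configuration to make it use a fast BLAS instead.

This is a bit surprising to me. As I understand it, the reference BLAS is generally intended for testing and development, not for "actual work".

I get roughly the same timings as you do with R 3.6 vs. Matlab R2019b on macOS 10.14: 0.04 s in Matlab and 4.5 s in R. I think that is consistent with R using a non-accelerated BLAS.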

Answer 2

I used Rcpp to interface a C++ implementation of the matrix-vector product with R.

Source C++ file ('cblas_dgemv2.cpp'): performs 'niter' products y = A*x

#include <cblas-openblas.h>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericVector cblas_dgemv2(Rcpp::NumericMatrix A, Rcpp::NumericVector x, int niter)
{
  int m=A.ncol(),iter;
  Rcpp::NumericVector y(m);
  for(iter=0;iter<niter;iter++)
    cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A.begin(),m,x.begin(),1,0.,y.begin(),1);
  return y; 
}

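Then I ran two experiments using the R code below:

Experiment 1: from R, call y=cblas_dgemv2(A,x,1000) once, which computes the product y=A*x 1000 times in C++;
Experiment 2: from R, call y=cblas_dgemv2(A,x,1) 1000 times, each call computing the product y=A*x once in C++.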

# compile cblas_dgemv2 (you may need to update the path to the library)
PKG_LIBS <- '/usr/lib/x86_64-linux-gnu/libopenblas.a' 
PKG_CPPFLAGS <- '-I/usr/include/x86_64-linux-gnu'
Sys.setenv(PKG_LIBS = PKG_LIBS , PKG_CPPFLAGS = PKG_CPPFLAGS) 
Rcpp::sourceCpp('cblas_dgemv2.cpp', rebuild = TRUE)

# create A and x 
n=900
A=matrix(c(1:(n*n)),nrow=n,ncol=n)
x=rep(1,n)

# Experiment 1: call 1 time cblas_dgemv2 (with 1000 iterations in C++)
t0 <- Sys.time()
y=cblas_dgemv2(A,x,1000) # perform 1000 times the computation y = A%*%x 
t1 <- Sys.time()
print(t1-t0)

# Experiment 2: call 1000 times cblas_dgemv2  
t0 <- Sys.time()
for(iter in 1:1000){
  y=cblas_dgemv2(A,x,1) # perform 1 times the computation y = A%*%x 
}
t1 <- Sys.time()
print(t1-t0)

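The run time of the first experiment is 0.08 s, while that of the second is 4.8 s.

My conclusion: the bottleneck in terms of run time comes from the data transfer between R and C++, not from the computation of the matrix-vector product itself. Surprising, isn't it?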
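One candidate for this per-call cost, offered as an assumption rather than something measured in this thread: in R, 1:(n*n) creates an integer vector, so A is stored as an integer matrix, whereas Rcpp::NumericMatrix holds doubles. If that is what happens here, each call to cblas_dgemv2 first converts all n*n entries to double, and experiment 2 repeats that conversion 1000 times while experiment 1 does it once. The plain C sketch below mimics the two call patterns with hypothetical helpers (to_double, matvec, product_converting_each_call); it uses neither R, Rcpp, nor BLAS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Copy an integer matrix into freshly allocated double storage.
   This models the conversion assumed to happen on every call.
   The caller must free the returned pointer. */
static double *to_double(const int *src, size_t len)
{
    double *dst = malloc(len * sizeof *dst);
    assert(dst != NULL);
    for (size_t k = 0; k < len; k++)
        dst[k] = (double)src[k];
    return dst;
}

/* y = A*x for a square m-by-m matrix of doubles stored column-major
   (naive loop, standing in for the BLAS call). */
static void matvec(size_t m, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < m; i++)
            y[i] += A[i + j * m] * x[j];
}

/* Pattern of experiment 2: convert the whole matrix, multiply once,
   then throw the converted copy away. */
static void product_converting_each_call(size_t m, const int *Ai,
                                         const double *x, double *y)
{
    double *Ad = to_double(Ai, m * m);   /* O(m*m) copy on every call */
    matvec(m, Ad, x, y);
    free(Ad);
}
```

Calling product_converting_each_call 1000 times copies m*m values 1000 times, whereas calling to_double once and reusing the result for every matvec pays that cost a single time. On the R side, the corresponding change would be to store A as double before the loop, for example with storage.mode(A) <- "double". Whether this accounts for the full 4.8 s was not tested here.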

2 answers

Answer 1: score 10, 2020-02-14 21:27:36
Answer 2: score 1, accepted, 2020-02-17 11:58:18
