矩阵求逆并转置R vs c ++

Question

我有计算网络的功率流量。 但是，我发现对于较大的网络，计算速度很慢。 我尝试使用RcppArmadillo实现该算法。 对于小型网络/矩阵，Rcpp函数要快几倍，但对于较大的网络/矩阵，Rcpp函数的速度要慢一些。 操作的速度正在进一步影响功能的有用性。 我写的功能很糟糕还是这只是我的选择？

下面的R代码给出了10,100,1000的矩阵的执行时间的示例。

library(igraph); library(RcppArmadillo)

#Create the basic R implementation of the equation
flowCalc <- function(A,C){

  B <- t(A) %*% C %*% A
  Imp <- solve(B)
  PTDF <- C %*% A %*% Imp
  Out <- list(Imp = Imp, PTDF= PTDF)
  return(Out)

}

#Create the c++ implementation
txt <- 'arma::mat Am = Rcpp::as< arma::mat >(A);
   arma::mat Cm = Rcpp::as< arma::mat >(C);
   arma::mat B = inv(trans(Am) * Cm * Am);
   arma::mat PTDF = Cm * Am * B;
   return Rcpp::List::create( Rcpp::Named("Imp") = B ,
                           Rcpp::Named("PTDF") = PTDF ) ; '

flowCalcCpp <- cxxfunction(signature(A="numeric",
    C="numeric"),
    body=txt,
    plugin="RcppArmadillo")


#Create a function to generate dummy data
MakeData <- function(n, edgep){#make an erdos renyi graph of size x
    g <- erdos.renyi.game(n, edgep)
    #convert to adjacency matrix
    Adjmat <- as_adjacency_matrix(g, sparse = F)
    #create random graph and mask the elements with not edge
    Cmat <- matrix(rnorm(n*n), ncol = n)*Adjmat
    ##Output a list of the two matrices
    list(A = Adjmat, C = Cmat)
}

#generate dummy data
set.seed(133)
Data10 <- MakeData(10, 1/1)
Data100 <- MakeData(100, 1/10)
Data1000 <- MakeData(1000, 1/100)

#Compare results
BenchData <- microbenchmark(
               R10 = flowCalc(Data10$A, Data10$C),
               R100 = flowCalc(Data100$A, Data100$C),
               R1000 = flowCalc(Data1000$A, Data1000$C),
               Cpp10 = flowCalcCpp(Data10$A, Data10$C),
               Cpp100 = flowCalcCpp(Data100$A, Data100$C),
               Cpp1000 = flowCalcCpp(Data1000$A, Data1000$C))

速度结果如下所示。

编辑：

在阅读了Ralf的回答和Dirk的评论后，我使用https://cran.r-project.org/web/packages/gcbd/vignettes/gcbd.pdf来更好地概述BLAS实现中的差异

然后我使用Dirk的指南来安装微软BLAS实现https://github.com/eddelbuettel/mkl4deb （显然，我现在住在Eddelverse）

完成后我按照Ralf的建议安装了ArrayFire和RcppArrayFire。 然后我运行代码并得到以下结果

Unit: microseconds
    expr        min          lq        mean      median          uq         max neval
     R10     37.348     83.2770    389.6690    143.9530    181.8625   25022.315   100
    R100    464.148    587.9130   1187.3686    680.8650    849.0220   32602.678   100
   R1000 143065.901 160290.2290 185946.5887 191150.2335 201894.5840  464179.919   100
   Cpp10     11.946     30.6120    194.8566     55.6825     74.0535   13732.984   100
  Cpp100    357.880    452.7815    987.8059    496.9520    554.5410   39359.877   100
 Cpp1000 102949.722 124813.9535 136898.4688 132852.9335 142645.6450  214611.656   100
    AF10    713.543    833.6270   1135.0745    966.5920   1057.4175    8177.957   100
   AF100   2809.498   3152.5785   3662.5323   3313.0315   3569.7785   12581.535   100
  AF1000  77905.179  81429.2990 127087.2049  82579.6365  87249.0825 3834280.133   100

对于较小的矩阵，速度降低，对于大约100的矩阵，速度降低两倍，但对于较大的矩阵，速度几乎快10倍，这与Ralf的结果一致。 使用C ++的差异也在增长。

在我的数据上BLAS升级它值得使用C ++版本，但可能不是Arrayfire版本，因为我的矩阵不够大。

Answer 1

我已经在运行Debian Linux的双核笔记本电脑和与OpenBLAS链接的R 3.5.1上尝试过您的代码。 此外，我还使用RcppArrayFire使用GPU进行这些计算（没什么特别的，只是内置图形）。 这里的基准测试结果：

Unit: microseconds
    expr        min         lq        mean     median         uq        max neval  cld
     R10     55.488     90.142    117.9538    133.617    138.009    161.604    10 a   
    R100    698.883    711.488   1409.8286    773.352    901.339   5647.804    10 a   
   R1000 198612.971 213531.557 225713.6705 214993.916 226526.513 306675.313    10    d
   Cpp10     16.157     31.478     38.3473     40.529     51.122     52.351    10 a   
  Cpp100    519.527    544.728    573.0099    570.956    610.985    613.400    10 a   
 Cpp1000 174121.655 184865.003 196224.3825 193142.715 207066.037 223900.830    10   c 
    AF10    700.805    790.203   1469.4980   1347.639   1410.717   3824.905    10 a   
   AF100   1376.737   1562.675   1818.8094   1898.797   1949.364   2222.889    10 a   
  AF1000 106398.431 109130.482 118817.6704 118612.679 123193.649 139474.579    10  b

在最小尺寸（R10和Cpp10），您的机器比我的快。 但是已经在R100和Cpp100，但特别是在R1000和Cpp1000，我的执行速度更快。 正如Dirk在评论中指出的那样，您应该研究优化/并行BLAS实现。 在Debian Linux（以及像Ubuntu这样派生的Linux）上，这很简单

  sudo apt-get install libopenblas-base

另见这里。 现在使用GPU的结果：这是非常典型的。 对于较小的矩阵，它比基础R和犰狳都差。 当数据在主存储器和GPU内存之间移动时，使用GPU会引入相当多的开销。 但是对于最大的大小，GPU上的并行执行超过了这种开销，并且性能提升非常好。

这里我的代码供参考。 我冒昧地将inline::cxxfunction更新为Rcpp::cppFunction ：

#Create the basic R implementation of the equation
flowCalc <- function(A,C){
  B <- t(A) %*% C %*% A
  Imp <- solve(B)
  PTDF <- C %*% A %*% Imp
  Out <- list(Imp = Imp, PTDF= PTDF)
  return(Out)

}

# Create Armadillo function
Rcpp::cppFunction(depends = "RcppArmadillo", code = '
Rcpp::List flowCalcCpp(const arma::mat &Am, const arma::mat &Cm) {
   arma::mat B = inv(trans(Am) * Cm * Am);
   arma::mat PTDF = Cm * Am * B;
   return Rcpp::List::create( Rcpp::Named("Imp") = B ,
                           Rcpp::Named("PTDF") = PTDF );
}')

# Create ArrayFire function
Rcpp::cppFunction(depends = "RcppArrayFire", code = '
Rcpp::List flowCalcAF(const RcppArrayFire::typed_array<f32> &A, 
                      const RcppArrayFire::typed_array<f32> &C) {
  af::array B = af::inverse(af::matmul(af::matmulTN(A, C), A));
  af::array PTDF = af::matmul(af::matmul(C, A), B);
  return Rcpp::List::create( Rcpp::Named("Imp") = B ,
                             Rcpp::Named("PTDF") = PTDF );
}')



library(igraph)
#Create a function to generate dummy data
MakeData <- function(n, edgep){#make an erdos renyi graph of size x
  g <- erdos.renyi.game(n, edgep)
  #convert to adjacency matrix
  Adjmat <- as_adjacency_matrix(g, sparse = F)
  #create random graph and mask the elements with not edge
  Cmat <- matrix(rnorm(n*n), ncol = n)*Adjmat
  ##Output a list of the two matrices
  list(A = Adjmat, C = Cmat)
}

#generate dummy data
set.seed(133)
Data10 <- MakeData(10, 1/1)
Data100 <- MakeData(100, 1/10)
Data1000 <- MakeData(1000, 1/100)

#Compare results
microbenchmark::microbenchmark(
  R10 = flowCalc(Data10$A, Data10$C),
  R100 = flowCalc(Data100$A, Data100$C),
  R1000 = flowCalc(Data1000$A, Data1000$C),
  Cpp10 = flowCalcCpp(Data10$A, Data10$C),
  Cpp100 = flowCalcCpp(Data100$A, Data100$C),
  Cpp1000 = flowCalcCpp(Data1000$A, Data1000$C),
  AF10 = flowCalcAF(Data10$A, Data10$C),
  AF100 = flowCalcAF(Data100$A, Data100$C),
  AF1000 = flowCalcAF(Data1000$A, Data1000$C),
  times = 10)

矩阵求逆并转置R vs c ++

问题描述

编辑：

1 个解决方案

解决方案1
2 已采纳 2018-10-11 09:04:39

矩阵求逆并转置R vs c ++

问题描述

编辑：

1 个解决方案

解决方案1 2 已采纳 2018-10-11 09:04:39

解决方案1
2 已采纳 2018-10-11 09:04:39