Elementwise matrix multiplication: R versus Rcpp (How to speed this code up?)
I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calculations.
Consider the following example:
    testmat <- matrix(1:9, nrow=3)
    testvec <- 1:3
    testmat*testvec
    #      [,1] [,2] [,3]
    # [1,]    1    4    7
    # [2,]    4   10   16
    # [3,]    9   18   27
Here, R recycled testvec so that, loosely speaking, testvec "became" a matrix of the same dimensions as testmat for the purpose of this multiplication. Then the Hadamard product is returned. I wish to implement this behavior using Rcpp; that is, I want each element of the i-th row of the matrix testmat to be multiplied by the i-th element of the vector testvec.
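To make the recycling rule concrete, here is a minimal plain C++ sketch (no Rcpp, so it compiles on its own). It assumes a column-major buffer, which is how R lays out matrices, and uses a hypothetical helper name, `recycle_multiply`. Element `idx` of the matrix is paired with element `idx % vec.size()` of the vector; when the vector length equals the number of rows, this reduces to scaling row i by `vec[i]`.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: emulates R's `mat * vec` on a column-major buffer.
// Element idx of `mat` is multiplied by vec[idx % vec.size()].
std::vector<double> recycle_multiply(const std::vector<double>& mat,
                                     const std::vector<double>& vec) {
    if (vec.empty()) {
        // Design choice of this sketch: refuse an empty vector outright.
        throw std::invalid_argument("vec must not be empty");
    }
    std::vector<double> out(mat.size());
    for (std::size_t idx = 0; idx < mat.size(); ++idx) {
        out[idx] = mat[idx] * vec[idx % vec.size()];
    }
    return out;
}
```

With the 3 x 3 example above stored column by column (1, 2, ..., 9) and the vector (1, 2, 3), the result holds the same columns that R prints.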
My benchmarks tell me that my implementations are extremely slow, and I would appreciate advice on how to speed this up. Here is my code:
First, using Eigen:
    #include <RcppEigen.h>
    // [[Rcpp::depends(RcppEigen)]]
    using namespace Rcpp;
    using namespace Eigen;
    // [[Rcpp::export]]
    NumericMatrix E_matvecprod_elwise(NumericMatrix Xs, NumericVector ys){
        Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
        Map<VectorXd> y(as<Map<VectorXd> >(ys));
        int k = X.cols();
        int n = X.rows();
        MatrixXd Y(n,k) ;
        // here, I emulate R's recycling. I did not find an easier way of doing this. Any hint appreciated.
        for(int i = 0; i < k; ++i) {
            Y.col(i) = y;
        }
        MatrixXd out = X.cwiseProduct(Y);
        return wrap(out);
    }
Here is my implementation using Armadillo (adjusted to follow Dirk's example, see answer below):
    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp;
    using namespace arma;
    // [[Rcpp::export]]
    arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
        int k = X.n_cols ;
        arma::mat Y = repmat(y, 1, k) ; // replicate y across all k columns
        arma::mat out = X % Y;          // % is Armadillo's elementwise product
        return out;
    }
Benchmarking these solutions using R, Eigen or Armadillo shows that both Eigen and Armadillo are about 2.5 times slower than R. Is there a way to speed these computations up, or to get at least as fast as R? Are there more elegant ways of setting this up? Any advice is appreciated and welcome. (I also encourage tangential remarks about programming style in general, as I am new to Rcpp / C++.)
Here are some reproducible benchmarks:
    library(rbenchmark)
    # for comparison, define R function:
    R_matvecprod_elwise <- function(mat, vec) mat*vec
    n <- 50000
    k <- 50
    X <- matrix(rnorm(n*k), nrow=n)
    e <- rnorm(n)
    benchmark(R_matvecprod_elwise(X, e), A_matvecprod_elwise(X, e), E_matvecprod_elwise(X, e),
              columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)
This yields

                           test replications elapsed relative
    1 R_matvecprod_elwise(X, e)         1000   10.89    1.000
    2 A_matvecprod_elwise(X, e)         1000   26.87    2.467
    3 E_matvecprod_elwise(X, e)         1000   27.73    2.546
As you can see, my Rcpp solutions perform quite miserably. Is there any way to do better?
If you want to speed up your calculations, you will have to be careful not to make copies. This usually means sacrificing readability. Here is a version that makes no copies and modifies the matrix X in place.
    #include <Rcpp.h>
    using namespace Rcpp;
    // [[Rcpp::export]]
    NumericMatrix Rcpp_matvecprod_elwise(NumericMatrix & X, NumericVector & y){
        unsigned int ncol = X.ncol();
        unsigned int nrow = X.nrow();
        int counter = 0;  // linear index into X, which is stored column by column
        for (unsigned int j=0; j<ncol; j++) {
            for (unsigned int i=0; i<nrow; i++) {
                X[counter++] *= y[i];
            }
        }
        return X;
    }
Here is what I get on my machine:
    > library(microbenchmark)
    > microbenchmark(R=R_matvecprod_elwise(X, e), Arma=A_matvecprod_elwise(X, e), Rcpp=Rcpp_matvecprod_elwise(X, e))
    Unit: milliseconds
     expr       min        lq    median       uq      max neval
        R  8.262845  9.386214 10.542599 11.53498 12.77650   100
     Arma 18.852685 19.872929 22.782958 26.35522 83.93213   100
     Rcpp  6.391219  6.640780  6.940111  7.32773  7.72021   100
    > all.equal(R_matvecprod_elwise(X, e), Rcpp_matvecprod_elwise(X, e))
    [1] TRUE
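A caveat worth keeping in mind when benchmarking in-place versions: an Rcpp `NumericMatrix` built from a numeric R matrix wraps the same memory rather than copying it, so `Rcpp_matvecprod_elwise` also changes `X` on the R side, and every repetition multiplies the already-scaled matrix again. The plain C++ sketch below uses hypothetical names, with `std::vector` standing in for the R matrix, to contrast the two calling conventions.

```cpp
#include <cstddef>
#include <vector>

// In-place variant: scales row i of the column-major buffer x by y[i].
// The caller's data is modified. Assumes y.size() == nrow.
void scale_rows_inplace(std::vector<double>& x, std::size_t nrow,
                        const std::vector<double>& y) {
    if (nrow == 0) return;  // avoid a modulo by zero
    for (std::size_t idx = 0; idx < x.size(); ++idx) {
        x[idx] *= y[idx % nrow];
    }
}

// Copying variant: takes x by value, so the caller's data stays untouched.
std::vector<double> scale_rows_copy(std::vector<double> x, std::size_t nrow,
                                    const std::vector<double>& y) {
    scale_rows_inplace(x, nrow, y);
    return x;
}
```

Calling the in-place variant twice scales the data twice, which is what a benchmark with many replications does to `X`. Passing a fresh copy to each timed call is one way to keep the comparison fair.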
For starters, I'd write the Armadillo version (interface) as
    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    using namespace Rcpp;
    using namespace arma;
    // [[Rcpp::export]]
    arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
        int k = X.n_cols ;
        arma::mat Y = repmat(y, 1, k) ; // replicate y across all k columns
        arma::mat out = X % Y;
        return out;
    }
as you're doing an additional conversion in and out (though the wrap() gets added by the glue code). The const & is notional (as you learned via your last question, a SEXP is a pointer object that is lightweight to copy) but better style.
You didn't show your benchmark results, so I can't comment on the effect of matrix size and so on. I suspect you might get better answers on rcpp-devel than here. Your pick.
Edit: If you really want something cheap and fast, I would just do this:
    // [[Rcpp::export]]
    mat cheapHadamard(mat X, vec y) {
        // should check the row dimension of X against the length of y here
        for (unsigned int i=0; i<y.n_elem; i++) X.row(i) *= y(i);
        return X;
    }
which allocates no new memory and will hence be faster, and probably be competitive with R.
Test output:
    R> cheapHadamard(testmat, testvec)
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    4   10   16
    [3,]    9   18   27
    R>
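One thing that may be worth measuring with the row-wise loop: Armadillo stores matrices in column-major order, so `X.row(i)` touches elements that lie `n_rows` apart in memory. Whether this matters depends on the matrix size and the compiler. As a plain C++ sketch (hypothetical function names, no Armadillo), the two loop orders below compute the same result; the second walks memory contiguously.

```cpp
#include <cstddef>
#include <vector>

// Row by row: for each row i, visit x(i, 0), x(i, 1), ... (stride nrow).
// Assumes x.size() == nrow * ncol and y.size() == nrow.
void scale_by_rows(std::vector<double>& x, std::size_t nrow, std::size_t ncol,
                   const std::vector<double>& y) {
    for (std::size_t i = 0; i < nrow; ++i)
        for (std::size_t j = 0; j < ncol; ++j)
            x[i + j * nrow] *= y[i];
}

// Column by column: consecutive elements of one column are adjacent in memory.
void scale_by_cols(std::vector<double>& x, std::size_t nrow, std::size_t ncol,
                   const std::vector<double>& y) {
    for (std::size_t j = 0; j < ncol; ++j)
        for (std::size_t i = 0; i < nrow; ++i)
            x[i + j * nrow] *= y[i];
}
```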
My apologies for giving an essentially C answer to a C++ question, but as has been suggested, the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply, so you would have to implement your own.
Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated, but for most applications using R that isn't a concern. Note that this also operates "in place", which means that it modifies X without copying it.
    // it may be necessary on your system to uncomment one of the following
    //#define restrict __restrict__ // gcc/clang
    //#define restrict __restrict // MS Visual Studio
    //#define restrict // remove it completely
    #include <Rcpp.h>
    using namespace Rcpp;
    #include <cstdlib>
    using std::size_t;
    void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
                                               size_t numRows, size_t numCols,
                                               const double* restrict y)
    {
        if (numRows == 0 || numCols == 0) return;
        for (size_t col = 0; col < numCols; ++col) {
            double* restrict x_col = x + col * numRows;
            for (size_t row = 0; row < numRows; ++row) {
                x_col[row] *= y[row];
            }
        }
    }
    // [[Rcpp::export]]
    NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
                                              const NumericVector& y)
    {
        // do some dimension checking here
        hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
                                              y.begin());
        return X;
    }
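The `// do some dimension checking here` placeholder could be filled along these lines. This is a plain C++ sketch with a hypothetical helper name, `check_dims`; inside an Rcpp function one would more likely report the problem with `Rcpp::stop()`, which surfaces as an R error.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: throws if the vector length does not match the
// number of matrix rows, which the worker function silently assumes.
void check_dims(std::size_t numRows, std::size_t vecLength) {
    if (numRows != vecLength) {
        throw std::invalid_argument(
            "length of y (" + std::to_string(vecLength) +
            ") must equal the number of rows of X (" +
            std::to_string(numRows) + ")");
    }
}
```

Without such a check, a vector shorter than the number of rows would make the worker read past the end of `y`.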
Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively without incurring a substantial performance hit. Creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.
    #include <Rcpp.h>
    using namespace Rcpp;
    #include <cstdlib>
    using std::size_t;
    #include <R.h>
    #include <Rdefines.h>
    void hadamardMultiplyMatrixByVector(const double* restrict x,
                                        size_t numRows, size_t numCols,
                                        const double* restrict y,
                                        double* restrict z)
    {
        if (numRows == 0 || numCols == 0) return;
        for (size_t col = 0; col < numCols; ++col) {
            const double* restrict x_col = x + col * numRows;
            double* restrict z_col = z + col * numRows;
            for (size_t row = 0; row < numRows; ++row) {
                z_col[row] = x_col[row] * y[row];
            }
        }
    }
    // [[Rcpp::export]]
    SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
    {
        size_t numRows = X.nrow();
        size_t numCols = X.ncol();
        // do some dimension checking here
        SEXP Z = PROTECT(Rf_allocVector(REALSXP, (int) (numRows * numCols)));
        SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
        int* dims = INTEGER(dimsExpr);
        dims[0] = (int) numRows;
        dims[1] = (int) numCols;
        Rf_setAttrib(Z, R_DimSymbol, dimsExpr);
        hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));
        UNPROTECT(2);
        return Z;
    }
If you're curious about the usage of restrict, it means that you as the programmer enter a contract with the compiler that different bits of memory do not overlap, allowing the compiler to make certain optimizations. The restrict keyword is part of C99 but not of standard C++; many compilers nevertheless provide it in C++ as an extension (for example __restrict__ in gcc/clang and __restrict in MS Visual Studio), which is what the commented-out #define lines above are for.
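As a small self-contained sketch of how the compiler-specific spellings might be wrapped, the block below defines a macro (the name `RESTRICT` and the function `multiply_no_alias` are illustrative choices, not something from the code above) and uses it on a simple elementwise product.

```cpp
#include <cstddef>

// Map a portable macro onto whatever the compiler offers.
#if defined(_MSC_VER)
  #define RESTRICT __restrict
#elif defined(__GNUC__) || defined(__clang__)
  #define RESTRICT __restrict__
#else
  #define RESTRICT  // unknown compiler: drop the hint, the code stays correct
#endif

// z = x * y elementwise. The caller promises that z does not overlap x or y.
void multiply_no_alias(const double* RESTRICT x, const double* RESTRICT y,
                       double* RESTRICT z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = x[i] * y[i];
    }
}
```

The qualifier is only a promise to the optimizer: if the buffers do overlap, the behavior is undefined, so it should only be used where the caller can guarantee separate storage.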
Some R code to benchmark:
    require(rbenchmark)
    n <- 50000
    k <- 50
    X <- matrix(rnorm(n*k), nrow=n)
    e <- rnorm(n)
    R_matvecprod_elwise <- function(mat, vec) mat*vec
    all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
    X_dup <- X + 0
    all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))
    benchmark(R_matvecprod_elwise(X, e),
              C_matvecprod_elwise(X, e),
              C_matvecprod_elwise_inplace(X, e),
              columns = c("test", "replications", "elapsed", "relative"),
              order = "relative", replications = 1000)
And the results:

                                   test replications elapsed relative
    3 C_matvecprod_elwise_inplace(X, e)         1000   3.317    1.000
    2         C_matvecprod_elwise(X, e)         1000   7.174    2.163
    1         R_matvecprod_elwise(X, e)         1000  10.670    3.217
Finally, the in-place version may actually be faster than this timing suggests: the benchmark multiplies into the same matrix 1000 times, and the repeated multiplications can cause some overflow mayhem.
Edit: Removed the loop unrolling, as it provided no benefit and was otherwise distracting.
I'd like to build on Sameer's answer, but I don't have enough reputation to comment.
I personally got better performance (about 50%) in Eigen using:
    return (y.asDiagonal() * X);
Despite the appearance, this does not create an n x n temporary for y.
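For readers wondering why a product with a diagonal matrix is the same operation as the recycled elementwise product, the identity can be written out as follows (a short derivation, with n denoting the number of rows of X and y of length n):

```latex
\[
\bigl(\operatorname{diag}(y)\,X\bigr)_{ij}
  = \sum_{k=1}^{n} \operatorname{diag}(y)_{ik}\, X_{kj}
  = y_i\, X_{ij},
\]
```

since the only nonzero entry in row i of diag(y) is the one at k = i. Each row i of X is therefore scaled by y_i, which is what `mat * vec` produces in R when the vector length equals the number of rows.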