Optimizing R objective function with Rcpp slower, why?
I am currently working on a Bayesian method that requires multiple optimisation steps of a multinomial logit model per iteration. I am using optim() to perform those optimisations, with an objective function written in R. Profiling revealed that optim() is the main bottleneck.
After digging around, I found this question, in which they suggest that recoding the objective function with Rcpp could speed up the process. I followed the suggestion and recoded my objective function with Rcpp, but it ended up being slower (about two times slower!).
This was my first time with Rcpp (or anything related to C++), and I was not able to find a way of vectorising the code. Any idea how to make it faster?
Tl;dr: The current implementation of the function in Rcpp is not as fast as vectorised R; how can I make it faster?
A reproducible example:
Define the objective functions in R and Rcpp: log-likelihood of an intercept-only multinomial model
library(Rcpp)
library(microbenchmark)
llmnl_int <- function(beta, Obs, n_cat) {
  n_Obs <- length(Obs)
  Xint <- matrix(c(0, beta), byrow = T, ncol = n_cat, nrow = n_Obs)
  ind <- cbind(c(1:n_Obs), Obs)
  Xby <- Xint[ind]
  Xint <- exp(Xint)
  iota <- c(rep(1, (n_cat)))
  denom <- log(Xint %*% iota)
  return(sum(Xby - denom))
}
cppFunction('double llmnl_int_C(NumericVector beta, NumericVector Obs, int n_cat) {
  int n_Obs = Obs.size();
  NumericVector betas = (beta.size()+1);
  for (int i = 1; i < n_cat; i++) {
    betas[i] = beta[i-1];
  };
  NumericVector Xby = (n_Obs);
  NumericMatrix Xint(n_Obs, n_cat);
  NumericVector denom = (n_Obs);
  for (int i = 0; i < Xby.size(); i++) {
    Xint(i,_) = betas;
    Xby[i] = Xint(i,Obs[i]-1.0);
    Xint(i,_) = exp(Xint(i,_));
    denom[i] = log(sum(Xint(i,_)));
  };
  return sum(Xby - denom);
}')
## Draw sample from a multinomial distribution
set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))
## Benchmarking
microbenchmark("llmnl_int" = llmnl_int(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               "llmnl_int_C" = llmnl_int_C(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               times = 100)
## Results
# Unit: microseconds
# expr min lq mean median uq max neval
# llmnl_int 76.809 78.6615 81.9677 79.7485 82.8495 124.295 100
# llmnl_int_C 155.405 157.7790 161.7677 159.2200 161.5805 201.655 100
Call them in optim:
## Benchmarking with optim
microbenchmark("llmnl_int" = optim(c(4,2,1), llmnl_int, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
"llmnl_int_C" = optim(c(4,2,1), llmnl_int_C, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
times = 100)
## Results
# Unit: milliseconds
# expr min lq mean median uq max neval
# llmnl_int 12.49163 13.26338 15.74517 14.12413 18.35461 26.58235 100
# llmnl_int_C 25.57419 25.97413 28.05984 26.34231 30.44012 37.13442 100
I was somewhat surprised that the vectorised implementation in R was faster. Could a more efficient version in Rcpp (say, with RcppArmadillo?) produce any gains? Is it a better idea to recode everything in Rcpp using a C++ optimiser?
In general, if you are able to use vectorized functions, you will find them to be (almost) as fast as running your code directly in Rcpp. This is because many vectorized functions in R (almost all vectorized functions in base R) are written in C, C++ or Fortran, so there is often little to gain.
That said, there are improvements to be gained in both your R and Rcpp code. Optimization comes from carefully studying the code and removing unnecessary steps (memory allocation, sums, etc.).
Let's start with the Rcpp code optimization.
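In your case the main optimization is to remove unnecessary matrix and vector calculations. In essence, the code computes sum(betas[Obs]) - n_Obs * log(sum(exp(betas))), where betas = c(0, beta): the log-sum term is the same for every observation.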
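Written out as a formula, this is my reconstruction from the R code in the question. The notation is my own rather than the original author's: $y_i$ is the observed category of observation $i$, $n$ is `n_Obs`, $K$ is `n_cat`, and $\beta_1 = 0$ is the baseline category.

```latex
\begin{aligned}
\ell(\beta)
  &= \sum_{i=1}^{n} \left( \beta_{y_i} - \log \sum_{k=1}^{K} e^{\beta_k} \right) \\
  &= \sum_{i=1}^{n} \beta_{y_i} \;-\; n \, \log \sum_{k=1}^{K} e^{\beta_k}
\end{aligned}
```

The second line is the form that needs only one pass over the observations and a single log-sum.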
Using this observation we can reduce your code to 2 for-loops. Note that sum is simply another for-loop (more or less: for(i = 0; i < max; i++){ sum += x }), so avoiding the sums can speed up one's code further (in most situations this is unnecessary optimization!). In addition, your input Obs is an integer vector, and we can further optimize the code by using the IntegerVector type to avoid casting the double elements to integer values (credit to Ralf Stubner's answer).
cppFunction('double llmnl_int_C_v2(NumericVector beta, IntegerVector Obs, int n_cat)
{
  int n_Obs = Obs.size();

  NumericVector betas = (beta.size()+1);
  //1: shift beta
  for (int i = 1; i < n_cat; i++) {
    betas[i] = beta[i-1];
  };
  //2: Calculate log sum only once:
  double expBetas_log_sum = log(sum(exp(betas)));
  // pre allocate sum
  double ll_sum = 0;

  //3: Use n_Obs, to avoid calling Xby.size() every time
  for (int i = 0; i < n_Obs; i++) {
    // integer index: Obs holds 1-based labels, betas is 0-based
    ll_sum += betas[Obs[i] - 1];
  };
  //4: Use that we know denom is the same for all i:
  ll_sum = ll_sum - expBetas_log_sum * n_Obs;
  return ll_sum;
}')
Note that I have removed quite a few memory allocations and unnecessary calculations in the for-loop. I have also used the fact that denom is the same for all iterations, so it is simply multiplied by n_Obs for the final result.
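To see outside of R that hoisting denom out of the loop does not change the result, here is a minimal standalone sketch in plain C++. It uses std::vector instead of Rcpp types, and the function names loglik_naive and loglik_simplified are my own, not part of the answer.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive version: recomputes the log-sum for every observation,
// mirroring the structure of the original code.
double loglik_naive(const std::vector<double>& beta, const std::vector<int>& obs) {
    std::vector<double> betas(beta.size() + 1, 0.0);  // betas[0] = 0 is the baseline
    for (std::size_t k = 0; k < beta.size(); ++k) betas[k + 1] = beta[k];

    double ll = 0.0;
    for (int y : obs) {  // y is a 1-based category label, as in R
        double sum_exp = 0.0;
        for (double b : betas) sum_exp += std::exp(b);
        ll += betas[y - 1] - std::log(sum_exp);
    }
    return ll;
}

// Simplified version: the log-sum is computed once and multiplied by n.
double loglik_simplified(const std::vector<double>& beta, const std::vector<int>& obs) {
    std::vector<double> betas(beta.size() + 1, 0.0);
    for (std::size_t k = 0; k < beta.size(); ++k) betas[k + 1] = beta[k];

    double sum_exp = 0.0;
    for (double b : betas) sum_exp += std::exp(b);
    const double log_denom = std::log(sum_exp);

    double ll = 0.0;
    for (int y : obs) ll += betas[y - 1];
    return ll - log_denom * static_cast<double>(obs.size());
}
```

Both functions agree up to floating-point rounding; the second does one exp/log pass instead of one per observation.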
We can perform similar optimizations in your R code, which results in the function below:
llmnl_int_R_v2 <- function(beta, Obs, n_cat) {
  n_Obs <- length(Obs)
  betas <- c(0, beta)
  #note: denom = log(sum(exp(betas)))
  sum(betas[Obs]) - log(sum(exp(betas))) * n_Obs
}
Note that the complexity of the function has been drastically reduced, making it simpler for others to read. Just to be sure that I haven't messed up the code somewhere, let's check that they return the same results:
set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))
beta = c(4,2,1)
Obs = mnl_sample
n_cat = 4
xr <- llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xr2 <- llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc <- llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc2 <- llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
all.equal(c(xr, xr2), c(xc, xc2))
TRUE
Well, that's a relief.
I'll use microbenchmark to illustrate the performance. The optimized functions are fast, so I'll run each function 1e5 times to reduce the effect of the garbage collector:
microbenchmark("llmml_int_R" = llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C" = llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmnl_int_R_v2" = llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C_v2" = llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
times = 1e5)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmml_int_R 202.701 206.801 288.219673 227.601 334.301 57368.902 1e+05
# llmml_int_C 250.101 252.802 342.190342 272.001 399.251 112459.601 1e+05
# llmnl_int_R_v2 4.800 5.601 8.930027 6.401 9.702 5232.001 1e+05
# llmml_int_C_v2 5.100 5.801 8.834646 6.700 10.101 7154.901 1e+05
Here we see the same result as before. The new functions are roughly 35x faster (R) and 40x faster (Cpp) than their first counterparts. Interestingly, the optimized R function is still very slightly faster than my optimized Cpp function (0.3 microseconds, or about 4%, at the median). My best guess is that there is some overhead from the Rcpp package, and if this were removed the two would be nearly identical.
Similarly, we can check performance using optim.
microbenchmark("llmnl_int" = optim(beta, llmnl_int, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C" = optim(beta, llmnl_int_C, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
times = 1e3)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmnl_int 29541.301 53156.801 70304.446 76753.851 83528.101 196415.5 1000
# llmnl_int_C 36879.501 59981.901 83134.218 92419.551 100208.451 190099.1 1000
# llmnl_int_R_v2 667.802 1253.452 1962.875 1585.101 1984.151 22718.3 1000
# llmnl_int_C_v2 704.401 1248.200 1983.247 1671.151 2033.401 11540.3 1000
Once again the result is the same.
As a short conclusion, it is worth noting that this is one example where converting your code to Rcpp is not really worth the trouble. This is not always the case, but it is often worth taking a second look at your function to see if there are areas of your code where unnecessary calculations are performed. Especially in situations where one uses built-in vectorized functions, it is often not worth the time to convert code to Rcpp. One more often sees great improvements when converting for-loops whose code can't easily be vectorized to remove the loop.
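As an illustration of the kind of loop meant here (my own example, not taken from the question): a recursion in which each value depends on the previous, already-modified value cannot be replaced by a single cumulative-sum call, so an explicit loop in compiled code is a natural fit. A minimal standalone C++ sketch, with the hypothetical name clamped_cumsum:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Running sum that is clamped to [lo, hi] after every step.
// Each output depends on the previous clamped output, so this is
// not the same as clamping an ordinary cumulative sum afterwards.
std::vector<double> clamped_cumsum(const std::vector<double>& x,
                                   double lo, double hi) {
    std::vector<double> out(x.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc = std::min(hi, std::max(lo, acc + x[i]));
        out[i] = acc;
    }
    return out;
}
```

A loop of this shape, exposed to R through Rcpp, is where I would expect compiled code to pay off, in contrast to the log-likelihood above, which R already evaluates with vectorized primitives.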
I can think of four potential optimizations over Ralf's and Oliver's answers.
(You should accept their answers, but I just wanted to add my 2 cents.)
1) Use // [[Rcpp::export(rng = false)]] as a comment header to the function in a separate C++ file. By default the generated wrapper saves and restores R's random number generator state on every call, which this function does not need. This leads to a ~80% speed-up on my machine. (This is the most important suggestion of the 4.)
2) Prefer cmath when possible. (In this case, it doesn't seem to make a difference.)
3) Avoid allocation whenever possible, e.g. don't shift beta into a new vector.
4) Stretch goal: use SEXP parameters rather than Rcpp vectors. (Left as an exercise to the reader.) Rcpp vectors are very thin wrappers, but they're still wrappers and there is a small overhead.
These suggestions wouldn't be important if not for the fact that you're calling the function in a tight loop in optim. So any overhead is very important.
Bench:
microbenchmark("llmnl_int_R_v1" = optim(beta, llmnl_int, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v3" = optim(beta, llmnl_int_C_v3, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v4" = optim(beta, llmnl_int_C_v4, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
times = 1000)
Unit: microseconds
expr min lq mean median uq max neval cld
llmnl_int_R_v1 9480.780 10662.3530 14126.6399 11359.8460 18505.6280 146823.430 1000 c
llmnl_int_R_v2 697.276 735.7735 1015.8217 768.5735 810.6235 11095.924 1000 b
llmnl_int_C_v2 997.828 1021.4720 1106.0968 1031.7905 1078.2835 11222.803 1000 b
llmnl_int_C_v3 284.519 295.7825 328.5890 304.0325 328.2015 9647.417 1000 a
llmnl_int_C_v4 245.650 256.9760 283.9071 266.3985 299.2090 1156.448 1000 a
v3 is Oliver's answer with rng = false. v4 additionally includes suggestions #2 and #3.
The function:
#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export(rng = false)]]
double llmnl_int_C_v4(NumericVector beta, IntegerVector Obs, int n_cat) {
  int n_Obs = Obs.size();
  //2: Calculate log sum only once:
  // double expBetas_log_sum = log(sum(exp(betas)));
  double expBetas_log_sum = 1.0; // std::exp(0)
  for (int i = 1; i < n_cat; i++) {
    expBetas_log_sum += std::exp(beta[i-1]);
  };
  expBetas_log_sum = std::log(expBetas_log_sum);
  double ll_sum = 0;
  //3: Use n_Obs, to avoid calling Xby.size() every time
  for (int i = 0; i < n_Obs; i++) {
    // category 1 is the baseline with coefficient 0, so it adds nothing
    if(Obs[i] == 1L) continue;
    // beta is not shifted, so category k maps to beta[k-2]
    ll_sum += beta[Obs[i]-2L];
  };
  //4: Use that we know denom is the same for all i:
  ll_sum = ll_sum - expBetas_log_sum * n_Obs;
  return ll_sum;
}
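Suggestion 3 on its own can be sketched in plain C++ without Rcpp. This is an illustrative sketch under my own assumptions: std::vector stands in for the Rcpp vectors and the name loglik_no_shift is hypothetical. Instead of copying beta into a shifted vector with a leading 0, the sum starts at exp(0) = 1 and beta is indexed with an offset of 2.

```cpp
#include <cmath>
#include <vector>

// Intercept-only multinomial log-likelihood without allocating a shifted copy.
// beta holds the coefficients of categories 2..K; category 1 is the baseline (0).
// obs holds 1-based category labels, as they arrive from R.
double loglik_no_shift(const std::vector<double>& beta,
                       const std::vector<int>& obs) {
    double sum_exp = 1.0;  // exp(0): contribution of the baseline category
    for (double b : beta) sum_exp += std::exp(b);
    const double log_denom = std::log(sum_exp);

    double ll = 0.0;
    for (int y : obs) {
        if (y == 1) continue;  // baseline adds 0 to the sum
        ll += beta[y - 2];     // category k lives at beta[k - 2]
    }
    return ll - log_denom * static_cast<double>(obs.size());
}
```

With two categories, beta = {log 2} and observations {1, 2, 2}, this gives 2 log 2 - 3 log 3, which matches working the formula by hand.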
Your C++ function can be made faster using the following observations. At least the first might also be used with your R function:
1) The way you calculate denom[i] is the same for every i. It therefore makes sense to use a double denom and do this calculation only once. I also factor out subtracting this common term at the end.
2) Your observations are actually an integer vector on the R side, and you are using them as integers in C++ as well. Using an IntegerVector to begin with makes a lot of casting unnecessary.
3) You can index a NumericVector using an IntegerVector in C++ as well. I am not sure if this helps performance, but it makes the code a bit shorter.
4) Some more changes that relate more to style than to performance.
Result:
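#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double llmnl_int_C(NumericVector beta, IntegerVector Obs, int n_cat) {
  int n_Obs = Obs.size();

  // betas[0] stays 0 for the baseline category
  NumericVector betas(beta.size()+1);
  for (int i = 1; i < n_cat; ++i) {
    betas[i] = beta[i-1];
  };

  // common term, computed once
  double denom = log(sum(exp(betas)));
  // index with the (1-based) observations shifted to 0-based
  NumericVector Xby = betas[Obs - 1];

  return sum(Xby) - n_Obs * denom;
}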
For me this function is roughly ten times faster than your R function.对我来说,这个函数大约比你的 R 函数快十倍。