Optimizing an R objective function with Rcpp is slower. Why?

Question

I am currently working on a Bayesian method that requires multiple optimisation steps of a multinomial logit model per iteration. I am using optim() to perform those optimisations, with an objective function written in R. Profiling revealed that optim() is the main bottleneck.

After digging around, I found this question, in which they suggest that recoding the objective function with Rcpp could speed up the process. I followed the suggestion and recoded my objective function with Rcpp, but it ended up being slower (about two times slower!).

This was my first time with Rcpp (or anything related to C++), and I was not able to find a way of vectorising the code. Any idea how to make it faster?

Tl;dr: The current implementation of the function in Rcpp is not as fast as vectorised R; how can I make it faster?

A reproducible example:

Define the objective functions in R and Rcpp: the log-likelihood of an intercept-only multinomial model

library(Rcpp)
library(microbenchmark)

llmnl_int <- function(beta, Obs, n_cat) {
  n_Obs     <- length(Obs)
  Xint      <- matrix(c(0, beta), byrow = T, ncol = n_cat, nrow = n_Obs)
  ind       <- cbind(c(1:n_Obs), Obs)
  Xby       <- Xint[ind]
  Xint      <- exp(Xint)
  iota      <- c(rep(1, (n_cat)))
  denom     <- log(Xint %*% iota)
  return(sum(Xby - denom))
}

cppFunction('double llmnl_int_C(NumericVector beta, NumericVector Obs, int n_cat) {

    int n_Obs = Obs.size();
    
    NumericVector betas = (beta.size()+1);
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i-1];
    };
    
    NumericVector Xby = (n_Obs);
    NumericMatrix Xint(n_Obs, n_cat);
    NumericVector denom = (n_Obs);
    for (int i = 0; i < Xby.size(); i++) {
        Xint(i,_) = betas;
        Xby[i] = Xint(i,Obs[i]-1.0);
        Xint(i,_) = exp(Xint(i,_));
        denom[i] = log(sum(Xint(i,_)));
    };

    return sum(Xby - denom);
}')

Compare their efficiency:

## Draw sample from a multinomial distribution
set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))

## Benchmarking
microbenchmark("llmnl_int" = llmnl_int(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               "llmnl_int_C" = llmnl_int_C(beta = c(4,2,1), Obs = mnl_sample, n_cat = 4),
               times = 100)
## Results
# Unit: microseconds
#         expr     min       lq     mean   median       uq     max neval
#    llmnl_int  76.809  78.6615  81.9677  79.7485  82.8495 124.295   100
#  llmnl_int_C 155.405 157.7790 161.7677 159.2200 161.5805 201.655   100

Now calling them in optim:

## Benchmarking with optim
microbenchmark("llmnl_int" = optim(c(4,2,1), llmnl_int, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
               "llmnl_int_C" = optim(c(4,2,1), llmnl_int_C, Obs = mnl_sample, n_cat = 4, method = "BFGS", hessian = T, control = list(fnscale = -1)),
               times = 100)
## Results
# Unit: milliseconds
#         expr      min       lq     mean   median       uq      max neval
#    llmnl_int 12.49163 13.26338 15.74517 14.12413 18.35461 26.58235   100
#  llmnl_int_C 25.57419 25.97413 28.05984 26.34231 30.44012 37.13442   100

I was somewhat surprised that the vectorised implementation in R was faster. Could a more efficient implementation in Rcpp (say, with RcppArmadillo) produce any gains? Is it a better idea to recode everything in Rcpp using a C++ optimiser?

Answer 1

In general, if you are able to use vectorized functions, you will find them to be (almost) as fast as running your code directly in Rcpp. This is because many vectorized functions in R (almost all vectorized functions in base R) are written in C, C++ or Fortran, so there is often little to gain.

That said, there are improvements to be gained in both your R and your Rcpp code. Optimization comes from carefully studying the code and removing unnecessary steps (memory allocation, sums, etc.).

Let's start with the Rcpp code optimization.

In your case the main optimization is to remove unnecessary matrix and vector calculations. The code, in essence, does the following:

1) Shift beta (prepend a 0 for the reference category).
2) Calculate the log of the sum of exp(shifted beta) [log-sum-exp].
3) Use Obs as an index into the shifted beta and sum the selected values over all observations.
4) Subtract the log-sum-exp.
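In symbols, a sketch of the algebra behind these steps (the notation b, y_i, n and K is introduced here for illustration only; K corresponds to n_cat, n to the number of observations, and y_i to the entries of Obs):

```latex
\begin{aligned}
b &= (0, \beta_1, \dots, \beta_{K-1}), \\
\ell(\beta) &= \sum_{i=1}^{n} \Bigl( b_{y_i} - \log \sum_{k=1}^{K} e^{b_k} \Bigr)
            = \sum_{i=1}^{n} b_{y_i} \;-\; n \log \sum_{k=1}^{K} e^{b_k}.
\end{aligned}
```

Because the log-sum-exp term does not depend on i, it can be computed once and multiplied by n instead of being recomputed for every observation.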

Using this observation we can reduce your code to 2 for-loops. Note that sum is simply another for-loop (more or less: for(i = 0; i < max; i++){ sum += x[i] }), so avoiding the sums can speed up one's code further (in most situations this is unnecessary optimization!). In addition, your input Obs is an integer vector, so we can further optimize the code by using the IntegerVector type to avoid casting the double elements to integer values (credit to Ralf Stubner's answer).

cppFunction('double llmnl_int_C_v2(NumericVector beta, IntegerVector Obs, int n_cat)
 {

    int n_Obs = Obs.size();

    NumericVector betas = (beta.size()+1);
    //1: shift beta
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i-1];
    };
    //2: Calculate log sum only once:
    double expBetas_log_sum = log(sum(exp(betas)));
    // pre allocate sum
    double ll_sum = 0;
    
    //3: Use n_Obs, to avoid calling Xby.size() every time 
    for (int i = 0; i < n_Obs; i++) {
        ll_sum += betas[Obs[i] - 1]; // integer index, no cast to double
    };
    //4: Use that we know denom is the same for all i:
    ll_sum = ll_sum - expBetas_log_sum * n_Obs;
    return ll_sum;
}')

Note that I have removed quite a few memory allocations and unnecessary calculations from the for-loop. I have also used the fact that denom is the same for all iterations, and simply multiplied it by the number of observations for the final result.
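As a sanity check of that simplification outside of R, here is a minimal standalone C++ sketch (no Rcpp; the function names loglik_naive and loglik_simplified and the use of std::vector are assumptions made for illustration, not part of the code above). It compares recomputing the denominator per observation with computing it once:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive version: recompute the log-sum-exp denominator for every observation.
// obs holds 1-based category labels, as in the R code.
double loglik_naive(const std::vector<double>& beta, const std::vector<int>& obs) {
    std::vector<double> betas(beta.size() + 1, 0.0);  // reference category fixed at 0
    for (std::size_t k = 0; k < beta.size(); ++k) betas[k + 1] = beta[k];

    double ll = 0.0;
    for (int y : obs) {
        double sum_exp = 0.0;
        for (double b : betas) sum_exp += std::exp(b);
        ll += betas[y - 1] - std::log(sum_exp);
    }
    return ll;
}

// Simplified version: the denominator is identical for every observation,
// so compute it once and multiply by the number of observations.
double loglik_simplified(const std::vector<double>& beta, const std::vector<int>& obs) {
    std::vector<double> betas(beta.size() + 1, 0.0);
    for (std::size_t k = 0; k < beta.size(); ++k) betas[k + 1] = beta[k];

    double sum_exp = 0.0;
    for (double b : betas) sum_exp += std::exp(b);

    double ll = 0.0;
    for (int y : obs) ll += betas[y - 1];
    return ll - std::log(sum_exp) * static_cast<double>(obs.size());
}
```

Both functions should agree up to floating-point rounding; the second does one pass over the categories and one pass over the observations instead of a nested loop.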

We can perform similar optimizations in your R code, which results in the function below:

llmnl_int_R_v2 <- function(beta, Obs, n_cat) {
    n_Obs <- length(Obs)
    betas <- c(0, beta)
    #note: denom = log(sum(exp(betas)))
    sum(betas[Obs]) - log(sum(exp(betas))) * n_Obs
}

Note that the complexity of the function has been drastically reduced, making it simpler for others to read. Just to be sure that I haven't messed up the code somewhere, let's check that they return the same results:

set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000,size = 1,prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample,1,function(r) which(r == 1))

beta = c(4,2,1)
Obs = mnl_sample 
n_cat = 4
xr <- llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xr2 <- llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc <- llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc2 <- llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
all.equal(c(xr, xr2), c(xc, xc2))
# [1] TRUE

Well, that's a relief.

Performance:

I'll use microbenchmark to illustrate the performance. The optimized functions are fast, so I'll run the functions 1e5 times to reduce the effect of the garbage collector.

microbenchmark("llmml_int_R" = llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmml_int_C" = llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmnl_int_R_v2" = llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               "llmml_int_C_v2" = llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
               times = 1e5)
#Output:
#Unit: microseconds
#           expr     min      lq       mean  median      uq        max neval
#    llmml_int_R 202.701 206.801 288.219673 227.601 334.301  57368.902 1e+05
#    llmml_int_C 250.101 252.802 342.190342 272.001 399.251 112459.601 1e+05
# llmnl_int_R_v2   4.800   5.601   8.930027   6.401   9.702   5232.001 1e+05
# llmml_int_C_v2   5.100   5.801   8.834646   6.700  10.101   7154.901 1e+05

Here we see the same result as before. The new functions are roughly 35x faster (R) and 40x faster (C++) than their first counterparts, judging by the medians. Interestingly enough, the optimized R function is still very slightly faster (about 0.3 microseconds, or 4%) than my optimized C++ function. My best bet is that there is some overhead from the Rcpp package, and that if this were removed the two would perform identically.

Similarly, we can check the performance using optim.

microbenchmark("llmnl_int" = optim(beta, llmnl_int, Obs = mnl_sample, 
                                   n_cat = n_cat, method = "BFGS", hessian = F, 
                                   control = list(fnscale = -1)),
               "llmnl_int_C" = optim(beta, llmnl_int_C, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               "llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               "llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample, 
                                     n_cat = n_cat, method = "BFGS", hessian = F, 
                                     control = list(fnscale = -1)),
               times = 1e3)
#Output:
#Unit: microseconds
#           expr       min        lq      mean    median         uq      max neval
#      llmnl_int 29541.301 53156.801 70304.446 76753.851  83528.101 196415.5  1000
#    llmnl_int_C 36879.501 59981.901 83134.218 92419.551 100208.451 190099.1  1000
# llmnl_int_R_v2   667.802  1253.452  1962.875  1585.101   1984.151  22718.3  1000
# llmnl_int_C_v2   704.401  1248.200  1983.247  1671.151   2033.401  11540.3  1000

Once again the result is the same.

Conclusion:

As a short conclusion, it is worth noting that this is one example where converting your code to Rcpp is not really worth the trouble. This is not always the case, but it is often worth taking a second look at your function to see whether parts of the code perform unnecessary calculations. Especially when one uses built-in vectorized functions, it is often not worth the time to convert the code to Rcpp. Large gains are more likely when the code relies on for-loops that can't easily be vectorized away.

Answer 2

I can think of four potential optimizations over Ralf's and Oliver's answers.

(You should accept their answers, but I just wanted to add my 2 cents.)

1) Use // [[Rcpp::export(rng = false)]] as a comment header to the function in a separate C++ file. This leads to a ~80% speed-up on my machine. (This is the most important suggestion of the 4.)

2) Prefer cmath when possible. (In this case, it doesn't seem to make a difference.)

3) Avoid allocation whenever possible, e.g. don't shift beta into a new vector.

4) Stretch goal: use SEXP parameters rather than Rcpp vectors. (Left as an exercise to the reader.) Rcpp vectors are very thin wrappers, but they're still wrappers and there is a small overhead.

These suggestions wouldn't be important if not for the fact that you're calling the function in a tight loop in optim. So any overhead is very important.

Bench:

microbenchmark("llmnl_int_R_v1" = optim(beta, llmnl_int, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_C_v3" = optim(beta, llmnl_int_C_v3, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             "llmnl_int_C_v4" = optim(beta, llmnl_int_C_v4, Obs = mnl_sample, 
                                      n_cat = n_cat, method = "BFGS", hessian = F, 
                                      control = list(fnscale = -1)),
             times = 1000)


Unit: microseconds
           expr      min         lq       mean     median         uq        max neval cld
 llmnl_int_R_v1 9480.780 10662.3530 14126.6399 11359.8460 18505.6280 146823.430  1000   c
 llmnl_int_R_v2  697.276   735.7735  1015.8217   768.5735   810.6235  11095.924  1000  b 
 llmnl_int_C_v2  997.828  1021.4720  1106.0968  1031.7905  1078.2835  11222.803  1000  b 
 llmnl_int_C_v3  284.519   295.7825   328.5890   304.0325   328.2015   9647.417  1000 a  
 llmnl_int_C_v4  245.650   256.9760   283.9071   266.3985   299.2090   1156.448  1000 a  

v3 is Oliver's answer with rng = false. v4 additionally includes suggestions #2 and #3.

The function:

#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export(rng = false)]]
double llmnl_int_C_v4(NumericVector beta, IntegerVector Obs, int n_cat) {

  int n_Obs = Obs.size();
  //2: Calculate log sum only once:
  // double expBetas_log_sum = log(sum(exp(betas)));
  double expBetas_log_sum = 1.0; // std::exp(0)
  for (int i = 1; i < n_cat; i++) {
    expBetas_log_sum += std::exp(beta[i-1]);
  };
  expBetas_log_sum = std::log(expBetas_log_sum);

  double ll_sum = 0;
  //3: Use n_Obs, to avoid calling Xby.size() every time 
  for (int i = 0; i < n_Obs; i++) {
    if(Obs[i] == 1L) continue;
    ll_sum += beta[Obs[i]-2L];
  };
  //4: Use that we know denom is the same for all I:
  ll_sum = ll_sum - expBetas_log_sum * n_Obs;
  return ll_sum;
}
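To see that skipping the shifted copy does not change the result, here is a standalone C++ sketch (no Rcpp; loglik_shifted and loglik_no_alloc are hypothetical names, and std::vector stands in for the Rcpp vector types) that mirrors the indexing used in v4 and compares it against a version that builds (0, beta) explicitly:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference version: allocates the shifted vector (0, beta) on every call.
// obs holds 1-based category labels.
double loglik_shifted(const std::vector<double>& beta, const std::vector<int>& obs) {
    std::vector<double> betas(beta.size() + 1, 0.0);
    for (std::size_t k = 0; k < beta.size(); ++k) betas[k + 1] = beta[k];

    double sum_exp = 0.0;
    for (double b : betas) sum_exp += std::exp(b);

    double ll = 0.0;
    for (int y : obs) ll += betas[y - 1];
    return ll - std::log(sum_exp) * static_cast<double>(obs.size());
}

// Allocation-free version: the reference category has coefficient 0, so it
// contributes exp(0) = 1 to the denominator and 0 to the numerator sum.
double loglik_no_alloc(const std::vector<double>& beta, const std::vector<int>& obs) {
    double sum_exp = 1.0;  // exp(0) for the reference category
    for (double b : beta) sum_exp += std::exp(b);

    double ll = 0.0;
    for (int y : obs) {
        if (y == 1) continue;  // reference category adds nothing
        ll += beta[y - 2];     // category y maps to beta[y - 2]
    }
    return ll - std::log(sum_exp) * static_cast<double>(obs.size());
}
```

The two functions should return the same value up to rounding. Whether the saved allocation matters in practice depends on how often the function is called; in the optim benchmark above it is called many times per fit.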

Answer 3

Your C++ function can be made faster using the following observations. At least the first might also be used with your R function:

1) The way you calculate denom[i] is the same for every i. It therefore makes sense to use a double denom and do this calculation only once. I also factor out subtracting this common term at the end.
2) Your observations are actually an integer vector on the R side, and you are using them as integers in C++ as well. Using an IntegerVector to begin with makes a lot of casting unnecessary.
3) You can index a NumericVector using an IntegerVector in C++ as well. I am not sure if this helps performance, but it makes the code a bit shorter.
4) Some more changes, which are related more to style than to performance.

Result:

double llmnl_int_C(NumericVector beta, IntegerVector Obs, int n_cat) {

    int n_Obs = Obs.size();

    NumericVector betas(beta.size()+1);
    for (int i = 1; i < n_cat; ++i) {
        betas[i] = beta[i-1];
    };

    double denom = log(sum(exp(betas)));
    NumericVector Xby = betas[Obs - 1];

    return sum(Xby) - n_Obs * denom;
}

For me this function is roughly ten times faster than your R function.
