Rcpp - generate multiple random observations from custom distribution

This question is related to a previous one on calling functions within functions in Rcpp.

I need to generate a large number of random draws from a custom distribution, in a way similar to rnorm() or rbinom(), with the additional complication that my function produces a vector output.

As a solution, I thought about defining a function that generates one observation from the custom distribution, and then a main function that calls the generating function n times in a for loop. Below is a much simplified working version of the code:

#include <Rcpp.h>
using namespace Rcpp;

// generating function: returns one draw as a length-2 vector
NumericVector gen(NumericVector A, NumericVector B){
  NumericVector out = no_init_vector(2);
  out[0] = R::runif(A[0],A[1]) + R::runif(B[0],B[1]);
  out[1] = R::runif(A[0],A[1]) - R::runif(B[0],B[1]);
  return out;
}

// draw n observations, one row per call to gen()
// [[Rcpp::export]]
NumericMatrix rdraw(int n, NumericVector A, NumericVector B){
  NumericMatrix out = no_init_matrix(n, 2);
  for (int i = 0; i < n; ++i) {
    out(i,_) = gen(A, B);
  }
  return out;
}

I am looking for ways to speed up the draws. My questions are: is there a more efficient alternative to the for loop? Would parallelization help in this case?

Thank you for any help!

There are different ways to speed this up:

  1. Declare gen() inline, reducing the number of function calls.
  2. Use Rcpp::runif instead of a loop with R::runif to remove even more function calls.
  3. Use a faster RNG that allows for parallel execution.

Here are points 1 and 2:

#include <Rcpp.h>
using namespace Rcpp;

// generating function: returns one draw as a length-2 vector
inline NumericVector gen(NumericVector A, NumericVector B){
  NumericVector out = no_init_vector(2);
  out[0] = R::runif(A[0],A[1]) + R::runif(B[0],B[1]);
  out[1] = R::runif(A[0],A[1]) - R::runif(B[0],B[1]);
  return out;
}

// draw n observations, one row per call to gen()
// [[Rcpp::export]]
NumericMatrix rdraw(int n, NumericVector A, NumericVector B){
  NumericMatrix out = no_init_matrix(n, 2);
  for (int i = 0; i < n; ++i) {
    out(i,_) = gen(A, B);
  }
  return out;
}

// draw n observations, one column at a time with vectorized Rcpp::runif
// [[Rcpp::export]]
NumericMatrix rdraw2(int n, NumericVector A, NumericVector B){
  NumericMatrix out = no_init_matrix(n, 2);
  out(_, 0) = Rcpp::runif(n, A[0],A[1]) + Rcpp::runif(n, B[0],B[1]);
  out(_, 1) = Rcpp::runif(n, A[0],A[1]) - Rcpp::runif(n, B[0],B[1]);
  return out;
}

/*** R
set.seed(42)
system.time(rdraw(1e7, c(0,2), c(1,3)))
system.time(rdraw2(1e7, c(0,2), c(1,3)))
*/


> set.seed(42)

> system.time(rdraw(1e7, c(0,2), c(1,3)))
   user  system elapsed 
  1.576   0.034   1.610 

> system.time(rdraw2(1e7, c(0,2), c(1,3)))
   user  system elapsed 
  0.458   0.139   0.598 

For comparison, your original code took about 1.8 s for 10^7 draws. For point 3, I am adapting code from the parallel vignette of my dqrng package:

#include <Rcpp.h>
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <functional>  // std::bind, std::ref

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
Rcpp::NumericMatrix rdraw3(int n, Rcpp::NumericVector A, Rcpp::NumericVector B, int seed, int ncores) {
  dqrng::uniform_distribution distA(A(0), A(1));
  dqrng::uniform_distribution distB(B(0), B(1));
  dqrng::xoshiro256plus rng(seed);
  Rcpp::NumericMatrix res = Rcpp::no_init_matrix(n, 2);
  RcppParallel::RMatrix<double> output(res);  // wrapper used to write to res from the threads

  #pragma omp parallel num_threads(ncores)
  {
    dqrng::xoshiro256plus lrng(rng);      // make thread local copy of rng
    lrng.jump(omp_get_thread_num() + 1);  // advance rng by 1 ... ncores jumps
    auto genA = std::bind(distA, std::ref(lrng));
    auto genB = std::bind(distB, std::ref(lrng));

    #pragma omp for
    for (int i = 0; i < n; ++i) {
      output(i, 0) = genA() + genB();
      output(i, 1) = genA() - genB();
    }
  }
  return res;
}

/*** R
system.time(rdraw3(1e7, c(0,2), c(1,3), 42, 2))
*/


> system.time(rdraw3(1e7, c(0,2), c(1,3), 42, 2))
   user  system elapsed 
  0.276   0.025   0.151 
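
The thread-local copy followed by jump() in rdraw3 gives every thread its own stretch of the generator's sequence. The idea can be sketched in plain standard C++ without dqrng or OpenMP. Everything in this sketch is an assumption made for illustration: std::mt19937_64 replaces xoshiro256plus, discard() stands in for jump(), the workers run one after another instead of on real threads, and fill_chunk and draw_chunks are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fill the part of `out` that belongs to `worker`.
void fill_chunk(std::vector<double>& out, const std::mt19937_64& base,
                std::size_t worker, std::size_t nworkers,
                double lo, double hi) {
  assert(nworkers > 0 && worker < nworkers);
  const std::size_t n = out.size();
  const std::size_t begin = n * worker / nworkers;
  const std::size_t end = n * (worker + 1) / nworkers;

  std::mt19937_64 local(base);  // worker-local copy of the base engine
  // Skip ahead to this worker's stretch; stand-in for jump() in rdraw3.
  // Assumption: 4 engine calls per value is a generous upper bound, so the
  // stretches used by different workers do not overlap.
  local.discard(4ULL * n * (worker + 1));
  std::uniform_real_distribution<double> dist(lo, hi);
  for (std::size_t i = begin; i < end; ++i) {
    out[i] = dist(local);
  }
}

// Hypothetical driver: runs the workers sequentially, optionally in reverse
// order, to show that the result does not depend on the order of execution.
std::vector<double> draw_chunks(std::size_t n, unsigned seed,
                                std::size_t nworkers, bool reversed) {
  std::vector<double> out(n);
  const std::mt19937_64 base(seed);
  for (std::size_t k = 0; k < nworkers; ++k) {
    const std::size_t worker = reversed ? nworkers - 1 - k : k;
    fill_chunk(out, base, worker, nworkers, 0.0, 2.0);
  }
  return out;
}
```

Because each worker derives its state only from the seed and its own index, draw_chunks(1000, 42, 4, false) and draw_chunks(1000, 42, 4, true) return the same vector. That is the property that keeps a parallel run reproducible for a fixed seed and number of workers.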

So with a faster RNG and moderate parallelism, we can gain an order of magnitude in execution time. The individual draws will differ from those produced with R's RNG, of course, but summary statistics should agree up to sampling error.
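
As a rough sanity check on the summary statistics: with A = (0, 2) and B = (1, 3) the uniform means are 1 and 2, so the first column should average about 1 + 2 = 3 and the second about 1 - 2 = -1, whichever generator is used. The sketch below is plain standard C++ rather than Rcpp. std::mt19937_64 and std::uniform_real_distribution are stand-ins assumed for illustration, not the generators used above, and column_means is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>

// Hypothetical helper: draw n rows (a + b, a - b) with a ~ U(a0, a1) and
// b ~ U(b0, b1), and return the sample means of the two columns.
std::pair<double, double> column_means(int n, double a0, double a1,
                                       double b0, double b1,
                                       std::uint64_t seed) {
  assert(n > 0);
  std::mt19937_64 rng(seed);  // stand-in RNG, not the one used in rdraw3
  std::uniform_real_distribution<double> distA(a0, a1);
  std::uniform_real_distribution<double> distB(b0, b1);
  double sum0 = 0.0;
  double sum1 = 0.0;
  for (int i = 0; i < n; ++i) {
    // four independent uniforms per row, as in gen()
    sum0 += distA(rng) + distB(rng);
    sum1 += distA(rng) - distB(rng);
  }
  return {sum0 / n, sum1 / n};
}
```

Calling column_means(1000000, 0, 2, 1, 3, seed) with two different seeds gives different draws, but both pairs of means should land close to 3 and -1. The standard error of each mean is below 0.001 at this sample size.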

