简体   繁体   English

当样本大小和概率变化时,进行有效的多项式抽样

[英]Efficient multinomial sampling when sample size and probability vary

This question pertains to efficient sampling from multinomial distributions with varying sample sizes and probabilities. 该问题涉及具有不同样本大小和概率的多项分布的有效采样。 Below I describe the approach I have used, but wonder whether it can be improved with some intelligent vectorisation. 下面我描述了我使用的方法,但想知道它是否可以通过一些智能矢量化来改进。

I'm simulating dispersal of organisms amongst multiple populations. 我正在模拟多个群体中的生物传播。 Individuals from population j disperse to population i with probability p[i, j] . 来自人口j个体以概率p[i, j]分散到种群i Given an initial abundance of 10 in population 1, and probabilities of dispersal c(0.1, 0.3, 0.6) to populations 1, 2, and 3, respectively, we can simulate the dispersal process with rmultinom : 鉴于种群1的初始丰度为10,分别为种群1,2和3的扩散c(0.1, 0.3, 0.6)概率,我们可以用rmultinom模拟分散过程:

set.seed(1)
rmultinom(1, 10, c(0.1, 0.3, 0.6))

#      [,1]
# [1,]    0
# [2,]    3
# [3,]    7

We can extend this to consider n source populations: 我们可以扩展这个来考虑n源群:

set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n-1)), 1)))
X <- sample(100, n)

Above, p is a matrix of probabilities of moving from one population (column) to another (row), and X is a vector of initial population sizes. 上面, p是从一个群体(列)移动到另一个群体(行)的概率矩阵, X是初始种群大小的向量。 The number of individuals dispersing between each pair of populations (and those remaining where they are) can now be simulated with: 现在可以模拟分散在每对种群(以及它们所在的种群)之间的个体数量:

sapply(seq_len(ncol(p)), function(i) {
  rmultinom(1, X[i], p[, i])  
})

#      [,1] [,2] [,3]
# [1,]   19   42   11
# [2,]    8   18   43
# [3,]   68    6    8

where the value of the element at the i th row and j th column is the number of individuals moving from population j to population i . 其中第i行和第j列的元素值是从群体j移动到群体i的个体数量。 The rowSums of this matrix give the new population sizes. 此矩阵的rowSums给出了新的种群大小。

I'd like to repeat this many times, with constant probability matrix but with varying (pre-defined) initial abundances. 我想用常数概率矩阵重复这一次,但具有不同的(预定义的)初始丰度。 The following small example achieves this, but is inefficient with larger problems. 以下小例子实现了这一点,但是对于更大的问题效率低下。 The resulting matrix gives the post-dispersal abundance in each of three populations for each of 5 simulations for which population had different initial abundances. 得到的基质在5个模拟中的每一个中给出三个群体中的每一个的后扩散丰度,其中群体具有不同的初始丰度。

X <- matrix(sample(100, n*5, replace=TRUE), nrow=n)

apply(sapply(apply(X, 2, function(x) {
  lapply(seq_len(ncol(p)), function(i) {
    rmultinom(1, x[i], p[, i])  
  })
}), function(x) do.call(cbind, x), simplify='array'), 3, rowSums)

#      [,1] [,2] [,3] [,4] [,5]
# [1,]   79   67   45   28   74
# [2,]   92   99   40   19   52
# [3,]   51   45   16   21   35

Is there a way to better vectorise this problem? 有没有办法更好地矢量化这个问题?

This is a RcppGSL implementation of multi-multinomial. 这是多多项式的RcppGSL实现。 However, it requires you to install gsl independently....which may not be very practical. 但是,它需要您独立安装gsl ....这可能不太实用。

// [[Rcpp::depends(RcppGSL)]]

#include <RcppGSL.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <unistd.h>            // getpid

Rcpp::IntegerVector rmn(unsigned int N, Rcpp::NumericVector p, gsl_rng* r){

    size_t K = p.size();

    Rcpp::IntegerVector x(K);
    gsl_ran_multinomial(r, K, N, p.begin(), (unsigned int *) x.begin());
    return x;             // return results vector
}

Rcpp::IntegerVector gsl_mmm_1(Rcpp::IntegerVector N, Rcpp::NumericMatrix P, gsl_rng* r){
    size_t K = N.size();
    int i;
    Rcpp::IntegerVector x(K);
    for(i=0; i<K; i++){
        x += rmn(N[i], P(Rcpp::_, i), r);
    }
    return x;
}

// [[Rcpp::export]]
Rcpp::IntegerMatrix gsl_mmm(Rcpp::IntegerMatrix X_, Rcpp::NumericMatrix P){
    int j;
    gsl_rng * r = gsl_rng_alloc (gsl_rng_mt19937);
    long seed = rand()/(((double)RAND_MAX + 1)/10000000) * getpid();
    gsl_rng_set (r, seed);
    Rcpp::IntegerMatrix X(X_.nrow(), X_.ncol());
    for(j=0; j<X.ncol(); j++){
        X(Rcpp::_, j) = gsl_mmm_1(X_(Rcpp::_,j), P, r);
    }
    gsl_rng_free (r);
    return X;
}

I also compare it with a pure R implementation and jbaums's version 我还将它与纯R实现和jbaums的版本进行比较

library(Rcpp)
library(microbenchmark)
sourceCpp("gsl.cpp")

P = matrix(c(c(0.1,0.2,0.7),c(0.3,0.3,0.4),c(0.5,0.3,0.2)),nc=3)
X = matrix(c(c(30,40,30),c(20,40,40)), nc=2)

mmm = function(X, P){
    n = ncol(X)
    p = nrow(X)
    Reduce("+", lapply(1:p, function(j) {
        Y = matrix(0,p,n)
        for(i in 1:n) Y[,i] = rmultinom(1, X[j,i], P[,j])
        Y
    }))
}

jbaums = function(X,P){
    apply(sapply(apply(X, 2, function(x) {
      lapply(seq_len(ncol(P)), function(i) {
        rmultinom(1, x[i], P[, i])
      })
    }), function(x) do.call(cbind, x), simplify='array'), nrow(X), rowSums)
}
microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))

and this is the result 这就是结果

> microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))
Unit: microseconds
          expr     min       lq  median       uq     max neval
  jbaums(X, P) 165.832 172.8420 179.185 187.2810 339.280   100
     mmm(X, P)  60.071  63.5955  67.437  71.5775  92.963   100
 gsl_mmm(X, P)  10.529  11.8800  13.671  14.6220  40.857   100

The gsl version is about 6x faster than my pure R version. gsl版本比纯R版本快6倍。

For example: 例如:

# make the example in Rcpp you mention:
library(Rcpp)
library(inline)
src <- 'Environment stats("package:stats");
Function rmultinom = stats["rmultinom"];
NumericVector some_p(1000, 1.0/1000);
return(rmultinom(1,1, some_p));'

fx <- rcpp(signature(), body=src)

# now compare the two
library(rbenchmark)
benchmark(fx(),rmultinom(1,1,c(1000,1/1000)),replications=10000)

#                            test replications elapsed relative user.self sys.self user.child sys.child
#    1                       fx()        10000   1.126   13.901     1.128        0          0         0
#    2 rmultinom(1, 1, c(1/1000))        10000   0.081    1.000     0.080        0          0         0

I've discovered that the BH package brings boost libraries to the table. 我发现, BH包带来boost库表。 This enables the following, which produces the same output as @RandyLai's gsl_mmm and as the code in my question above. 这将启用以下内容,它产生与@ RandyLai的gsl_mmm相同的输出以及上面问题中的代码。 (I believe enabling c++11 support should make random available without BH .) (我相信启用c ++ 11支持应该在没有BH情况下random可用。)

// [[Rcpp::depends(BH)]]
#include <Rcpp.h>

#include <boost/random.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/discrete_distribution.hpp>

using namespace Rcpp;

typedef boost::mt19937 RNGType;
RNGType rng(123);


NumericVector rowSumsC(IntegerMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  IntegerVector out(nrow);

  for (int i = 0; i < nrow; i++) {
    double total = 0;
    for (int j = 0; j < ncol; j++) {
      total += x(i, j);
    }
    out[i] = total;
  }
  return wrap(out);
}

// [[Rcpp::export]]
IntegerMatrix rmm(IntegerMatrix X, NumericMatrix P) {
  int niter = X.ncol(), nx = X.nrow();
  IntegerMatrix out(nx, niter);
  for (int j = 0; j < niter; j++) {
    IntegerMatrix tmp(nx, nx);
    for (int i = 0; i < nx; i++) {
      for (int n = 0; n < X(i, j); n++) {
        boost::random::discrete_distribution<> dist(P(_, i));
        tmp(dist(rng), i)++;
      }
    }
    out(_, j) = rowSumsC(tmp);
  }
  return out;
}

rowSumsC provided by @hadley, here . rowSumsCrowSumsC提供, 在这里

However, on my machine, this is considerably slower than Randy's gsl_mmm , and indeed slower than my R version when there are many trials. 但是,在我的机器上,这比Randy的gsl_mmm慢得多,而且当有很多试验时,确实比我的R版慢。 I suspect this is due to inefficient coding, but boost's discrete_distribution also performs each multinomial trial individually whereas this process appears vectorised when using gsl . 我怀疑这是由于编码效率低,但是boost的discrete_distribution也单独执行每个多项式试验,而这个过程在使用gsl时会出现矢量化。 I'm new to c++ so not sure whether this can be made more efficient. 我是c ++的新手,所以不确定这是否可以提高效率。

P <- matrix(c(c(0.1, 0.2, 0.7), c(0.3, 0.3, 0.4), c(0.5, 0.3, 0.2)), nc=3)
X <- matrix(c(c(30, 40, 30), c(20, 40, 40)), nc=2)
library(BH)
microbenchmark(jbaums(X, P), rmm(X, P))

# Unit: microseconds
#          expr     min       lq  median       uq     max neval
#  jbaums(X, P) 124.988 129.5065 131.464 133.8735 348.763   100
#     rmm(X, P)  59.031  60.0850  62.043  62.6450 117.459   100

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM