[英]Efficient multinomial sampling when sample size and probability vary
This question pertains to efficient sampling from multinomial distributions with varying sample sizes and probabilities. 该问题涉及具有不同样本大小和概率的多项分布的有效采样。 Below I describe the approach I have used, but wonder whether it can be improved with some intelligent vectorisation. 下面我描述了我使用的方法,但想知道它是否可以通过一些智能矢量化来改进。
I'm simulating dispersal of organisms amongst multiple populations. 我正在模拟多个群体中的生物传播。 Individuals from population j
disperse to population i
with probability p[i, j]
. 来自人口j
个体以概率p[i, j]
分散到种群i
。 Given an initial abundance of 10 in population 1, and probabilities of dispersal c(0.1, 0.3, 0.6)
to populations 1, 2, and 3, respectively, we can simulate the dispersal process with rmultinom
: 鉴于种群1的初始丰度为10,分别为种群1,2和3的扩散c(0.1, 0.3, 0.6)
概率,我们可以用rmultinom
模拟分散过程:
set.seed(1)
rmultinom(1, 10, c(0.1, 0.3, 0.6))
# [,1]
# [1,] 0
# [2,] 3
# [3,] 7
We can extend this to consider n
source populations: 我们可以扩展这个来考虑n
源群:
set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n-1)), 1)))
X <- sample(100, n)
Above, p
is a matrix of probabilities of moving from one population (column) to another (row), and X
is a vector of initial population sizes. 上面, p
是从一个群体(列)移动到另一个群体(行)的概率矩阵, X
是初始种群大小的向量。 The number of individuals dispersing between each pair of populations (and those remaining where they are) can now be simulated with: 现在可以模拟分散在每对种群(以及它们所在的种群)之间的个体数量:
sapply(seq_len(ncol(p)), function(i) {
rmultinom(1, X[i], p[, i])
})
# [,1] [,2] [,3]
# [1,] 19 42 11
# [2,] 8 18 43
# [3,] 68 6 8
where the value of the element at the i
th row and j
th column is the number of individuals moving from population j
to population i
. 其中第i
行和第j
列的元素值是从群体j
移动到群体i
的个体数量。 The rowSums
of this matrix give the new population sizes. 此矩阵的rowSums
给出了新的种群大小。
I'd like to repeat this many times, with constant probability matrix but with varying (pre-defined) initial abundances. 我想用常数概率矩阵重复这一次,但具有不同的(预定义的)初始丰度。 The following small example achieves this, but is inefficient with larger problems. 以下小例子实现了这一点,但是对于更大的问题效率低下。 The resulting matrix gives the post-dispersal abundance in each of three populations for each of 5 simulations for which population had different initial abundances. 得到的基质在5个模拟中的每一个中给出三个群体中的每一个的后扩散丰度,其中群体具有不同的初始丰度。
X <- matrix(sample(100, n*5, replace=TRUE), nrow=n)
apply(sapply(apply(X, 2, function(x) {
lapply(seq_len(ncol(p)), function(i) {
rmultinom(1, x[i], p[, i])
})
}), function(x) do.call(cbind, x), simplify='array'), 3, rowSums)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 79 67 45 28 74
# [2,] 92 99 40 19 52
# [3,] 51 45 16 21 35
Is there a way to better vectorise this problem? 有没有办法更好地矢量化这个问题?
This is a RcppGSL implementation of multi-multinomial. 这是多多项式的RcppGSL实现。 However, it requires you to install gsl independently....which may not be very practical. 但是,它需要您独立安装gsl ....这可能不太实用。
// [[Rcpp::depends(RcppGSL)]]
#include <RcppGSL.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <unistd.h> // getpid
Rcpp::IntegerVector rmn(unsigned int N, Rcpp::NumericVector p, gsl_rng* r){
size_t K = p.size();
Rcpp::IntegerVector x(K);
gsl_ran_multinomial(r, K, N, p.begin(), (unsigned int *) x.begin());
return x; // return results vector
}
Rcpp::IntegerVector gsl_mmm_1(Rcpp::IntegerVector N, Rcpp::NumericMatrix P, gsl_rng* r){
size_t K = N.size();
int i;
Rcpp::IntegerVector x(K);
for(i=0; i<K; i++){
x += rmn(N[i], P(Rcpp::_, i), r);
}
return x;
}
// [[Rcpp::export]]
Rcpp::IntegerMatrix gsl_mmm(Rcpp::IntegerMatrix X_, Rcpp::NumericMatrix P){
int j;
gsl_rng * r = gsl_rng_alloc (gsl_rng_mt19937);
long seed = rand()/(((double)RAND_MAX + 1)/10000000) * getpid();
gsl_rng_set (r, seed);
Rcpp::IntegerMatrix X(X_.nrow(), X_.ncol());
for(j=0; j<X.ncol(); j++){
X(Rcpp::_, j) = gsl_mmm_1(X_(Rcpp::_,j), P, r);
}
gsl_rng_free (r);
return X;
}
I also compare it with a pure R implementation and jbaums's version 我还将它与纯R实现和jbaums的版本进行比较
library(Rcpp)
library(microbenchmark)
sourceCpp("gsl.cpp")
P = matrix(c(c(0.1,0.2,0.7),c(0.3,0.3,0.4),c(0.5,0.3,0.2)),nc=3)
X = matrix(c(c(30,40,30),c(20,40,40)), nc=2)
mmm = function(X, P){
n = ncol(X)
p = nrow(X)
Reduce("+", lapply(1:p, function(j) {
Y = matrix(0,p,n)
for(i in 1:n) Y[,i] = rmultinom(1, X[j,i], P[,j])
Y
}))
}
jbaums = function(X,P){
apply(sapply(apply(X, 2, function(x) {
lapply(seq_len(ncol(P)), function(i) {
rmultinom(1, x[i], P[, i])
})
}), function(x) do.call(cbind, x), simplify='array'), nrow(X), rowSums)
}
microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))
and this is the result 这就是结果
> microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))
Unit: microseconds
expr min lq median uq max neval
jbaums(X, P) 165.832 172.8420 179.185 187.2810 339.280 100
mmm(X, P) 60.071 63.5955 67.437 71.5775 92.963 100
gsl_mmm(X, P) 10.529 11.8800 13.671 14.6220 40.857 100
The gsl version is about 6x faster than my pure R version. gsl版本比纯R版本快6倍。
For example: 例如:
# make the example in Rcpp you mention:
library(Rcpp)
library(inline)
src <- 'Environment stats("package:stats");
Function rmultinom = stats["rmultinom"];
NumericVector some_p(1000, 1.0/1000);
return(rmultinom(1,1, some_p));'
fx <- rcpp(signature(), body=src)
# now compare the two
library(rbenchmark)
benchmark(fx(),rmultinom(1,1,c(1000,1/1000)),replications=10000)
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 fx() 10000 1.126 13.901 1.128 0 0 0
# 2 rmultinom(1, 1, c(1/1000)) 10000 0.081 1.000 0.080 0 0 0
I've discovered that the BH
package brings boost
libraries to the table. 我发现, BH
包带来boost
库表。 This enables the following, which produces the same output as @RandyLai's gsl_mmm
and as the code in my question above. 这将启用以下内容,它产生与@ RandyLai的gsl_mmm
相同的输出以及上面问题中的代码。 (I believe enabling c++11 support should make random
available without BH
.) (我相信启用c ++ 11支持应该在没有BH
情况下random
可用。)
// [[Rcpp::depends(BH)]]
#include <Rcpp.h>
#include <boost/random.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/discrete_distribution.hpp>
using namespace Rcpp;
typedef boost::mt19937 RNGType;
RNGType rng(123);
NumericVector rowSumsC(IntegerMatrix x) {
int nrow = x.nrow(), ncol = x.ncol();
IntegerVector out(nrow);
for (int i = 0; i < nrow; i++) {
double total = 0;
for (int j = 0; j < ncol; j++) {
total += x(i, j);
}
out[i] = total;
}
return wrap(out);
}
// [[Rcpp::export]]
IntegerMatrix rmm(IntegerMatrix X, NumericMatrix P) {
int niter = X.ncol(), nx = X.nrow();
IntegerMatrix out(nx, niter);
for (int j = 0; j < niter; j++) {
IntegerMatrix tmp(nx, nx);
for (int i = 0; i < nx; i++) {
for (int n = 0; n < X(i, j); n++) {
boost::random::discrete_distribution<> dist(P(_, i));
tmp(dist(rng), i)++;
}
}
out(_, j) = rowSumsC(tmp);
}
return out;
}
rowSumsC
provided by @hadley, here . rowSumsC
由rowSumsC
提供, 在这里 。
However, on my machine, this is considerably slower than Randy's gsl_mmm
, and indeed slower than my R version when there are many trials. 但是,在我的机器上,这比Randy的gsl_mmm
慢得多,而且当有很多试验时,确实比我的R版慢。 I suspect this is due to inefficient coding, but boost's discrete_distribution
also performs each multinomial trial individually whereas this process appears vectorised when using gsl
. 我怀疑这是由于编码效率低,但是boost的discrete_distribution
也单独执行每个多项式试验,而这个过程在使用gsl
时会出现矢量化。 I'm new to c++ so not sure whether this can be made more efficient. 我是c ++的新手,所以不确定这是否可以提高效率。
P <- matrix(c(c(0.1, 0.2, 0.7), c(0.3, 0.3, 0.4), c(0.5, 0.3, 0.2)), nc=3)
X <- matrix(c(c(30, 40, 30), c(20, 40, 40)), nc=2)
library(BH)
microbenchmark(jbaums(X, P), rmm(X, P))
# Unit: microseconds
# expr min lq median uq max neval
# jbaums(X, P) 124.988 129.5065 131.464 133.8735 348.763 100
# rmm(X, P) 59.031 60.0850 62.043 62.6450 117.459 100
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.