Efficient way to sample from different probability vectors

Question

I'm looking for a more efficient way to sample from a list of integers 1:n, multiple times, where the probability vector (also of length n) is different each time. For 20 trials with n = 10, I know one can do it like this:

probs <- matrix(runif(200), nrow = 20)
answers <- numeric(20)
for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,])

But that calls sample once per trial (20 times here) just to get a single number each time, so it is presumably not the fastest way. Speed would be helpful, as the code will be doing this many times.

Many thanks!

Luke

Edit: Big thanks to Roman, whose idea about benchmarking helped me find a good solution. I've now moved this to the answer.

Answer 1

Just for fun, I tried two more versions. On what scale are you doing this sampling? I think all of these are pretty fast and more or less equivalent (I haven't included the creation of probs for your solution). Would love to see others take a shot at this.

library(rbenchmark)
benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))

    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000    0.41    1.000      0.42        0         NA        NA
2  roman         1000    0.47    1.146      0.46        0         NA        NA
3 roman2         1000    0.47    1.146      0.44        0         NA        NA

Answer 2

Here's another approach that I found. It's fast, but not as fast as simply calling sample many times with a for loop. I initially thought it was very good, but I was using benchmark() incorrectly.

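luke2 <- function(probs) {
  # probs: a matrix of probability vectors, each in its own row
  probs <- probs / rowSums(probs)        # normalise each row to sum to 1
  probs <- t(apply(probs, 1, cumsum))    # row-wise cumulative probabilities
  # one uniform draw per row; count the cumulative values that fall below it
  answer <- rowSums(probs - runif(nrow(probs)) < 0) + 1
  return(answer)
}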

Here's how it works: picture the probabilities as lines of various lengths laid end to end on a number line from 0 to 1. The big probabilities will take up more of the number line than the small ones. You could then pick the outcome by picking a random point on the number line; the big probabilities are more likely to be chosen. The advantage of this approach is that you can roll all the random numbers needed in one call of runif(), instead of calling sample over and over as in the functions luke, roman and roman2. However, it looks like the extra data processing slows it down, and the cost more than offsets this benefit.
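The number-line picture can be checked by hand on a single probability vector. This is only a small sketch: the vector p and the helper pick() are made up for illustration, and pick() applies the same "count the cumulative values below the draw, then add 1" rule that luke2 uses, with fixed points in place of runif() so the results are reproducible.

```r
p <- c(0.1, 0.2, 0.3, 0.4)   # illustrative probabilities, already summing to 1
breaks <- cumsum(p)          # right-hand end of each segment: 0.1 0.3 0.6 1.0

# Hypothetical helper: map a point u on the number line to the segment it lands in
pick <- function(u, breaks) sum(breaks < u) + 1

pick(0.05, breaks)   # 1: falls in the first segment, (0, 0.1]
pick(0.45, breaks)   # 3: falls in the third segment, (0.3, 0.6]
pick(0.95, breaks)   # 4: falls in the last segment, (0.6, 1]
```

The third segment is three times as long as the first, so a uniformly random point lands in it three times as often.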

library(rbenchmark)
probs <- matrix(runif(2000), ncol = 10)
answers <- numeric(200)

benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          luke2 = luke2(probs),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))

    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000   0.171    1.000     0.166    0.005          0         0
2  luke2         1000   0.529    3.094     0.518    0.012          0         0
3  roman         1000   1.564    9.146     1.513    0.052          0         0
4 roman2         1000   0.225    1.316     0.213    0.012          0         0

For some reason, apply() does very badly as you add more rows. I don't understand why, because I thought it was a wrapper for for(), so roman() should perform similarly to luke(). Note, though, that in the benchmark above luke and roman2 still draw only 20 samples per replication, while luke2 and roman process all 200 rows of probs, so the timings are not directly comparable.
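A like-for-like comparison would have every method draw one sample per row of probs. The following is a rough sketch of how that could be set up, assuming base R's system.time() in place of rbenchmark; the helper names luke_loop, roman_apply and time_it are hypothetical, the seed is arbitrary, and no timings are shown because they depend on the machine.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
probs <- matrix(runif(2000), ncol = 10)    # 200 probability vectors of length 10

# Hypothetical helper: for loop, one sample() call per row of probs
luke_loop <- function(probs) {
  answers <- numeric(nrow(probs))
  for (i in seq_len(nrow(probs))) {
    answers[i] <- sample(ncol(probs), 1, prob = probs[i, ])
  }
  answers
}

# Hypothetical helper: apply(), one sample() call per row of probs
roman_apply <- function(probs) {
  apply(probs, MARGIN = 1, FUN = function(x) sample(length(x), 1, prob = x))
}

# Hypothetical helper: elapsed seconds for `reps` repetitions of f(probs)
time_it <- function(f, reps = 100) {
  unname(system.time(for (r in seq_len(reps)) f(probs))["elapsed"])
}

timings <- c(luke = time_it(luke_loop), roman = time_it(roman_apply))
print(timings)
```

Both helpers take their iteration count from nrow(probs), so changing the size of probs changes the work done by each method equally.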

Answer 1
2 2013-05-18 07:13:41

Answer 2
1 2013-05-20 06:36:15
