R: randomly select a combination

Question

Let's say I have a vector of numbers x that contains 10 numbers. I want to select a subset of N numbers, M times, and put the subsets into a list object. How can I ensure that each subset I pick is different from every subset already in the list? Note that order does not matter, so c(1,0,3) is the same as c(3,0,1).

x = seq(1,10,1)

I can do this with combn(x,N), but when x contains 10k or more elements, generating everything with combn and then randomly selecting from the result is computationally infeasible.

To phrase the question another way: I want to sample the output of combn(x,N) randomly without replacement. Is that possible without calling combn first?

Any ideas?

Answer 1

"I want to sample the output of combn(x,N) randomly without replacement. Is it possible without calling combn first?"

I don't think so, not with the current state of 32-bit integers (and bit64, as good as it is, doesn't catch everything).

Case in point: to index arbitrarily into the set returned by combn(10000,4), you would probably start by determining something as straightforward as "is the first of my four numbers a 1?". Knowing that the first j combinations produced by a combination generator start with 1 (e.g., 1,2,3,4; 1,2,3,5; ...; 1,2,3,10000), you think "all I need to do is check my desired index against this first set of 1s and iterate" (looking for 2s, 3s, etc.). Unfortunately, with 10k elements and N=5, the first 4.162501e+14 rows start with 1. (This happens to be choose(10000-1,5-1), which is not a coincidence.) You then have to do this again and again, and the count just goes higher.

This is well outside 32-bit integer space. N=6 escalates further, with 8.320840e+17 rows in the first set of 1s.
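The block sizes quoted above can be reproduced directly with choose(). This is only a small arithmetic check; the variable names are made up for illustration.

```r
n <- 10000

# Number of rows that start with 1 for N = 5 and N = 6:
# the remaining N - 1 values are chosen from the other n - 1 elements.
first_block <- choose(n - 1, c(5, 6) - 1)

# Largest value an R integer can hold (32-bit signed).
int_max <- .Machine$integer.max

print(first_block)            # compare with the figures quoted in the text
print(first_block / int_max)  # how many times larger than the integer limit
```

Both counts exceed the integer limit by many orders of magnitude, so an index into this space cannot be stored as an ordinary R integer.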

Performing "random access" on this space is rather insane; I suspect even native 64-bit calculations will run out of room fairly quickly.
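To put rough numbers on that suspicion, the total number of combinations can be compared with two limits: 2^53, above which a double can no longer represent every integer exactly, and 2^63, the approximate upper bound of a signed 64-bit integer. This is a sketch with illustrative variable names, not a statement about any particular package.

```r
# Total number of combinations of 10000 elements for N = 4, 5, 6.
total <- choose(10000, c(4, 5, 6))

limit_double <- 2^53  # doubles are exact for integers only below this
limit_int64  <- 2^63  # signed 64-bit integers stop just below this

data.frame(
  N               = 4:6,
  total           = total,
  exact_in_double = total < limit_double,
  fits_int64      = total < limit_int64
)
```

Under these assumptions, N=4 is still exactly indexable with doubles, N=5 already is not, and N=6 exceeds even the 64-bit integer range.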

If you cannot reduce your data size

I believe your most practical route is to use the code @alistaire suggested in his comment:

set.seed(42)
x <- seq(1, 10000)
N <- 4
M <- 10
out <- list()
while(length(out) < M) {
  # unique() over the combined list also drops repeats of earlier draws
  out <- unique(c(out,
                  replicate(M - length(out), sort(sample(x, N)), simplify = FALSE)))
}
str(out)
# List of 10
#  $ : int [1:4] 2861 8302 9149 9370
#  $ : int [1:4] 1347 5191 6418 7365
#  $ : int [1:4] 4577 6570 7050 7189
#  $ : int [1:4] 2555 4623 9347 9398
#  $ : int [1:4] 1175 4750 5602 9783
#  $ : int [1:4] 1387 9041 9464 9887
#  $ : int [1:4] 825 3902 5142 9055
#  $ : int [1:4] 4470 7375 8109 8360
#  $ : int [1:4] 40 3882 6852 8327
#  $ : int [1:4] 74 2077 6116 9065
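For a sense of how often the while loop above needs a second pass, the chance of drawing the same combination twice can be estimated with the usual birthday-problem approximation, M(M-1)/(2T), where T is the number of possible combinations. This is a rough estimate that assumes every combination is equally likely; the variable names are illustrative.

```r
n_total <- choose(10000, 4)  # number of distinct 4-element subsets of x
M <- 10                      # number of subsets drawn

# Approximate probability that at least two of the M draws coincide.
p_dup <- M * (M - 1) / (2 * n_total)
print(p_dup)
```

The estimate comes out on the order of 1e-13, so with inputs of this size the loop will almost always finish in one pass.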

Or a slight adaptation (it can be ~30% faster with much larger N, M):

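set.seed(42)
x <- seq(1, 10000)
lenx <- length(x)  # was undefined in the original snippet
N <- 8
M <- 100
out <- list()
while (length(out) < M) {
  # draw M candidate rows at once and sort each row (apply() returns N x M)
  out2 <- split(apply(matrix(sample(lenx, size = M*N, replace = TRUE),
                            nrow = M, ncol = N),
                     1, sort),
                rep(1:M, each = N))
  # replace = TRUE can repeat a value within a row; drop those rows
  out2 <- out2[ !vapply(out2, anyDuplicated, integer(1)) ]
  # de-duplicate against earlier passes as well as within this one
  out <- unique(c(out, unname(out2)))
}
out <- head(out, M)  # trim any surplus from the last pass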

If you know that M*N < length(x), you can use replace=FALSE instead, which should guarantee a single pass through the while loop.

Even if you can significantly reduce your sample size

I've written a function that provides random access to combinations. However, the more I test it, the more I see that when it starts breaking down, it does so without ensuring complete uniqueness of the indices and without guaranteeing an error when this happens. (So I'm not posting it. I can provide it offline if somebody is really curious. I did something similar with a lazy expand.grid, but that was mathematically much simpler and more tractable, and even then I haven't tested it with sets this large. Since you are looking for combinations, not permutations, I don't think it fits here.)
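For readers who want to see what random access to combinations can look like on small inputs, here is a minimal sketch. It is not the function mentioned above, which was never posted; unrank_comb is a hypothetical helper written for illustration. It walks through the "how many rows start with this value" blocks described earlier, and it refuses to run once choose(n, m) reaches 2^53, because the block arithmetic is no longer exact in doubles beyond that point.

```r
# Hypothetical helper: return the k-th (1-based) combination of m values
# from 1..n, in the same lexicographic order that combn(n, m) uses.
unrank_comb <- function(k, n, m) {
  stopifnot(k >= 1, k <= choose(n, m), choose(n, m) < 2^53)
  out <- integer(m)
  v <- 1L                       # smallest value still available
  for (i in seq_len(m)) {
    repeat {
      # number of combinations whose i-th element is v
      block <- choose(n - v, m - i)
      if (k <= block) break
      k <- k - block            # skip past the whole block
      v <- v + 1L
    }
    out[i] <- v
    v <- v + 1L
  }
  out
}

# Usage on a small case: distinct indices give distinct combinations.
set.seed(1)
n <- 20; m <- 4; M <- 5
idx <- sample.int(choose(n, m), M)
picked <- lapply(idx, unrank_comb, n = n, m = m)
str(picked)
```

Sampling the indices without replacement guarantees uniqueness without generating the full combn output, but only while the index space stays small enough to be represented exactly.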

Bottom line: unfortunately, R may not be the place for this.

Answer 1
2 2016-12-04 09:51:26
