R Randomly select a combination
Let's say I have a vector of numbers x that contains 10 numbers. I want to select a subset of N numbers, M times, and put the subsets into a list object. How can I ensure that each subset I pick is unique among the elements already in the list? Note that order does not matter, so c(1,0,3) is the same as c(3,0,1).
x = seq(1,10,1)
I can do this with combn(x,N), but when my x contains 10k or more elements, generating everything with combn and then randomly selecting from the result is computationally infeasible.
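To make the scale concrete, here is a minimal sketch of the enumerate-then-sample approach on the small vector, next to a count of what combn() would have to produce for 10k elements. The variable names and the choice of N = 3 and N = 4 are only illustrative.

```r
# Small case: enumerating every combination and sampling columns works fine.
x_small <- seq(1, 10, 1)
all_combos <- combn(x_small, 3)   # one combination per column
set.seed(1)
picked <- all_combos[, sample(ncol(all_combos), 5), drop = FALSE]

# Large case: only count the combinations, do not generate them.
n_combos_large <- choose(10000, 4)   # number of columns combn() would need
print(ncol(all_combos))
print(n_combos_large)
```

Sampling column indices without replacement guarantees distinct combinations, but only because the whole matrix already exists in memory.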
To phrase the question another way: I want to sample the output of combn(x,N) randomly without replacement. Is it possible to do that without calling combn first?
Any ideas?
> I want to sample the output of combn(x,N) randomly without replacement. Is it possible without calling combn first?
I don't think so, not with the current state of 32-bit integers (and bit64, even as good as it is, doesn't catch everything).
Case in point: in order to be able to arbitrarily index the set returned by combn(10000,4), you would probably start by determining something as straightforward as "is the first of my four numbers a 1?". Knowing that the first j iterations of a combination generator will start with 1 (e.g., 1,2,3,4, 1,2,3,5, ..., 1,2,3,10000), you think "all I need to do is check my desired index against this first set of 1s and iterate" (looking for 2s, 3s, etc.). Unfortunately, with 10k elements and N=5, the first 4.162501e+14 rows start with "1". (This happens to be choose(10000-1,5-1), which is not a coincidence: once the first element is fixed at 1, the remaining 4 elements are chosen from the other 9999 values.) You then have to do this again and again, and the count just goes higher.
This is well out of 32-bit integer space. N=6 escalates to 8.320840e+17 for the first set of 1s.
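The counts quoted above can be reproduced with choose() alone. This is a small sketch; the variable names are mine, and the comparison against 2^53 is an extra check I am adding because R doubles cannot represent every integer beyond that point.

```r
n <- 10000
first_block_N5 <- choose(n - 1, 5 - 1)   # rows starting with 1 when N = 5
first_block_N6 <- choose(n - 1, 6 - 1)   # rows starting with 1 when N = 6
int_max <- .Machine$integer.max          # largest value of R's 32-bit integer type

print(first_block_N5 > int_max)   # far beyond the integer range
print(first_block_N6 > 2^53)      # beyond exact integer arithmetic in doubles too
```

So even the size of the very first block cannot be held in an R integer, and for N=6 it cannot be tracked exactly in a double either.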
Performing "random access" on this space is rather insane; I suspect that even native 64-bit calculations will run out of space fairly quickly.
I believe your most practical route will be to use @alistaire's suggested code in his comment:
set.seed(42)
x <- seq(1, 10000)
N <- 4
M <- 10
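out <- list()
while (length(out) < M) {
  # Draw the missing subsets, sort each one so that order is ignored, and
  # de-duplicate against everything collected so far (not just the new batch).
  out <- unique(c(out,
                  replicate(M - length(out), sort(sample(x, N)), simplify = FALSE)))
}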
str(out)
# List of 10
# $ : int [1:4] 2861 8302 9149 9370
# $ : int [1:4] 1347 5191 6418 7365
# $ : int [1:4] 4577 6570 7050 7189
# $ : int [1:4] 2555 4623 9347 9398
# $ : int [1:4] 1175 4750 5602 9783
# $ : int [1:4] 1387 9041 9464 9887
# $ : int [1:4] 825 3902 5142 9055
# $ : int [1:4] 4470 7375 8109 8360
# $ : int [1:4] 40 3882 6852 8327
# $ : int [1:4] 74 2077 6116 9065
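As a sanity check on the rejection-sampling idea, the loop can be wrapped in a function and exercised on a case where collisions are frequent. This is only a sketch: sample_combos is a hypothetical helper name, and the guard on M is my addition so the loop cannot run forever when more subsets are requested than exist.

```r
# Hypothetical helper wrapping the rejection-sampling loop.
sample_combos <- function(x, N, M) {
  # Refuse requests that can never be satisfied.
  stopifnot(N <= length(x), M <= choose(length(x), N))
  out <- list()
  while (length(out) < M) {
    batch <- replicate(M - length(out), sort(sample(x, N)), simplify = FALSE)
    out <- unique(c(out, batch))   # sorted vectors compare equal regardless of draw order
  }
  out
}

set.seed(42)
res <- sample_combos(seq(1, 10000), N = 4, M = 10)

# Stress case: only choose(5, 2) = 10 subsets exist, so many draws are rejected.
small <- sample_combos(1:5, N = 2, M = 10)
print(anyDuplicated(res))
print(anyDuplicated(small))
```

With 10k elements collisions are so rare that the loop almost always finishes in one pass; the small case shows that uniqueness still holds when it does not.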
Or a slight adaptation (can be ~30% faster with much larger N, M):
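set.seed(42)
x <- seq(1, 10000)
lenx <- length(x)
N <- 8
M <- 100
out <- list()
while (length(out) < M) {
  # Draw M*N indices into x at once; each matrix row is one candidate subset.
  draws <- matrix(sample(lenx, size = M * N, replace = TRUE), nrow = M, ncol = N)
  # Drop rows that repeat an element, since those are not valid combinations.
  draws <- draws[apply(draws, 1, anyDuplicated) == 0, , drop = FALSE]
  # apply() returns the sorted rows as columns, so split column by column.
  out2 <- unname(split(apply(draws, 1, sort), rep(seq_len(nrow(draws)), each = N)))
  out <- unique(c(out, out2))
}
out <- out[seq_len(M)]  # the last pass may overshoot M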
If you know that M*N < length(x), then you can use replace=FALSE instead and you should be guaranteed a single pass through the while loop: every drawn value is then distinct, so no row repeats an element and no two rows can be equal.
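A minimal sketch of that single-pass variant, assuming M*N does not exceed length(x) (the stopifnot guard and the lapply-based row extraction are my own choices):

```r
set.seed(42)
x <- seq(1, 10000)
N <- 8
M <- 100
stopifnot(M * N <= length(x))

# One draw of M*N distinct values, cut into M rows of N values each.
draws <- matrix(sample(x, size = M * N, replace = FALSE), nrow = M, ncol = N)
out <- lapply(seq_len(M), function(i) sort(draws[i, ]))
print(anyDuplicated(out))
```

One caveat: the subsets produced this way never share an element with each other, so this is not the same distribution as sampling rows of combn(x,N) without replacement, where overlapping combinations are allowed.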
I've written a function that provides random access to combinations. However, the more I test it, the more I see that when it starts breaking down, it does so without ensuring complete uniqueness of the indices and without guaranteeing an error when this happens. (So I'm not posting it. I can provide it offline if somebody is really curious. I did something similar with a lazy expand.grid, but that was mathematically much simpler and more tractable, and even then I haven't tested it with sets this large. Since you are looking for combinations, not permutations, I don't think it fits here.)
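For readers curious what such random access could look like, here is a rough sketch of lexicographic "unranking" (this is not the function mentioned above, which was not posted). unrank_combo is a hypothetical name, it assumes choose() is exact for the counts involved, and it stops with an error once the total number of combinations reaches 2^53 rather than returning silently wrong results.

```r
# Return the rank-th combination (1-based, in combn() order) of k items from 1..n.
unrank_combo <- function(rank, n, k) {
  total <- choose(n, k)
  # Assumption: below 2^53 the counts are exactly representable as doubles.
  stopifnot(total < 2^53, rank >= 1, rank <= total)
  combo <- integer(k)
  r <- rank - 1        # zero-based rank still to be consumed
  candidate <- 1
  for (i in seq_len(k)) {
    repeat {
      # Number of combinations whose i-th element equals `candidate`.
      block <- choose(n - candidate, k - i)
      if (r < block) break
      r <- r - block
      candidate <- candidate + 1
    }
    combo[i] <- candidate
    candidate <- candidate + 1
  }
  combo
}

# Compare with combn() on a small case where full enumeration is cheap.
ref <- combn(6, 3)
matches <- sapply(seq_len(ncol(ref)), function(r) all(unrank_combo(r, 6, 3) == ref[, r]))

# Distinct ranks give distinct combinations without enumerating anything.
set.seed(42)
ranks <- sample.int(choose(10000, 3), 5)
picks <- lapply(ranks, unrank_combo, n = 10000, k = 3)
```

With 10k elements this guard already trips at N=5, since choose(10000,5) is above 2^53, which is consistent with the concern about running out of integer space.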
Bottom line: R may not be the place for this, unfortunately.