Let say I have a vector of numbers x
that contains 10 numbers. I want to select a subset of N number M times and put it into a list object. How can I ensure what I pick is unique from the any elements within the list already? Note that order does not matter. So c(1,0,3)
is the same as c(3,0,1)
.
x = seq(1,10,1)
I can do this with combn(x,N)
but in situation whens my x
contains 10k or more elements, using combn and randomly select any of the ones within is computation infeasible.
Alternatively phrasing the question. I want to sample the output of combn(x,N)
randomly without replacement. Is it possible without generating calling combn
first?
Any ideas?
I want to sample the output of combn(x,N) randomly without replacement. Is it possible without generating calling combn first?
I don't think so, not with the current state of 32-bit integers (and bit64
, even as good as it is, doesn't catch everything).
Case in point: in order to be able to arbitrarily index the set returned by combn(10000,4)
, you probably start by determining something as straight-forward/simple as "is the first of my four numbers a '1'". Knowing that the first j
iterations of a combination generator will start with 1 (eg, 1,2,3,4
, 1,2,3,5
, ..., 1,2,3,10000
), you think "all I need to do is check my desired index against this first set of 1s and iterate" (looking for 2s, 3s, etc). Unfortunately, with 10k and N=5
, the first 4.162501e+14
rows start with "1". (This happens to be choose(10000-1,5-1)
, which is not a coincidence.) You then have to do this again and again, and the count just goes higher.
This is well out of 32-bit integer-space. N=6
escalates with 8.320840e+17
as the first set of 1s.
To perform "random access" on this space is rather insane, something where even (I suspect) native 64bit calculations will run out of space fairly quickly.
I believe your most practical route will be to use @alistaire's suggested code in his comment:
set.seed(42)
x <- seq(1, 10000)
N <- 4
M <- 10
out <- list()
while(length(out) < M) {
out <- c(out,
unique(replicate(M - length(out), sort(sample(x, N)), simplify = FALSE)))
}
str(out)
# List of 10
# $ : int [1:4] 2861 8302 9149 9370
# $ : int [1:4] 1347 5191 6418 7365
# $ : int [1:4] 4577 6570 7050 7189
# $ : int [1:4] 2555 4623 9347 9398
# $ : int [1:4] 1175 4750 5602 9783
# $ : int [1:4] 1387 9041 9464 9887
# $ : int [1:4] 825 3902 5142 9055
# $ : int [1:4] 4470 7375 8109 8360
# $ : int [1:4] 40 3882 6852 8327
# $ : int [1:4] 74 2077 6116 9065
Or a slight adaptation (can be ~30% faster with much larger N
, M
):
set.seed(42)
N <- 8
M <- 100
out <- list()
while (length(out) < M) {
out2 <- split(apply(matrix(sample(lenx, size = M*N, replace = TRUE),
nrow = M, ncol = N),
1, sort),
rep(1:M, each = N))
out <- c(out, out2[ !duplicated(out2) ])
}
If you know that M*N < length(x)
, then you can use replace=FALSE
instead and you should be guaranteed a single pass through the while
loop.
I've written a function that provides random access to combinations. However, the more I test it, the more I see that when it starts breaking down, it will do so without ensuring complete uniqueness of the indices and without a guarantee of erring when this happens. (So I'm not posting it. I can provide it offline if somebody is really curious. I did something similar with a lazy expand.grid
, but that was mathematically much simpler/tractable; and even then I haven't tested it with sets this large. Since you are looking for combinations, not permutations, I don't think it fits here.)
Bottom line: R may not be the place for this, unfortunately.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.