
R Randomly select a combination

Let's say I have a vector of numbers x that contains 10 numbers. I want to select a subset of N numbers, M times, and put the results into a list object. How can I ensure each pick is unique from any element already in the list? Note that order does not matter, so c(1,0,3) is the same as c(3,0,1) .

x = seq(1,10,1)

I can do this with combn(x,N) , but when my x contains 10k or more elements, generating combn and randomly selecting from its output is computationally infeasible.
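For scale (my own back-of-the-envelope check, not from the question), the size of the output combn would have to materialize:

```r
# Number of size-4 subsets of 10,000 elements -- the column count of combn(x, 4)
choose(10000, 4)
# [1] 4.164669e+14
# At 4 ints x 4 bytes per column, storing this would take roughly 6.7 petabytes
```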

Phrasing the question another way: I want to sample the output of combn(x,N) randomly without replacement. Is it possible without calling combn first?

Any ideas?

I want to sample the output of combn(x,N) randomly without replacement. Is it possible without calling combn first?

I don't think so, not with the current state of 32-bit integers (and bit64 , as good as it is, doesn't catch everything).

Case in point: in order to arbitrarily index the set returned by combn(10000,5) , you would probably start by determining something as straightforward as "is the first of my five numbers a '1'?". Knowing that the first j iterations of a combination generator start with 1 (e.g., 1,2,3,4,5 , 1,2,3,4,6 , ..., ending at 1,9997,9998,9999,10000 ), you think "all I need to do is check my desired index against this first set of 1s and iterate" (looking for 2s, 3s, etc). Unfortunately, with 10k elements and N=5 , the first 4.162501e+14 rows start with "1". (This happens to be choose(10000-1,5-1) , which is not a coincidence.) You then have to do this again and again, and the count just goes higher.

This is well outside 32-bit integer space. N=6 escalates to 8.320840e+17 for the first set of 1s.

Performing "random access" on this space is rather impractical, something where even (I suspect) native 64-bit calculations will run out of exactness fairly quickly.
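The counts above are easy to confirm, and comparing them against R's integer limit shows why plain 32-bit indexing is out of the question:

```r
# Combinations of 10,000 elements that begin with "1", per the argument above
choose(10000 - 1, 5 - 1)   # first block for N = 5, ~4.1625e+14
choose(10000 - 1, 6 - 1)   # first block for N = 6, ~8.3208e+17

# R's largest exact 32-bit integer is 2147483647 -- both counts dwarf it
choose(9999, 4) > .Machine$integer.max   # TRUE
```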

If you cannot reduce your data size

I believe your most practical route will be to use @alistaire's suggested code in his comment:

set.seed(42)
x <- seq(1, 10000)
N <- 4   # size of each combination
M <- 10  # number of combinations wanted
out <- list()
# Rejection sampling: draw, sort (so order is irrelevant), dedupe, repeat
while(length(out) < M) {
  out <- c(out,
           unique(replicate(M - length(out), sort(sample(x, N)), simplify = FALSE)))
}
str(out)
# List of 10
#  $ : int [1:4] 2861 8302 9149 9370
#  $ : int [1:4] 1347 5191 6418 7365
#  $ : int [1:4] 4577 6570 7050 7189
#  $ : int [1:4] 2555 4623 9347 9398
#  $ : int [1:4] 1175 4750 5602 9783
#  $ : int [1:4] 1387 9041 9464 9887
#  $ : int [1:4] 825 3902 5142 9055
#  $ : int [1:4] 4470 7375 8109 8360
#  $ : int [1:4] 40 3882 6852 8327
#  $ : int [1:4] 74 2077 6116 9065
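That loop can be wrapped into a reusable helper. This is a sketch of my own (the name sample_combos is not from the original code); the one added line dedupes against earlier passes as well, since unique() inside the loop only removes duplicates within each new batch:

```r
# Rejection-sampling helper: M unique size-N combinations drawn from x.
# sample_combos is a hypothetical wrapper name for the loop shown above.
sample_combos <- function(x, N, M) {
  out <- list()
  while (length(out) < M) {
    out <- c(out,
             unique(replicate(M - length(out), sort(sample(x, N)),
                              simplify = FALSE)))
    out <- out[!duplicated(out)]   # also guard against cross-batch duplicates
  }
  out
}

set.seed(1)
res <- sample_combos(seq_len(10000), 4, 10)
length(res)        # 10
anyDuplicated(res) # 0: every combination is distinct
```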

Or a slight adaptation (can be ~30% faster for much larger N and M ):

set.seed(42)
x <- seq(1, 10000)
lenx <- length(x)   # sample indices from 1..lenx
N <- 8
M <- 100
out <- list()
while (length(out) < M) {
  # Draw M*N indices at once, sort each row, then split the result into rows
  out2 <- split(apply(matrix(sample(lenx, size = M*N, replace = TRUE),
                             nrow = M, ncol = N),
                      1, sort),
                rep(1:M, each = N))
  out <- c(out, out2[ !duplicated(out2) ])
}

If you know that M*N <= length(x) , then you can use replace=FALSE instead, and you are guaranteed a single pass through the while loop.
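To see why one pass suffices: with replace=FALSE every sampled value is distinct, so no two rows can share even one element, let alone form the same combination. A quick check:

```r
set.seed(42)
x <- seq_len(10000)
N <- 8
M <- 100   # M*N = 800 <= length(x), so replace = FALSE is legal

# All M*N values are distinct, hence the M rows are pairwise disjoint
idx <- matrix(sample(x, M * N, replace = FALSE), nrow = M, ncol = N)
out <- split(apply(idx, 1, sort), rep(seq_len(M), each = N))
anyDuplicated(out)   # 0: no duplicate combinations, in a single pass
```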

Even if you can significantly reduce your sample size

I've written a function that provides random access to combinations. However, the more I test it, the more I see that when it starts breaking down, it does so without ensuring complete uniqueness of the indices and without a guarantee of erring when that happens. (So I'm not posting it; I can provide it offline if somebody is really curious. I did something similar with a lazy expand.grid , but that was mathematically much simpler/tractable, and even then I haven't tested it with sets this large. Since you are looking for combinations, not permutations, I don't think it fits here.)
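For the curious, the underlying idea can be sketched as lexicographic "unranking": walk the counting argument from above, subtracting choose() block sizes until the index falls inside a block. This is my own minimal sketch, not the withheld function, and it inherits exactly the failure mode described: choose() returns a double, so past 2^53 the block sizes are no longer exact and uniqueness silently breaks.

```r
# Map an index in 1..choose(n, k) to the combination of {1..n} at that
# position in lexicographic order. unrank_combo is a hypothetical name.
# Only reliable while all choose() values stay below 2^53 (exact doubles).
unrank_combo <- function(idx, n, k) {
  out <- integer(k)
  elem <- 1L
  for (i in seq_len(k)) {
    repeat {
      block <- choose(n - elem, k - i)  # combos whose i-th element is 'elem'
      if (idx <= block) break           # target lies inside this block
      idx <- idx - block                # otherwise skip it and advance
      elem <- elem + 1L
    }
    out[i] <- elem
    elem <- elem + 1L                   # next element must be strictly larger
  }
  out
}

unrank_combo(1, 10, 3)              # 1 2 3  (first combination)
unrank_combo(choose(10, 3), 10, 3)  # 8 9 10 (last combination)
```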

Bottom line: R may not be the place for this, unfortunately.
