
Generating random sample data in R with specified sample size and probability

I want to use R to write a model that will answer a general question about probability. The general question is below, followed by my specific questions about how to answer it using R code. If you know the answer to the general question (separate from the R code), and can explain the underlying statistical principles in plain English, I'm interested in that too!

Question: If I split a group of n objects, first through a 4-way splitter, then through a 7-way splitter (resulting in a total of 28 distinct groups), and each splitter results in a random distribution (i.e. the objects are split approximately equally), does the order of the splits impact the variance of the final 28 groups? If I split into 4 and then into 7, is that different from splitting into 7 and then into 4? Does the answer change if one splitter has greater variance than the other?

Specific R question: how can I write a model to answer this question? So far, I've tried using sample and rnorm to generate sample data. Simulating a 4-way splitter would look something like this:

sample(1:4, size=100000, replace=TRUE)

This is basically like rolling a 4-sided die 100,000 times and recording the number of instances of each number. I can use the table function to sum the instances, which gives me an output like this:

> table(sample(1:4, size=100000, replace=TRUE))

    1     2     3     4 
25222 24790 25047 24941

Now, I want to take each of those outputs and use them as the input for a 7-way split. I tried saving the 4-way split as a variable and then plugging that vector into the size argument like this:

Split4way <- as.vector(table(sample(1:4, size=100000, replace=TRUE)))
as.vector(table(sample(1:7, size=Split4way, replace=TRUE)))

But when I do that, instead of a matrix with 4 rows and 7 columns, I just get a single vector of 7 counts. It appears that the size argument for the 7-way split only uses 1 of the 4 outputs from the 4-way split instead of using each of them.

> as.vector(table(sample(1:7, size = Split4way, replace=TRUE)))
[1] 3527 3570 3527 3511 3550 3480 3588

So, how can I generate a table or list that shows all the outputs of the 4-way split followed by the 7-way split, for a total of 28 splits?

AND

Is there a function that will allow me to customize the standard deviation of each splitting device? For example, can I dictate that the outputs of the 4-way splitter have a standard deviation of x%, and the outputs of the 7-way splitter have a standard deviation of y%?

We can illustrate your set-up by writing a function that will simulate n objects being passed into the splitters.

Imagine the object comes first to the 4-splitter. Let us randomly assign it a number from one to four to determine which way it is split. Next it comes to a seven splitter; we can also randomly assign it a number from one to seven to determine which final bin it will end up in.

The set-up looks like this:

                                    Final bins

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
1  2  3  4  5  6  7  1  2  3  4  5  6  7  1  2  3  4  5  6  7  1  2  3  4  5  6  7  
\__|__|__|__|__|_/   \__|__|__|__|__|_/   \__|__|__|__|__|_/   \__|__|__|__|__|_/  
        |                    |                    |                    |
  seven splitter       seven splitter       seven splitter      seven splitter         
        |                    |                    |                    |
        1                    2                    3                    4
         \___________________|____________________|___________________/
                                        |
                                   four splitter
                                        |
                                      input

We can see that any unique pair of numbers will cause the object to end up in a different bin.

For the second set-up, we reverse the order, so that the seven splitter comes first, but otherwise each object still gets a unique bin based on a unique pair of numbers:

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4   
\__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  
     |           |           |           |           |           |           |
4 splitter  4 splitter  4 splitter  4 splitter  4 splitter  4 splitter  4 splitter 
     |           |           |           |           |           |           |
     1           2           3           4           5           6           7
      \__________|___________|___________|___________|___________|__________/
                                         |
                                     7 splitter
                                         |
                                       input

Note that we can either draw a random 1:4 then a random 1:7, or vice versa, but in either case the unique pair will determine a unique bin. The actual bin the object ends up in will change depending on the order in which the two numbers are applied, but this will not change the fact that each bin will get 1/28 of the objects passed in, and the variance will remain the same.

That means that to simulate and compare the two set-ups, we need only sample from 1:4 and 1:7 for each object passed in, then apply the two numbers in a different order to calculate the final bin:

simulate <- function(n) {
  df <- data.frame(fours  = sample(4, n, replace = TRUE),
                   sevens = sample(7, n, replace = TRUE))
  df$four_then_seven <- 7 * (df$fours - 1) + df$sevens
  df$seven_then_four <- 4 * (df$sevens - 1) + df$fours
  return(df)
}
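To get the 4-by-7 layout asked about in the question, one option is to cross-tabulate the two draws instead of collapsing them into a single bin number. The following is only a sketch: the names draws, counts_4x7, first_split and nested are made up for this illustration, and the seed is arbitrary. The second half keeps the question's original idea of feeding the 4-way counts into the 7-way split, but one count at a time, since the output shown in the question suggests that only the first element of a vector passed as size is used.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100000

# Approach 1: one pair of draws per object, then cross-tabulate.
draws <- data.frame(fours  = sample(4, n, replace = TRUE),
                    sevens = sample(7, n, replace = TRUE))
counts_4x7 <- table(four = draws$fours, seven = draws$sevens)
counts_4x7   # 4 rows (4-way outputs) by 7 columns (7-way outputs)

# Approach 2: split into four groups first, then split each group's
# count seven ways. tabulate() keeps empty bins, unlike table().
first_split <- tabulate(sample(4, n, replace = TRUE), nbins = 4)
nested <- t(sapply(first_split, function(m) {
  tabulate(sample(7, m, replace = TRUE), nbins = 7)
}))
nested       # also 4 rows by 7 columns

# Every object lands in exactly one of the 28 cells in both approaches.
sum(counts_4x7) == n
sum(nested) == n
```

Both matrices describe the same random process; the first simply records the pair of draws per object, while the second works on counts.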

So let's examine how this would play out for 10 objects passed in:

set.seed(69) # Makes the example reproducible

simulate(10)
#>    fours sevens four_then_seven seven_then_four
#> 1      4      6              27              24
#> 2      1      5               5              17
#> 3      3      7              21              27
#> 4      2      2               9               6
#> 5      4      2              23               8
#> 6      4      3              24              12
#> 7      1      4               4              13
#> 8      3      2              16               7
#> 9      3      7              21              27
#> 10     3      2              16               7

Now let's do a table of the quantities in each bin if we had 100,000 draws:

s <- simulate(100000)

seven_four <- table(s$seven_then_four)
seven_four
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> 3434 3607 3539 3447 3512 3628 3564 3522 3540 3539 3544 3524 3552 3644 3626 3578 
#>   17   18   19   20   21   22   23   24   25   26   27   28 
#> 3609 3616 3673 3617 3654 3637 3542 3624 3568 3651 3486 3523

four_seven <- table(s$four_then_seven)
four_seven
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> 3434 3512 3540 3552 3609 3654 3568 3607 3628 3539 3644 3616 3637 3651 3539 3564 
#>   17   18   19   20   21   22   23   24   25   26   27   28 
#> 3544 3626 3673 3542 3486 3447 3522 3524 3578 3617 3624 3523

If you sort these two tables from smallest number to largest number in each bin, you will see they are actually identical apart from the labels on their bins. The distribution of counts is completely unchanged. This means the variance / standard deviation is also the same in both cases:

var(four_seven)
#> [1] 3931.439

var(seven_four)
#> [1] 3931.439
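The claim that both orderings produce the same set of counts can also be checked directly instead of by eye. This is a minimal sketch that repeats simulate() so it runs on its own; the seed is arbitrary, and tabulate() is used in place of table() only so that the result is a plain integer vector covering all 28 bins.

```r
# Same simulate() as above, repeated so this snippet is self-contained.
simulate <- function(n) {
  df <- data.frame(fours  = sample(4, n, replace = TRUE),
                   sevens = sample(7, n, replace = TRUE))
  df$four_then_seven <- 7 * (df$fours - 1) + df$sevens
  df$seven_then_four <- 4 * (df$sevens - 1) + df$fours
  df
}

set.seed(1)  # arbitrary seed
s <- simulate(100000)

# Plain integer counts for bins 1..28
four_seven <- tabulate(s$four_then_seven, nbins = 28)
seven_four <- tabulate(s$seven_then_four, nbins = 28)

# Sorting discards the bin labels and leaves only the counts.
same_counts <- identical(sort(four_seven), sort(seven_four))
same_counts  # TRUE: each (fours, sevens) pair maps to exactly one bin in either order

# Identical sets of counts imply identical variances.
all.equal(var(four_seven), var(seven_four))
```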

The only way to change the variance / standard deviation is to "fix" the splitters so they do not have an equal probability.
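To see what unequal probabilities do, here is a rough sketch using a hypothetical helper, bin_counts(), which is not part of the code above. The weights given to the 4-way splitter are invented for illustration. With fair splitters the 28 counts follow a multinomial distribution with equal probabilities, so the expected value of var() over the counts works out to n / 28; with a biased splitter the bins have different expected sizes and the variance grows, but swapping the order of the splitters still changes nothing.

```r
# Hypothetical helper: returns the 28 bin counts for n objects, given
# selection weights p4 and p7 for the two splitters (equal by default).
bin_counts <- function(n, p4 = rep(1, 4), p7 = rep(1, 7)) {
  fours  <- sample(4, n, replace = TRUE, prob = p4)
  sevens <- sample(7, n, replace = TRUE, prob = p7)
  list(four_then_seven = tabulate(7 * (fours - 1) + sevens, nbins = 28),
       seven_then_four = tabulate(4 * (sevens - 1) + fours, nbins = 28))
}

set.seed(1)  # arbitrary seed
n <- 100000

fair   <- bin_counts(n)
rigged <- bin_counts(n, p4 = c(0.22, 0.24, 0.26, 0.28))  # made-up weights

var(fair$four_then_seven)    # fluctuates around n / 28
n / 28
var(rigged$four_then_seven)  # much larger: bins now differ systematically

# The order of the splitters still makes no difference.
all.equal(var(rigged$four_then_seven), var(rigged$seven_then_four))
```

The same helper accepts unequal weights for the 7-way splitter through p7, so either device (or both) can be made uneven.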

I'm also struggling to interpret your use of variance and standard deviation. The best I can think of is doing this "splitting" non-uniformly.

As an alternative to Allan's code, you could generate non-uniform samples by doing:

# how should the alternatives be weighted (normalised probability is also OK)
a <- c(1, 2, 3, 4)  # i.e. last four times as much as first
b <- c(1, 1, 2, 2, 3, 3, 4)

x <- sample(28, 10000, prob=a %*% t(b), replace=TRUE)

Note that prob is automatically normalised (i.e. divided by its sum) in sample. You can check that things are working with:

  • table((x-1) %% 4 + 1) should be close to a/sum(a) * 10000
  • table((x-1) %/% 4 + 1) should be close to b/sum(b) * 10000
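The two checks above can be put side by side with the expected values. This is a sketch with an arbitrary seed; observed_a and observed_b are names introduced here, and tabulate() stands in for table() so that the counts line up with the expected vectors even if a bin were empty.

```r
set.seed(1)  # arbitrary seed
n <- 10000

a <- c(1, 2, 3, 4)
b <- c(1, 1, 2, 2, 3, 3, 4)

# a %*% t(b) is a 4 x 7 matrix of weights. It is read column by column,
# so bin x corresponds to row (x - 1) %% 4 + 1 and column (x - 1) %/% 4 + 1.
x <- sample(28, n, prob = a %*% t(b), replace = TRUE)

observed_a <- tabulate((x - 1) %% 4 + 1, nbins = 4)
observed_b <- tabulate((x - 1) %/% 4 + 1, nbins = 7)

# Observed counts next to the counts implied by the weights
rbind(observed = observed_a, expected = a / sum(a) * n)
rbind(observed = observed_b, expected = b / sum(b) * n)
```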

