Generating random sample data in R with specified sample size and probability

I want to use R to write a model that will answer a general question about probability. The general question is below, followed by my specific questions about how to answer it using R code. If you know the answer to the general question (separate from the R code), and can explain the underlying statistical principles in plain English, I'm interested in that too!

Question: If I split a group of n objects, first through a 4-way splitter, then through a 7-way splitter (resulting in a total of 28 distinct groups), and each splitter results in a random distribution (i.e. the objects are split approximately equally), does the order of the splits impact the variance of the final 28 groups? If I split into 4 and then into 7, is that different from splitting into 7 and then into 4? Does the answer change if one splitter has greater variance than the other?

Specific R question: how can I write a model to answer this question? So far, I've tried using sample and rnorm to generate sample data. Simulating a 4-way splitter would look something like this:

sample(1:4, size=100000, replace=TRUE)

This is basically like rolling a 4-sided die 100,000 times and recording the number of instances of each number. I can use the table function to count the instances, which gives me an output like this:

> table(sample(1:4, size=100000, replace=TRUE))

    1     2     3     4 
25222 24790 25047 24941

Now, I want to take each of those outputs and use them as the input for a 7-way split. I tried saving the 4-way split as a variable and then plugging that vector into the size = argument like this:

Split4way <- as.vector(table(sample(1:4, size=100000, replace=TRUE)))
as.vector(table(sample(1:7, size=Split4way, replace=TRUE)))

But when I do that, instead of a matrix with 4 rows and 7 columns, I just get a vector with 1 row and 7 columns. It appears that the size argument for the 7-way split only uses 1 of the 4 outputs from the 4-way split instead of using each of them.

> as.vector(table(sample(1:7, size = Split4way, replace=TRUE)))
[1] 3527 3570 3527 3511 3550 3480 3588

So, how can I generate a table or list that shows all the outputs of the 4-way split followed by the 7-way split, for a total of 28 splits?

AND

Is there a function that will allow me to customize the standard deviation of each splitting device? For example, can I dictate that the outputs of the 4-way splitter have a standard deviation of x%, and the outputs of the 7-way splitter have a standard deviation of x%?

We can illustrate your set-up by writing a function that will simulate n objects being passed into the splitters.

Imagine the object comes first to the 4-way splitter. Let us randomly assign it a number from one to four to determine which way it is split. Next it comes to a 7-way splitter; we can also randomly assign it a number from one to seven to determine which final bin it will end up in.

The set-up looks like this:

                                    Final bins

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
1  2  3  4  5  6  7  1  2  3  4  5  6  7  1  2  3  4  5  6  7  1  2  3  4  5  6  7  
\__|__|__|__|__|_/   \__|__|__|__|__|_/   \__|__|__|__|__|_/   \__|__|__|__|__|_/  
        |                    |                    |                    |
  seven splitter       seven splitter       seven splitter      seven splitter         
        |                    |                    |                    |
        1                    2                    3                    4
         \___________________|____________________|___________________/
                                        |
                                   four splitter
                                        |
                                      input

We can see that any unique pair of numbers will cause the object to end up in a different bin.

For the second set-up, we reverse the order, so that the 7-way splitter comes first, but otherwise each object still gets a unique bin based on a unique pair of numbers:

1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4   
\__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  \__|__|__/  
     |           |           |           |           |           |           |
4 splitter  4 splitter  4 splitter  4 splitter  4 splitter  4 splitter  4 splitter 
     |           |           |           |           |           |           |
     1           2           3           4           5           6           7
      \__________|___________|___________|___________|___________|__________/
                                         |
                                     7 splitter
                                         |
                                       input

Note that we can either draw a random number from 1:4 and then one from 1:7, or vice versa, but in either case the unique pair will determine a unique bin. The actual bin the object ends up in will change depending on the order in which the two numbers are applied, but this will not change the fact that each bin will get, on average, 1/28 of the objects passed in, and the variance will remain the same.
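The reasoning above can be written out as a short derivation. This is only a sketch, assuming both splitters are fair and every object is routed independently; the symbol $X_b$ for the count in bin $b$ is introduced here for illustration:

```latex
P(\text{bin } b) = \tfrac{1}{4}\cdot\tfrac{1}{7} = \tfrac{1}{7}\cdot\tfrac{1}{4} = \tfrac{1}{28},
\qquad X_b \sim \mathrm{Binomial}\!\left(n, \tfrac{1}{28}\right)

\mathbb{E}[X_b] = \frac{n}{28}, \qquad
\operatorname{Var}(X_b) = n \cdot \frac{1}{28} \cdot \frac{27}{28}
```

For n = 100,000 this gives a mean of about 3571 objects per bin and a per-bin variance of about 3444, whichever splitter comes first.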

That means that to simulate and compare the two set-ups, we need only sample from 1:4 and 1:7 for each object passed in, then apply the two numbers in a different order to calculate the final bin:

simulate <- function(n) {
  df <- data.frame(fours  = sample(4, n, replace = TRUE),
                   sevens = sample(7, n, replace = TRUE))
  df$four_then_seven <- 7 * (df$fours - 1) + df$sevens
  df$seven_then_four <- 4 * (df$sevens - 1) + df$fours
  return(df)
}
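As a quick check on the two bin formulas used in simulate, the sketch below enumerates all 28 possible (fours, sevens) pairs and confirms that each formula maps them one-to-one onto bins 1 to 28. The pairs data frame is introduced here purely for illustration.

```r
# Enumerate every possible pair of draws (4 x 7 = 28 combinations)
pairs <- expand.grid(fours = 1:4, sevens = 1:7)

# Same bin formulas as in simulate()
four_then_seven <- 7 * (pairs$fours - 1) + pairs$sevens
seven_then_four <- 4 * (pairs$sevens - 1) + pairs$fours

# Both orderings should hit each of the bins 1..28 exactly once
all(sort(four_then_seven) == 1:28)
all(sort(seven_then_four) == 1:28)
```

Both comparisons should return TRUE, so the two orderings differ only in how the same 28 bins are labelled.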

So let's examine how this would play out for 10 objects passed in:

set.seed(69) # Makes the example reproducible

simulate(10)
#>    fours sevens four_then_seven seven_then_four
#> 1      4      6              27              24
#> 2      1      5               5              17
#> 3      3      7              21              27
#> 4      2      2               9               6
#> 5      4      2              23               8
#> 6      4      3              24              12
#> 7      1      4               4              13
#> 8      3      2              16               7
#> 9      3      7              21              27
#> 10     3      2              16               7

Now let's do a table of the quantities in each bin if we had 100,000 draws:

s <- simulate(100000)

seven_four <- table(s$seven_then_four)
seven_four
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> 3434 3607 3539 3447 3512 3628 3564 3522 3540 3539 3544 3524 3552 3644 3626 3578 
#>   17   18   19   20   21   22   23   24   25   26   27   28 
#> 3609 3616 3673 3617 3654 3637 3542 3624 3568 3651 3486 3523

four_seven <- table(s$four_then_seven)
four_seven
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> 3434 3512 3540 3552 3609 3654 3568 3607 3628 3539 3644 3616 3637 3651 3539 3564 
#>   17   18   19   20   21   22   23   24   25   26   27   28 
#> 3544 3626 3673 3542 3486 3447 3522 3524 3578 3617 3624 3523

If you sort the counts in each of these two tables from smallest to largest, you will see they are identical apart from the labels on their bins. The distribution of counts is completely unchanged. This means the variance / standard deviation is also the same in both cases:

var(four_seven)
#> [1] 3931.439

var(seven_four)
#> [1] 3931.439
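The 28 counts can also be laid out as a 4 x 7 table, which is the shape asked for in the question. This is a minimal sketch, assuming the simulate function defined earlier (repeated here so the block runs on its own); the seed and the names counts, by_row and by_col are arbitrary choices for illustration.

```r
simulate <- function(n) {
  df <- data.frame(fours  = sample(4, n, replace = TRUE),
                   sevens = sample(7, n, replace = TRUE))
  df$four_then_seven <- 7 * (df$fours - 1) + df$sevens
  df$seven_then_four <- 4 * (df$sevens - 1) + df$fours
  df
}

set.seed(1)  # arbitrary seed, only for reproducibility
s <- simulate(100000)

# Cross-tabulate: rows = 4-way outlet, columns = 7-way outlet
counts <- table(fours = s$fours, sevens = s$sevens)
counts

# Reading the table row by row gives the four_then_seven bins,
# reading it column by column gives the seven_then_four bins
by_row <- as.vector(t(counts))
by_col <- as.vector(counts)
all(by_row == as.vector(table(s$four_then_seven)))
all(by_col == as.vector(table(s$seven_then_four)))

# The same 28 numbers in a different order, so the variance agrees
all.equal(var(by_row), var(by_col))
```

Because both orderings are just two ways of reading the same table, any summary of the 28 counts (variance, range, and so on) is the same for both.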

The only way to change the variance / standard deviation is to "fix" (rig) the splitters so that their outlets do not have equal probability.
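One way to rig a splitter is to give its outlets unequal probabilities. The sketch below is not part of the code above: it uses base R's rmultinom to pass counts through one splitter and then the other, and the helper split_counts and the weights p4 and p7 are hypothetical names and values chosen for illustration. It produces the 4 x 7 matrix of counts that the question asked for.

```r
# Pass each incoming count through a splitter whose outlets have weights `prob`
# (rmultinom normalises prob internally). Returns one column per incoming group.
split_counts <- function(counts, prob) {
  sapply(counts, function(m) rmultinom(1, size = m, prob = prob)[, 1])
}

set.seed(42)          # arbitrary seed, only for reproducibility
n  <- 100000
p4 <- c(1, 1, 1, 2)   # hypothetical: outlet 4 is twice as likely as the others
p7 <- rep(1, 7)       # fair 7-way splitter

first  <- rmultinom(1, size = n, prob = p4)[, 1]  # counts leaving the 4-way splitter
result <- t(split_counts(first, p7))              # 4 rows (4-way) x 7 columns (7-way)
result

# Expected counts: each bin's probability is the product of its two outlet probabilities
round(n * outer(p4 / sum(p4), p7 / sum(p7)))
```

Under this model each bin's probability is the product of the two outlet probabilities, and a product does not depend on the order of its factors, so swapping the splitters should relabel the bins without changing the distribution of the counts.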

I'm also struggling to interpret your use of variance and standard deviation. The best I can think of is doing this "splitting" non-uniformly.

As an alternative to Allan's code, you could generate non-uniform samples by doing:

# how should the alternatives be weighted (normalised probability is also OK)
a <- c(1, 2, 3, 4)  # i.e. last four times as much as first
b <- c(1, 1, 2, 2, 3, 3, 4)

x <- sample(28, 10000, prob=a %*% t(b), replace=TRUE)

Note that prob is automatically normalised (i.e. divided by its sum) in sample. You can check that things are working with:

  • table((x-1) %% 4 + 1) should be close to a/sum(a) * 10000
  • table((x-1) %/% 4 + 1) should be close to b/sum(b) * 10000
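The two checks above can be run side by side as follows. This is a minimal sketch: the seed and the names obs_a, obs_b, exp_a and exp_b are introduced here for illustration, and as.vector(outer(a, b)) is used in place of a %*% t(b) to make the flattening of the weight matrix explicit.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
a <- c(1, 2, 3, 4)
b <- c(1, 1, 2, 2, 3, 3, 4)

# outer(a, b) holds the same weights as a %*% t(b); as.vector() flattens it
# column by column, so the 4-way index varies fastest
x <- sample(28, 10000, prob = as.vector(outer(a, b)), replace = TRUE)

# Recover the 4-way and 7-way outlet of each draw and count them
obs_a <- as.vector(table((x - 1) %% 4 + 1))
obs_b <- as.vector(table((x - 1) %/% 4 + 1))
exp_a <- a / sum(a) * 10000
exp_b <- b / sum(b) * 10000

rbind(observed = obs_a, expected = exp_a)
rbind(observed = obs_b, expected = exp_b)
```

The observed rows should sit close to the expected rows, differing only by sampling noise.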
