Stratified sampling or proportional sampling in R
I have a data set generated as follows:
N <- 200  # total number of records, as implied by the 50*20/200 example in the question
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
I would like to draw a stratified sample from myData with a given sample size, e.g. 50. The resulting sample should follow the proportional allocation of the original data set in terms of "group". For instance, if myData has 200 records, 20 of which belong to group 4, then the resulting sample should have 50*20/200 = 5 records belonging to group 4. How can I do that in R?
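The arithmetic in that example applies to every group: a stratum with N_h of the N records receives n * N_h / N of the n sampled records. Below is a minimal sketch of that allocation. The helper name proportional_allocation and the group sizes are made up for illustration and are not part of the original post.

```r
# Hypothetical helper (not from the original post): split a total sample
# size n across strata in proportion to each stratum's share of the data.
proportional_allocation <- function(groups, n) {
  # Number of records in each stratum, as a named integer vector
  counts <- vapply(split(groups, groups), length, integer(1))
  # Fractional share of n for each stratum, rounded to whole records.
  # Rounding means the total can differ from n by a few records in general.
  round(n * counts / sum(counts))
}

# Made-up data: 200 records, 20 of which belong to group 4
groups <- rep(c(3, 4, 5), times = c(100, 20, 80))
alloc <- proportional_allocation(groups, 50)
alloc[["4"]]  # 5, i.e. 50 * 20 / 200
```

With these particular group sizes the rounded allocations happen to sum to exactly 50. With other sizes the sum may be slightly off, and a tie-breaking rule would be needed if the total has to be exact.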
You can use my stratified function, specifying a value < 1 as your proportion, like this:
## Sample data. Seed for reproducibility
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))
## Taking the sample
out <- stratified(myData, "group", .3)
out
# a b group
# 17 17 -0.02 2
# 8 8 0.74 3
# 25 25 0.62 3
# 49 49 -0.11 3
# 4 4 1.60 3
# 26 26 -0.06 4
# 27 27 -0.16 4
# 7 7 0.49 4
# 12 12 0.39 4
# 40 40 0.76 4
# 32 32 -0.10 4
# 9 9 0.58 5
# 42 42 -0.25 5
# 43 43 0.70 5
# 37 37 -0.39 5
# 11 11 1.51 6
Compare the per-group counts in the sample with what we would expect:
round(table(myData$group) * .3)
#
# 2 3 4 5 6
# 1 4 6 4 1
table(out$group)
#
# 2 3 4 5 6
# 1 4 6 4 1
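The definition of stratified is not shown in this post, so here is a rough base-R sketch of the same idea for readers who want to see the mechanics. It is not the author's function. The name sample_by_proportion is hypothetical, and the rounding rule is an assumption taken from the round(table(...) * .3) check above.

```r
# Hypothetical base-R helper, NOT the `stratified` function used in the answer.
# Draws round(prop * group size) rows from each group without replacement.
sample_by_proportion <- function(df, group_col, prop) {
  # Row indices of df, split by stratum
  idx_by_group <- split(seq_len(nrow(df)), df[[group_col]])
  picked <- lapply(idx_by_group, function(idx) {
    k <- round(length(idx) * prop)
    # Sample positions within idx rather than calling sample(idx, k):
    # sample(x, k) draws from 1:x when x is a single number
    idx[sample.int(length(idx), k)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

## Same sample data as in the answer
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))

out2 <- sample_by_proportion(myData, "group", 0.3)
table(out2$group)  # per-group counts, to compare with round(table(myData$group) * .3)
```

The rows drawn will differ from the output shown above, since the random draws are made in a different way. Only the per-group counts are expected to agree.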
You can also easily take a fixed number of samples per group, like this:
stratified(myData, "group", 2)
# a b group
# 34 34 -0.05 2
# 17 17 -0.02 2
# 49 49 -0.11 3
# 22 22 0.78 3
# 12 12 0.39 4
# 7 7 0.49 4
# 18 18 0.94 5
# 33 33 0.39 5
# 45 45 -0.69 6
# 11 11 1.51 6
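The fixed-size variant can be sketched in base R as well. Again, this is only an illustration and not the author's function. The name sample_n_per_group is hypothetical, and taking every row of a group that has fewer than n rows is an assumption of this sketch, not documented behaviour of stratified.

```r
# Hypothetical base-R helper, NOT the `stratified` function used in the answer.
# Draws up to n rows from each group without replacement.
sample_n_per_group <- function(df, group_col, n) {
  idx_by_group <- split(seq_len(nrow(df)), df[[group_col]])
  picked <- lapply(idx_by_group, function(idx) {
    # Assumption: a group smaller than n contributes all of its rows
    k <- min(n, length(idx))
    idx[sample.int(length(idx), k)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

## Same sample data as in the answer
set.seed(1)
N <- 50
myData <- data.frame(a=1:N,b=round(rnorm(N),2),group=round(rnorm(N,4),0))

out3 <- sample_n_per_group(myData, "group", 2)
table(out3$group)  # at most 2 rows per group
```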