简体   繁体   English

为分层抽样准备抽样分布的最佳 R 包函数

[英]Best R package function to prepare sampling distribution for stratified sampling

I'm attempting to prepare a demonstration in R of how the repeated stratified random sampling of a small population results in a near-normal sampling distribution of means.我试图在 R 中准备一个演示,说明对小群体的重复分层随机抽样如何导致均值的接近正态抽样分布。 As an example consider the R code below (which works but is very slow due to looping).作为一个例子,考虑下面的 R 代码(它工作但由于循环而非常慢)。

#Dummy population made up of dice throws - 18 per row
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
P1 <- as.data.frame(c(5,6,5,1,6,4,2,2,4,4,6,6,5,2,3,5,1,6))
P1$Zn <- 1
names(P1) <- c('Die','Zn')
Dt <- P1

P2 <- as.data.frame(c(2,5,4,5,5,5,3,3,2,5,6,1,2,5,4,3,6,1))
P2$Zn <- 2
names(P2) <- c('Die','Zn')
Dt <- rbind(Dt,P2)

# Empty dataframe to hold random draws
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Smps <- data.frame(Die = numeric(), Zn= numeric(),Drw = numeric())

# Draw stratifed samples one from each row
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
print(paste('Start','at',Sys.time()))
n <- 10000          # number of draws
r <- 2              # number of rows (the strata)
for (j in 1:n){
  # for a 2 strata
  for (i in 1:r){
    #sub set strata
    x <- subset(Dt, Dt$Zn == i)
    # random sample
    y <- x[sample(1:18,1),]
    y$Drw <- j
    #append sample
    Smps <- rbind(Smps,y)
  }
  # report progress
  if(right(j,3) == '000'){
    print(paste(j,'at',Sys.time()))
    flush.console()
  }
}

# Compute the sample means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mns <-aggregate(Smps[, 1], list(Smps$Drw), mean)

# Density plot of means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
d <- density(Mns$x)
plot(d,xlab = 'Means', las=1, main = '')
polygon(d, col="blue", border="blue")

I'm expecting there is an R package with a function that does this type of stratified sampling but I am struggling to find the one that works in the manner I can understand.我期待有一个 R 包,它的功能可以进行这种类型的分层采样,但我正在努力寻找以我能理解的方式工作的包。 Something that inputs a data frame with a grouping field and the number of samples to be drawn from each group is something I'm expecting has already been written to allow a repeat sampling by a group.输入带有分组字段的数据帧和要从每个组中抽取的样本数量的东西是我期望已经写入的东西,以允许一组重复采样。 Any pointers to examples that work would be appreciated.任何指向工作示例的指针将不胜感激。 Ideally, I would like to prepare to say 100,000 stratified samples from a known population with many more strata and then plot the distribution of the means (but quickly)理想情况下,我想准备从具有更多层的已知总体中抽取 100,000 个分层样本,然后绘制均值分布(但很快)

After a while away from this problem, I found a package called 'fifer' ( https://www.rdocumentation.org/packages/fifer/versions/1.1 ) which seem contained a stratified function in a package but unfortunately, this package does not work on the latest versions of R. I did, however, find a clever stratified function from Ananda Mahto ( https://gist.github.com/mrdwab/6424112 ) which works well but at the cost of having a rather long function in your script rather than the one line of loading a package.离开这个问题一段时间后,我发现了一个名为 'fifer' ( https://www.rdocumentation.org/packages/fifer/versions/1.1 ) 的包,它似乎在包中包含一个分层功能,但不幸的是,这个包确实不适用于最新版本的 R。但是,我确实从 Ananda Mahto ( https://gist.github.com/mrdwab/6424112 ) 找到了一个巧妙的分层函数,该函数运行良好,但代价是函数相当长在您的脚本中,而不是加载包的一行。 My solution to the problem above using this function is below.我使用此功能解决上述问题的方法如下。

#Dummy population made up of dice throws - 18 per row
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
P1 <- as.data.frame(c(5,6,5,1,6,4,2,2,4,4,6,6,5,2,3,5,1,6))
P1$Zn <- 1
names(P1) <- c('Die','Zn')
Dt <- P1

P2 <- as.data.frame(c(2,5,4,5,5,5,3,3,2,5,6,1,2,5,4,3,6,1))
P2$Zn <- 2
names(P2) <- c('Die','Zn')
Dt <- rbind(Dt,P2)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Stratfed function from web
# https://gist.github.com/mrdwab/6424112
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

stratified <- function(df, group, size, select = NULL, 
                       replace = FALSE, bothSets = FALSE) {
  if (is.null(select)) {
    df <- df
  } else {
    if (is.null(names(select))) stop("'select' must be a named list")
    if (!all(names(select) %in% names(df)))
      stop("Please verify your 'select' argument")
    temp <- sapply(names(select),
                   function(x) df[[x]] %in% select[[x]])
    df <- df[rowSums(temp) == length(select), ]
  }
  df.interaction <- interaction(df[group], drop = TRUE)
  df.table <- table(df.interaction)
  df.split <- split(df, df.interaction)
  if (length(size) > 1) {
    if (length(size) != length(df.split))
      stop("Number of groups is ", length(df.split),
           " but number of sizes supplied is ", length(size))
    if (is.null(names(size))) {
      n <- setNames(size, names(df.split))
      message(sQuote("size"), " vector entered as:\n\nsize = structure(c(",
              paste(n, collapse = ", "), "),\n.Names = c(",
              paste(shQuote(names(n)), collapse = ", "), ")) \n\n")
    } else {
      ifelse(all(names(size) %in% names(df.split)),
             n <- size[names(df.split)],
             stop("Named vector supplied with names ",
                  paste(names(size), collapse = ", "),
                  "\n but the names for the group levels are ",
                  paste(names(df.split), collapse = ", ")))
    }
  } else if (size < 1) {
    n <- round(df.table * size, digits = 0)
  } else if (size >= 1) {
    if (all(df.table >= size) || isTRUE(replace)) {
      n <- setNames(rep(size, length.out = length(df.split)),
                    names(df.split))
    } else {
      message(
        "Some groups\n---",
        paste(names(df.table[df.table < size]), collapse = ", "),
        "---\ncontain fewer observations",
        " than desired number of samples.\n",
        "All observations have been returned from those groups.")
      n <- c(sapply(df.table[df.table >= size], function(x) x = size),
             df.table[df.table < size])
    }
  }
  temp <- lapply(
    names(df.split),
    function(x) df.split[[x]][sample(df.table[x],
                                     n[x], replace = replace), ])
  set1 <- do.call("rbind", temp)

  if (isTRUE(bothSets)) {
    set2 <- df[!rownames(df) %in% rownames(set1), ]
    list(SET1 = set1, SET2 = set2)
  } else {
    set1
  }
}


# Empty dataframe to hold random draws
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Smps <- data.frame(Die = numeric(), Zn = numeric())

# Right function for reporting progress
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
right = function(text, num_char) {
  substr(text, nchar(text) - (num_char-1), nchar(text))
}

# Draw stratifed samples one from each row
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
n <- 10000          # number of draws
for (j in 1:n){
    y <- stratified(Dt,"Zn",1)
    y <- cbind(y,j)
    Smps <- rbind(Smps,y)
  # report progress
  if(right(j,3) == '000'){
    print(paste(j,'at',Sys.time()))
    flush.console()
  }
}

# Compute the sample means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mns <-aggregate(Smps[, 1], list(Smps$j), mean)

# Density plot of means
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
d <- density(Mns$x)
plot(d,xlab = 'Means', las=1, main = '')
polygon(d, col="blue", border="blue")

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM