在引导期间在 function 中使用 dplyr::group_by() 时出错::boot() function

Question

我目前正在尝试对一些数据进行引导分析，最终结果是获得围绕计数数据比例的引导置信区间。

例如，我尝试引导的当前数据将采用以下形式（字符）：

您可以使用dput()到这里

structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c", 
"a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))

In trying to set up a function that will output a named vector similar to what is needed for the boot package to run properly ( see example here ), I have composed the following function that uses dplyr code:

library(dplyr)

notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()

  # take random rows 
  sampler <- d#[i,]
  
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>% 
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  
  output <- setNames(final_table$proportion, final_table$notes)
  
  return(output)
  
}

当这个版本的 function 与boot()一起运行时，它运行得很好，关键问题是它只是对整个数据集进行采样（由于代码的注释部分而没有执行引导程序）。 如果你运行这个，你会看到每个估计都是一样的。

bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)

bootstrap_analysis$t

如果我确实运行 function 并使用随机子集变量进行引导分析的部分，如下面的代码所示（与上面相同，但删除了注释）：


notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()

  # take random rows 
  sampler <- d[i,]
  
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>% 
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  
  output <- setNames(final_table$proportion, final_table$notes)
  
  return(output)
  
}

然后我收到以下错误：

> bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)
 Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "character"

该问题的解决方案是运行此代码，以便引导分析按书面方式工作（可能是一个整洁的评估问题？），或者让某人建议一种更有效的方法来进行这种引导分析。

Answer 1

有趣的问题。 我试图在不使用引导包的情况下解决这个问题，而是使用基本功能（主要是出于透明目的）。

我可以弄清楚这一点：

#Assigning the provided structure to an object called "df"

df <- structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c", 
                               "a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", 
                               "c", "c")),
               class = "data.frame", row.names = c(NA, -21L))

#Specifying the bootstrap replications (as far as I know, it's, however, 
#rather recommended to use 10K replications and more)

B <- 100

#Number of observations (i.e., 21 in this case)

N <- nrow(df)

#setting the seed to ensure pseudo-randomness for the samples
#we just want to generate and, of course, to ensure general reproducibility 

set.seed(42, sample.kind = "Rounding")

#bootstrapping the proportion (aka mean) of the note "a" 

boot_note_a <- replicate(B, {

#taking random samples of the same sample size of those notes 
#and putting back the taken sample in the urn—for each iteration

notes_star <- sample(df$notes, N, replace = T)

#getting the proportion of the note "a" within each bootstrapped sample.
#Hence, we'll get B (100 in this case) times a proportion of the note "a"
#based on the respective bootstrapped sample.  

mean(notes_star == "a")

}) 

#getting the confidence interval (at a confidence level of 95%) of the 
#bootstrapped proportion of the note "a" in the bootstrapped sample

quantile(boot_note_a, prob = c(0.025, 0.975))

最后，我们可以快速 plot（双关语）这个结果是这样的：

#calculating the binwidth according to Freedman & Diaconis (1981); see also 
#Hyndman (1995)

binw <- 2 * IQR(boot_note_a) / length(boot_note_a)^(1/3) 

#plotting 
p1 <- qplot(boot_note_a, binwidth = binw, color = I("red") )
p2 <- qplot(sample = scale(boot_note_a), xlab = "theoretical", ylab = "sample")+ 
            geom_abline()
gridExtra::grid.arrange(p1, p2, ncol = 2)

归根结底，我认为你会得到想要的结果——至少对于音符“a”（诚然，我们必须对剩下的六个音符重复这个策略）。 因此，这种解决方案可能不是最有效的方式，但希望是透明的。 如果这个求解策略运行良好，我们可以调整它并通过使用 apply-family 左右来提高它的效率。

干杯，邱！

在引导期间在 function 中使用 dplyr::group_by() 时出错::boot() function

问题描述

1 个解决方案

解决方案1
1 已采纳 2020-07-20 16:33:46

在引导期间在 function 中使用 dplyr::group_by() 时出错::boot() function

问题描述

1 个解决方案

解决方案1 1 已采纳 2020-07-20 16:33:46

解决方案1
1 已采纳 2020-07-20 16:33:46