繁体   English   中英

在引导期间在 function 中使用 dplyr::group_by() 时出错::boot() function

[英]Error using dplyr::group_by() in function within boot::boot() function during bootstrapping

我目前正在尝试对一些数据进行引导分析,最终结果是获得围绕计数数据比例的引导置信区间。

例如,我尝试引导的当前数据将采用以下形式(字符):


> foo
   notes
1      a
2      b
3      c
4      c
5      b
6      c
7      b
8      c
9      a
10     a
11     c
12     b
13     d
14     e
15     f
16     f
17     g
18     a
19     b
20     c
21     c

您可以使用dput()到这里

structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c", 
"a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))

In trying to set up a function that will output a named vector similar to what is needed for the boot package to run properly ( see example here ), I have composed the following function that uses dplyr code:

library(dplyr)

notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()

  # take random rows 
  sampler <- d#[i,]
  
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>% 
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  
  output <- setNames(final_table$proportion, final_table$notes)
  
  return(output)
  
}

当这个版本的 function 与boot()一起运行时,它运行得很好,关键问题是它只是对整个数据集进行采样(由于代码的注释部分而没有执行引导程序)。 如果你运行这个,你会看到每个估计都是一样的。

bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)

bootstrap_analysis$t

如果我确实运行 function 并使用随机子集变量进行引导分析的部分,如下面的代码所示(与上面相同,但删除了注释):


notes_bootstrap <- function(d, i){
  # get global set
  global_set <- d %>% distinct()

  # take random rows 
  sampler <- d[i,]
  
  proportion_table <- sampler %>%
    count(.data$notes) %>%
    mutate(proportion = n/sum(n)) %>%
    ungroup()
  
  # combine with full set to turn NAs to 0s
  combined_table <- proportion_table %>% full_join(global_set)
  final_table <- combined_table %>% 
    select(-n) %>%
    mutate(proportion = if_else(is.na(proportion),0,proportion))
  
  output <- setNames(final_table$proportion, final_table$notes)
  
  return(output)
  
}

然后我收到以下错误:

> bootstrap_analysis <- boot(foo, notes_bootstrap, R = 100)
 Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "character" 

该问题的解决方案是运行此代码,以便引导分析按书面方式工作(可能是一个整洁的评估问题?),或者让某人建议一种更有效的方法来进行这种引导分析。

有趣的问题。 我试图在不使用引导包的情况下解决这个问题,而是使用基本功能(主要是出于透明目的)。

我可以弄清楚这一点:

#Assigning the provided structure to an object called "df"

df <- structure(list(notes = c("a", "b", "c", "c", "b", "c", "b", "c", 
                               "a", "a", "c", "b", "d", "e", "f", "f", "g", "a", "b", 
                               "c", "c")),
               class = "data.frame", row.names = c(NA, -21L))

#Specifying the bootstrap replications (as far as I know, it's, however, 
#rather recommended to use 10K replications and more)

B <- 100

#Number of observations (i.e., 21 in this case)

N <- nrow(df)

#setting the seed to ensure pseudo-randomness for the samples
#we just want to generate and, of course, to ensure general reproducibility 

set.seed(42, sample.kind = "Rounding")

#bootstrapping the proportion (aka mean) of the note "a" 

boot_note_a <- replicate(B, {

#taking random samples of the same sample size of those notes 
#and putting back the taken sample in the urn—for each iteration

notes_star <- sample(df$notes, N, replace = T)

#getting the proportion of the note "a" within each bootstrapped sample.
#Hence, we'll get B (100 in this case) times a proportion of the note "a"
#based on the respective bootstrapped sample.  

mean(notes_star == "a")

}) 

#getting the confidence interval (at a confidence level of 95%) of the 
#bootstrapped proportion of the note "a" in the bootstrapped sample

quantile(boot_note_a, prob = c(0.025, 0.975)) 

最后,我们可以快速 plot(双关语)这个结果是这样的:

#calculating the binwidth according to Freedman & Diaconis (1981); see also 
#Hyndman (1995)

binw <- 2 * IQR(boot_note_a) / length(boot_note_a)^(1/3) 

#plotting 
p1 <- qplot(boot_note_a, binwidth = binw, color = I("red") )
p2 <- qplot(sample = scale(boot_note_a), xlab = "theoretical", ylab = "sample")+ 
            geom_abline()
gridExtra::grid.arrange(p1, p2, ncol = 2)

归根结底,我认为你会得到想要的结果——至少对于音符“a”(诚然,我们必须对剩下的六个音符重复这个策略)。 因此,这种解决方案可能不是最有效的方式,但希望是透明的。 如果这个求解策略运行良好,我们可以调整它并通过使用 apply-family 左右来提高它的效率。

干杯,邱!

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM