Using the boot::boot() function with grouped variables in R

Question

This is a question about using the boot() function with grouped variables, and also about passing multiple columns of data into boot(). Almost all examples of the boot() function seem to pass a single column of data to calculate a simple bootstrap of the mean.

My specific analysis uses the stats::weighted.mean(x, w) function, which takes a vector 'x' of values to average and a second vector 'w' of weights. The main point is that I need two inputs to this function, and I'm hoping the solution will generalize to any function that takes multiple arguments.

I'm also looking for a way to use weighted.mean() in a dplyr-style workflow with group_by() variables. If the answer is that "it can't be done with dplyr", that's fine, I'm just trying to figure it out.

Below I simulate a dataset with three groups (A, B, C) that each have a different range of counts. I also attempt to write a function, "my.function", that will be used to bootstrap the weighted average. Here might be my first mistake: is this how I would set up a function to pass the 'counts' and 'weights' columns of data into each bootstrapped sample? Is there some other way to index the data?

Inside the summarise() call, I reference the original data with ".". Possibly another mistake?

The end result shows that I was able to get correctly grouped calculations using mean() and weighted.mean(), but the calls for confidence intervals using boot() have instead calculated the 95% confidence interval around the global mean of the dataset.

Any suggestions on what I'm doing wrong? Why is the boot() function referencing the entire dataset and not the grouped subsets?

library(tidyverse)
library(boot)


set.seed(20)

sample.data = data.frame(letter = rep(c('A','B','C'),each = 50) %>% as.factor(),
                         counts = c(runif(50,10,30), runif(50,40,60), runif(50,60,100)),
                         weights = sample(10,150, replace = TRUE))



##Define function to bootstrap
  ##I'm using stats::weighted.mean() which needs to take in two arguments

##############
my.function = function(data,index){

  d = data[index,]  #create bootstrap sample of all columns of original data?
  return(weighted.mean(d$counts, d$weights))  #calculate weighted mean using 'counts' and 'weights' columns
  
}

##############

## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits

## I pass data to boot using "." thinking that this would only pass each grouped subset of data 
  ##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset. 

sample.data %>% 
  group_by(letter) %>% 
  summarise(avg = mean(counts),
            wtd.avg = weighted.mean(counts, weights),
            CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
            CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])

Below I've calculated a rough estimate of the 95% confidence interval around the global mean, to show that this is what boot() was computing in my summarise() call above.

#Here is a rough 95% confidence interval estimate as +/-  1.96* Standard Error


mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))

Answer 1

The following base R solution solves the problem of bootstrapping by groups. Note that boot::boot is only called once per group, rather than once for each confidence limit.

library(boot)

sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})

do.call(rbind, y)
#   wtd.avg    CI.LL    CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872

Final clean-up:

rm(sp)
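
On the question's broader point about statistics that need several inputs: boot() forwards any extra named arguments to the statistic function through its `...` argument, so the column names do not have to be hard-coded. The following is a minimal sketch under that assumption; `wtd_mean_stat`, `value_col` and `wt_col` are hypothetical names chosen for illustration, and the data are regenerated so the block runs on its own.

```r
library(boot)

# Recreate the question's simulated data so this block is self-contained
set.seed(20)
sample.data <- data.frame(
  letter  = factor(rep(c("A", "B", "C"), each = 50)),
  counts  = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
  weights = sample(10, 150, replace = TRUE)
)

# Hypothetical statistic: the first two arguments are what boot() requires
# (data, resampling index); the column names arrive through boot()'s `...`
wtd_mean_stat <- function(data, index, value_col, wt_col) {
  d <- data[index, , drop = FALSE]          # resample whole rows
  weighted.mean(d[[value_col]], d[[wt_col]])
}

# Bootstrap a single group; the extra named arguments are passed on unchanged
group_a <- sample.data[sample.data$letter == "A", ]
b <- boot(group_a, wtd_mean_stat, R = 200,
          value_col = "counts", wt_col = "weights")

b$t0  # statistic on the original (unresampled) rows of group A
```

Resampling whole rows keeps each count paired with its weight, which is why the index is applied to the data frame rather than to one column at a time.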

A dplyr solution could be the following. It also calls map_dfr from package purrr.

library(boot)
library(dplyr)

sample.data %>%
  group_split(letter) %>% 
  purrr::map_dfr(
    function(x){
      wtd.avg <- weighted.mean(x$counts, x$weights)
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      CI.LL <- basic[4]
      CI.UL <- basic[5]
      data.frame(wtd.avg, CI.LL, CI.UL)
    }
  )
#   wtd.avg    CI.LL    CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872
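
As for why the original summarise() call bootstrapped the whole dataset: with the magrittr pipe, `.` stands for the object on the left-hand side of `%>%`, which is the full grouped data frame, not the rows of the current group. The sketch below first checks that, then shows one possible way to stay inside summarise() by passing the current group's rows with pick(everything()). It assumes dplyr 1.1.0 or later (where pick() is available); `basic_ci` is a hypothetical helper, not part of either package.

```r
library(boot)
library(dplyr)

# Recreate the question's simulated data and statistic
set.seed(20)
sample.data <- data.frame(
  letter  = factor(rep(c("A", "B", "C"), each = 50)),
  counts  = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
  weights = sample(10, 150, replace = TRUE)
)

my.function <- function(data, index) {
  d <- data[index, ]
  weighted.mean(d$counts, d$weights)
}

# `.` is the whole piped data frame in every group; n() is the group size
check <- sample.data %>%
  group_by(letter) %>%
  summarise(rows_in_dot = nrow(.), rows_in_group = n())
check  # rows_in_dot is 150 for each letter, rows_in_group is 50

# Hypothetical helper: one boot() call per group, both limits returned together
basic_ci <- function(d, R = 100) {
  ci <- boot.ci(boot(d, my.function, R = R), type = "basic")$basic
  data.frame(CI.LL = ci[4], CI.UL = ci[5])
}

# pick(everything()) yields the current group's non-grouping columns;
# the unnamed data frame returned by basic_ci() is unpacked into columns
result <- sample.data %>%
  group_by(letter) %>%
  summarise(
    wtd.avg = weighted.mean(counts, weights),
    basic_ci(pick(everything()))
  )
result
```

The exact interval limits depend on the random number stream, so they will not necessarily match the values printed above.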
