Using the boot::boot() function with grouping variables in R

Question

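This is a question about using the boot() function with a grouping variable, and also about passing multiple columns of data into the bootstrap. Nearly all examples of boot() seem to pass a single column of data and bootstrap a simple mean.

My specific analysis uses stats::weighted.mean(x, w), which takes a vector of values "x" to average and a second vector of weights "w". The point is that this function needs two inputs, and I would like the solution to generalize to any function that takes multiple arguments.

I am also looking for a way to use this weighted-mean function in a dplyr-style workflow with group_by() variables. If the answer is "it can't be done with dplyr", that's fine; I'm just trying to figure it out.

Below I simulate a dataset with three groups (A, B, C), each with a different range of counts. I also try to write a function, "my.function", that will be used to bootstrap the weighted mean. This may be my first mistake: is this how I should set up the function so that the "counts" and "weights" columns of the data are passed into each bootstrap sample? Is there some other way to index the data?

In the summarise() call, I refer to the original data with "."; that may be another mistake.

The end result shows that I get the appropriate grouped calculations with mean() and weighted.mean(), but the confidence intervals from the boot() call are instead 95% confidence intervals around the global mean of the dataset.

Any suggestions on what I am doing wrong? Why does boot() refer to the entire dataset rather than the grouped subsets?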

library(tidyverse)
library(boot)


set.seed(20)

sample.data = data.frame(letter = rep(c('A','B','C'),each = 50) %>% as.factor(),
                         counts = c(runif(50,10,30), runif(50,40,60), runif(50,60,100)),
                         weights = sample(10,150, replace = TRUE))



##Define function to bootstrap
  ##I'm using stats::weighted.mean() which needs to take in two arguments

##############
my.function = function(data,index){

  d = data[index,]  #create bootstrap sample of all columns of original data?
  return(weighted.mean(d$counts, d$weights))  #calculate weighted mean using 'counts' and 'weights' columns
  
}

##############

## group by 'letter' and calculate weighted mean, and upper/lower 95% CI limits

## I pass data to boot using "." thinking that this would only pass each grouped subset of data 
  ##(e.g., only letter "A") to boot, but instead it seems to pass the entire dataset. 

sample.data %>% 
  group_by(letter) %>% 
  summarise(avg = mean(counts),
            wtd.avg = weighted.mean(counts, weights),
            CI.LL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[4],
            CI.UL = boot.ci(boot(., my.function, R = 100), type = "basic")$basic[5])

Below is a rough estimate of the 95% confidence interval around the global mean, to show that this is what boot() is doing inside my summarise() call above.

#Here is a rough 95% confidence interval estimate as +/-  1.96* Standard Error


mean(sample.data$counts) + c(-1,1) * 1.96 * sd(sample.data$counts)/sqrt(length(sample.data[,1]))

Answer 1

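The following base R solution handles bootstrapping by group. Note that boot::boot is called only once per group, rather than once for each interval limit.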

library(boot)

sp <- split(sample.data, sample.data$letter)
y <- lapply(sp, function(x){
  wtd.avg <- weighted.mean(x$counts, x$weights)
  basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
  CI.LL <- basic[4]
  CI.UL <- basic[5]
  data.frame(wtd.avg, CI.LL, CI.UL)
})

do.call(rbind, y)
#   wtd.avg    CI.LL    CI.UL
#A 19.49044 17.77139 21.16161
#B 50.49048 48.79029 52.55376
#C 82.36993 78.80352 87.51872

Final cleanup:

rm(sp)
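
As for why boot() saw the whole dataset in the question: with the magrittr pipe, `.` stands for the entire object on the left-hand side of `%>%`, and it is not re-bound to each group inside summarise(). A minimal sketch that checks this by counting rows, assuming the same simulated data as in the question (the column names rows_via_dot and rows_in_group are made up for illustration):

```r
library(dplyr)

# Same simulated data as in the question
set.seed(20)
sample.data <- data.frame(
  letter  = factor(rep(c("A", "B", "C"), each = 50)),
  counts  = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
  weights = sample(10, 150, replace = TRUE)
)

check <- sample.data %>%
  group_by(letter) %>%
  summarise(rows_via_dot  = nrow(.),  # `.` is the whole piped data frame
            rows_in_group = n())      # n() is evaluated once per group

print(check)
```

Here rows_via_dot is 150 for every group while rows_in_group is 50, so anything computed from `.` inside summarise() uses all 150 rows.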

A dplyr solution could be the following. It also uses map_dfr from the purrr package.

library(boot)
library(dplyr)

sample.data %>%
  group_split(letter) %>% 
  purrr::map_dfr(
    function(x){
      wtd.avg <- weighted.mean(x$counts, x$weights)
      basic <- boot.ci(boot(x, my.function, R = 100), type = "basic")$basic
      CI.LL <- basic[4]
      CI.UL <- basic[5]
      data.frame(wtd.avg, CI.LL, CI.UL)
    }
  )
#   wtd.avg    CI.LL    CI.UL
#1 19.49044 17.77139 21.16161
#2 50.49048 48.79029 52.55376
#3 82.36993 78.80352 87.51872
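
If staying inside group_by() and summarise() is preferred, one possible sketch is to hand the grouped columns themselves to a helper, since bare column names inside summarise() are evaluated per group. The helper name boot_wtd_ci and its arguments are hypothetical, not part of boot or dplyr, and returning an unnamed data frame from summarise() assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(boot)

# Same simulated data as in the question
set.seed(20)
sample.data <- data.frame(
  letter  = factor(rep(c("A", "B", "C"), each = 50)),
  counts  = c(runif(50, 10, 30), runif(50, 40, 60), runif(50, 60, 100)),
  weights = sample(10, 150, replace = TRUE)
)

# Hypothetical helper: bootstrap a weighted mean from two plain vectors
boot_wtd_ci <- function(x, w, R = 100) {
  d <- data.frame(x = x, w = w)
  # boot() resamples row indices; the statistic applies them to both columns
  stat <- function(data, index) weighted.mean(data$x[index], data$w[index])
  basic <- boot.ci(boot(d, stat, R = R), type = "basic")$basic
  data.frame(wtd.avg = weighted.mean(x, w),
             CI.LL = basic[4],
             CI.UL = basic[5])
}

res <- sample.data %>%
  group_by(letter) %>%
  summarise(boot_wtd_ci(counts, weights))  # one-row data frame per group

print(res)
```

Because the helper takes plain vectors, the same pattern should carry over to other statistics that need more than one input: add the extra columns to the data frame built inside the helper and index them all with the same bootstrap indices.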

Answer 1
1 accepted, 2020-12-31 18:36:44
