![](/img/trans.png)
[英]How to perform a bootstrap and find 95% confidence interval for the median of a dataset
[英]Line graphs with 95% bootstrap confidence interval
嗨,我使用MatchIt
package 的遺傳匹配方法匹配了治療組和對照組的個體樣本。 結果變量是每個月的平均支出。 現在我正在嘗試生成一個折線圖來展示治療組和對照組的平均每月支出模式。 我想添加陰影區域來顯示這些線的引導置信區間。 我遇到的主要問題包括:
1. 成對引導
匹配樣本的自舉重采樣應基於每個匹配對,而不是每個個體。 謝天謝地,我從這篇文章中得到了解決方案。 代碼將在示例中提供。
2. 繪制引導置信區間
我不確定如何在圖表上 plot 引導置信區間。
這里我提供一個例子:
library(dplyr)
library(tidyr)
library(ggplot2)
ID <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L")
Pair <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6)
Treatment <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
Month_1 <- c(300, 150, 450, 100, 200, 300, 400, 600, 650, 150, 200, 400)
Month_2 <- c(400, 600, 650, 150, 200, 400, 500, 250, 700, 200, 300, 500)
Month_3 <- c(500, 250, 700, 200, 300, 500, 500, 250, 700, 200, 300, 500)
Month_4 <- c(600, 700, 650, 250, 500, 550, 600, 700, 650, 250, 500, 550)
Month_5 <- c(700, 200, 800, 300, 900, 800, 600, 700, 650, 250, 500, 550)
df <- data.frame(ID, Pair, Treatment, Month_1, Month_2, Month_3, Month_4, Month_5)
> df
ID Pair Treatment Month_1 Month_2 Month_3 Month_4 Month_5
1 A 1 1 300 400 500 600 700
2 B 1 0 150 600 250 700 200
3 C 2 1 450 650 700 650 800
4 D 2 0 100 150 200 250 300
5 E 3 1 200 200 300 500 900
6 F 3 0 300 400 500 550 800
7 G 4 1 400 500 500 600 600
8 H 4 0 600 250 250 700 700
9 I 5 1 650 700 700 650 650
10 J 5 0 150 200 200 250 250
11 K 6 1 200 300 300 500 500
12 L 6 0 400 500 500 550 550
在這個數據集中,“Month_1-5”表示每個人的每月支出; “治療”表示個體是在治療組(編碼為1)還是對照組(編碼為0); “Pair”表示每個匹配的對。 例如,個體 A 和 B 在樣本對中,因為它們共享相同的對數“1”。 因此,在 bootstrap 重采樣之后,如果 A 出現在任何樣本中,B 也應該出現。
在這里,我提供了一個代碼來進行引導重采樣並計算治療組和對照組每個月的平均支出:
### 95% paired bootstrap confidence interval
library(boot)
df <- df %>%
mutate(Pair = as.factor(Pair))
pair_ids <- levels(df$Pair)
# For mean of Month_1 in control group
est_control <- function(pairs, i) {
# Compute number of times each pair is present
numreps <- table(pairs[i])
# For each pair p, copy corresponding data indices numreps[p] times
ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
function(p) rep(which(df$Pair == p),
numreps[p])))
# Subset df with paired bootstrapped ids
md_boot <- df[ids,]
# Estimation
#boot_treat <- c(mean(df[df$Treatment == 0, "Month_1"]), mean(df[df$Treatment == 0, "Month_2"]),
# mean(df[df$Treatment == 0, "Month_3"]), mean(df[df$Treatment == 0, "Month_4"]),
# mean(df[df$Treatment == 0, "Month_5"]))
boot_control <- mean(md_boot[md_boot$Treatment == 0, "Month_1"])
# Return the mean
return(boot_control)
}
set.seed(1234)
boot_est <- boot(pair_ids, est_fun, R = 10000)
boot.ci(boot.out = boot_est, type = "bca")
# For mean of Month_1 in treatment group
est_treat <- function(pairs, i) {
# Compute number of times each pair is present
numreps <- table(pairs[i])
# For each pair p, copy corresponding data indices numreps[p] times
ids <- unlist(lapply(pair_ids[pair_ids %in% names(numreps)],
function(p) rep(which(df$Pair == p),
numreps[p])))
# Subset df with paired bootstrapped ids
md_boot <- df[ids,]
# Estimation
#boot_treat <- c(mean(df[df$Treatment == 0, "Month_1"]), mean(df[df$Treatment == 0, "Month_2"]),
# mean(df[df$Treatment == 0, "Month_3"]), mean(df[df$Treatment == 0, "Month_4"]),
# mean(df[df$Treatment == 0, "Month_5"]))
boot_treat <- mean(md_boot[md_boot$Treatment == 1, "Month_1"])
# Return the mean
return(boot_treat)
}
set.seed(1234)
boot_est <- boot(pair_ids, est_treat, R = 10000)
boot.ci(boot.out = boot_est, type = "bca")
在這種情況下,對照組“Month_1”的自舉置信區間為 [158.3, 441.7],治療組的自舉置信區間為 [250, 500]。
然后我計算對照組和治療組的平均每月支出,然后生成折線圖:
Monthly_spending <- df %>%
group_by(Treatment) %>%
summarise("1" = mean(Month_1, na.rm = T),
"2" = mean(Month_2, na.rm = T),
"3" = mean(Month_3, na.rm = T),
"4" = mean(Month_4, na.rm = T),
"5" = mean(Month_1, na.rm = T)) %>%
pivot_longer(c("1":"5"), names_to = "Month", values_to = "Spending")
Monthly_spending %>%
mutate(Treatment = as.factor(Treatment)) %>%
ggplot(aes(x = Month, y = Spending, group = Treatment, color = Treatment)) +
geom_point() +
geom_line(aes(group = Treatment)) +
ggtitle("Monthly Spending") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_colour_discrete("Groups", labels = c("Control", "Treatment")) +
theme(legend.position = "bottom")
但是,我不知道如何在這兩行周圍添加陰影區域,以指示我上面提供的代碼的引導置信區間。 或者,我可以一次又一次地運行每個時間點的代碼,然后在數據集中手動將其記錄到 plot 圖表上的間隔。 但是,我想知道是否有一種方法可以更有效地進行。 我將非常感謝您的幫助。
這似乎是實現最終目標的一種非常復雜的方法。 ggplot2 function mean-cl_boot
采用輸入向量並返回通過快速引導方法計算的 y、ymin 和 ymax 值。 它也直接在summarize
中工作,因此我們可以將您的整個代碼替換為:
df %>%
pivot_longer(starts_with("Month"), names_to = "Month") %>%
group_by(Month, Treatment) %>%
summarize(mean_cl_boot(value)) %>%
mutate(Month = as.numeric(sub("Month_", "", Month)),
Treatment = c("Control", "Treatment")[Treatment + 1]) %>%
ggplot(aes(Month, y, color = Treatment)) +
geom_ribbon(aes(fill = Treatment, ymin = ymin, ymax = ymax),
alpha = 0.1, color = NA) +
geom_line(size = 1) +
geom_point(size = 3) +
scale_fill_brewer(palette = "Set1") +
scale_color_brewer(palette = "Set1") +
labs(y = "Monthly spending") +
theme_minimal(base_size = 16) +
theme(legend.position = "bottom")
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.