
R BootStrap with Strata


I learned how to do bootstrap regression. Here is my example and code.

library(boot)
sampledata = data.frame(y = sample(0:20, size = 5000, replace = TRUE),
                        x1 = runif(5000),
                        x2 = runif(5000),
                        group1 = sample(1:3, size = 5000, replace = TRUE),
                        group2 = sample(0:1, size = 5000, replace = TRUE))


# function to obtain the regression coefficients from the data
# (despite its name, rsq returns coef(fit), not R-squared)
rsq <- function(formula, data, indices) {
  d <- data[indices,] 
  fit <- lm(formula, data=d)
  return(coef(fit))
}

# bootstrapping with 250 replications
results <- boot(data=sampledata, statistic=rsq,
   R=250, formula=y~x1+x2)
df <- data.frame(results$t)
names(df) <- names(results$t0)

However, if you want to run this process on subgroups of your data, how can you automate it? The code below shows the result I am after, but writing it out is tedious and I have many more group categories. Basically, I subset sampledata into 6 groups, run a separate bootstrap model on each of those data sets, and then combine the results in DATA.

I see that boot has a strata argument, but I cannot get it to work. Also, are you supposed to subset your data by group and then draw the bootstrap sample, or draw a bootstrap sample and then subset your data? Either way, I am wondering how to code this so that the bootstrap estimates are computed for all groups separately, with the models stratified.

sampledata1 = subset(sampledata, group1 == 1 & group2 == 0)
sampledata2 = subset(sampledata, group1 == 2 & group2 == 0)
sampledata3 = subset(sampledata, group1 == 3 & group2 == 0)
sampledata4 = subset(sampledata, group1 == 1 & group2 == 1)
sampledata5 = subset(sampledata, group1 == 2 & group2 == 1)
sampledata6 = subset(sampledata, group1 == 3 & group2 == 1)



# bootstrapping with 250 replications
results <- boot(data=sampledata1, statistic=rsq,
   R=250, formula=y~x1+x2)
df1 <- data.frame(results$t)
names(df1) <- names(results$t0)
df1$group1 = 1
df1$group2 = 0

# bootstrapping with 250 replications
results <- boot(data=sampledata2, statistic=rsq,
   R=250, formula=y~x1+x2)
df2 <- data.frame(results$t)
names(df2) <- names(results$t0)
df2$group1 = 2
df2$group2 = 0

# bootstrapping with 250 replications
results <- boot(data=sampledata3, statistic=rsq,
   R=250, formula=y~x1+x2)
df3 <- data.frame(results$t)
names(df3) <- names(results$t0)
df3$group1 = 3
df3$group2 = 0

# bootstrapping with 250 replications
results <- boot(data=sampledata4, statistic=rsq,
   R=250, formula=y~x1+x2)
df4 <- data.frame(results$t)
names(df4) <- names(results$t0)
df4$group1 = 1
df4$group2 = 1

# bootstrapping with 250 replications
results <- boot(data=sampledata5, statistic=rsq,
   R=250, formula=y~x1+x2)
df5 <- data.frame(results$t)
names(df5) <- names(results$t0)
df5$group1 = 2
df5$group2 = 1

# bootstrapping with 250 replications
results <- boot(data=sampledata6, statistic=rsq,
   R=250, formula=y~x1+x2)
df6 <- data.frame(results$t)
names(df6) <- names(results$t0)
df6$group1 = 3
df6$group2 = 1

DATA = rbind(df1, df2, df3, df4, df5, df6)

rsq <- function(formula, data, indices) {
  dataSUB <- data %>% filter(GROUP1 == group1, GROUP2 == group2)
  d <- dataSUB[indices,] 
  fit <- lm(formula, data=d)
  return(coef(fit))
}

This should do the trick:

set1 <- c(1,2,3)
set2 <- c(0,1)
for(i in set1) {
    for(j in set2) {
        sampleDataSet = subset(sampledata, group1 == i & group2 == j)
        results <- boot(data=sampleDataSet, statistic=rsq, R=250, formula=y~x1+x2)
        df <- data.frame(results$t)
        names(df) <- names(results$t0)
        df$group1 = i
        df$group2 = j
        assign(paste("df",i+(j*length(set1)),sep=""), df)
    }
}

Note: if you plan to do this with more groups, you can just update set1 and set2 and the naming scheme should continue to work, assuming 0 is always in set2 (the generated names stay unique as long as set1 runs 1, 2, ... and set2 runs 0, 1, ...).
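If collecting the results in a list is preferable to creating numbered variables with assign(), the same idea can be sketched with split() and lapply(). This is only a sketch under assumptions: the helper name boot_one, the seed, and the smaller sample size and replication count are made up for illustration and are not part of the original question.

```r
library(boot)

set.seed(42)  # assumed seed, only to make the sketch reproducible
n <- 600      # smaller than the 5000 rows in the question, to keep it quick
sampledata <- data.frame(y = sample(0:20, size = n, replace = TRUE),
                         x1 = runif(n),
                         x2 = runif(n),
                         group1 = sample(1:3, size = n, replace = TRUE),
                         group2 = sample(0:1, size = n, replace = TRUE))

# statistic: regression coefficients from one bootstrap resample
# (same signature as in the question; formula is passed by name through boot)
rsq <- function(formula, data, indices) {
  d <- data[indices, ]
  coef(lm(formula, data = d))
}

# hypothetical helper: bootstrap one subgroup and label its replicates
boot_one <- function(d) {
  res <- boot(data = d, statistic = rsq, R = 50, formula = y ~ x1 + x2)
  out <- data.frame(res$t)
  names(out) <- names(res$t0)
  out$group1 <- d$group1[1]
  out$group2 <- d$group2[1]
  out
}

# one data frame per group1 x group2 combination, bootstrapped separately
cells <- split(sampledata, list(sampledata$group1, sampledata$group2), drop = TRUE)
DATA <- do.call(rbind, lapply(cells, boot_one))
rownames(DATA) <- NULL
```

With this layout, adding a group level needs no change to the code, because split() picks up whatever combinations occur in the data.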

I hope this helps!

Using data.table may not improve the run time. However, here is a similar alternative that uses data.table and a helper function to generalize the process:

library(data.table)

setDT(sampledata)

boot_group <- function(sel_group1, sel_group2){
  sampledata1  <- sampledata[group1 == sel_group1 & group2 == sel_group2]
  results <- boot(data=sampledata1, statistic=rsq,
                  R=250, formula=y~x1+x2)
  df1 <- data.table(results$t)
  names(df1) <- names(results$t0)
  df1[, group1 := sel_group1]
  df1[, group2 := sel_group2]
  return(df1)
}

DATA <- data.table()
groups1 <- sampledata[, unique(group1)]
groups1 <- sort(groups1)
groups2 <- sampledata[, unique(group2)]
groups2 <- sort(groups2)

for (g2 in groups2) {
  for (g1 in groups1) {
    subdf <- boot_group(g1, g2)
    DATA <- rbind(DATA, subdf)
  }
}
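On the strata part of the question: as far as I understand the boot documentation, the strata argument only makes the resampling happen within each stratum. The statistic is still computed once per replicate on the whole resampled data set, so strata alone does not return separate estimates per group. One possible sketch is to let the statistic itself fit one model per group. The cell column, the coef_by_cell function, the seed, and the reduced sample size are assumptions for illustration.

```r
library(boot)

set.seed(1)  # assumed seed, only to make the sketch reproducible
n <- 600     # smaller than the 5000 rows in the question, to keep it quick
sampledata <- data.frame(y = sample(0:20, size = n, replace = TRUE),
                         x1 = runif(n),
                         x2 = runif(n),
                         group1 = sample(1:3, size = n, replace = TRUE),
                         group2 = sample(0:1, size = n, replace = TRUE))

# hypothetical column: one factor level per group1 x group2 combination
sampledata$cell <- interaction(sampledata$group1, sampledata$group2, drop = TRUE)

# statistic: fit the model separately in every cell of the resampled data
# and return all coefficients as one named vector
coef_by_cell <- function(data, indices, formula) {
  d <- data[indices, ]
  fits <- lapply(split(d, d$cell), function(g) coef(lm(formula, data = g)))
  unlist(fits)
}

# strata = cell: rows are resampled with replacement within each cell,
# so every cell keeps its original size in every replicate
results <- boot(data = sampledata, statistic = coef_by_cell, R = 100,
                strata = sampledata$cell, formula = y ~ x1 + x2)

boot_df <- data.frame(results$t)
names(boot_df) <- names(results$t0)
```

Because resampling within strata keeps each group's size fixed, this should behave like subsetting first and then bootstrapping each subset, which is the order the loop-based answers use. The result is wide (one column per cell and coefficient) rather than the long DATA layout above.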
