
Bootstrap with multiple rows as observation in R

I'm trying to bootstrap my data, but it is extremely slow. I use R on a Windows laptop. I have a dataset in which multiple rows together make up one observation, and these rows need to stay together when I bootstrap the data. A variable called VacancyId indicates which rows belong to which subject. I use a variation of this method, as my outcome follows a Poisson distribution. The real bottleneck in my code is the lapply call:

library(dplyr)    # for bind_rows
library(xgboost)

bootSimFun.can <- function(preddata,opreddata,data) {
# sample by VacancyId because of dependencies
samp <- sample(unique(data$VacancyId), replace=TRUE)
# create bootstrapped data
bdata <- bind_rows(lapply(samp, function(x) data[data$VacancyId == x,]))
# remove NA
bdata <- na.omit(bdata)
# create x with the appropriate variables
x <- bdata[,c('VacancyBankId', 'VacancyFunctionId', 
'VacancyEducationLevelId', 'VacancyProvinceId')]
# make sure the variables are seen as categorical
x$FunctionId <- as.factor(x$VacancyFunctionId)
x$EducationLevel <- as.factor(x$VacancyEducationLevelId)
x$ProvinceId <- as.factor(x$VacancyProvinceId)
x$VacancyBankId <- as.factor(x$VacancyBankId)
# allocate outcome
y <- bdata$CandidatesPerWeek
# create dummy matrix
x.onehot <- model.matrix(~ . + 0, data = x)
# create parameters for the xgboost
xgb_params <- list("objective" = "count:poisson",
                 "eval_metric" = "rmse")
# train model
newmodel <- xgboost(data = x.onehot,
                  label = y,
                  nrounds = 10,
                  params = xgb_params)
# make predictions
bpred <- predict(newmodel,type="response",newdata=preddata)
# make predictions for 60 days
bpred <- bpred*(60/7)
# bind the predictions with the original data
bpredictions <- cbind(opreddata, bpred)
# we are interested in the predictions at campaign level, so we sum up by Function, Education, and Province
# which vacancybanks are used is specified in the make.predictions dataframe
aggregate <- aggregate(bpred ~ VacancyFunctionId + VacancyEducationLevelId + 
VacancyProvinceId, data=bpredictions, sum, na.rm=TRUE)
# make sure it is in the right order
aggregate <- aggregate[order(aggregate$VacancyFunctionId, 
aggregate$VacancyEducationLevelId, aggregate$VacancyProvinceId),]
# generate random numbers from a Poisson distribution with the mean (lambda)
# equal to the predicted values from the refitted model
rpois(length(aggregate$bpred), lambda=aggregate$bpred)
}

Especially the lapply call is really slow. Is there an alternative?
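For context, a function like this is typically called many times, so every slow step inside it is multiplied by the number of bootstrap replicates. A minimal driver sketch (the replicate count B and the way the draws are collected are placeholders, not part of the question):

# run B bootstrap replicates; each call refits xgboost on a resampled data set
B <- 1000
boot_draws <- replicate(B, bootSimFun.can(preddata, opreddata, data), simplify = FALSE)
# boot_draws is a list with one vector of simulated campaign-level counts per replicate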

A small part of my data:

structure(list(VacancyId = structure(c(4L, 3L, 6L, 7L, 3L, 6L, 
6L, 7L, 3L, 4L, 7L, 4L, 4L, 2L, 2L, 2L, 2L, 1L, 5L, 5L), .Label = c("57772", 
"57775", "57818", "57820", "57821", "57822", "57871"), class = "factor"), 
VacancyBankId = structure(c(2L, 1L, 2L, 2L, 3L, 1L, 3L, 3L, 
2L, 4L, 6L, 1L, 3L, 7L, 3L, 5L, 1L, 4L, 1L, 2L), .Label = c("2", 
"17", "147", "257", "991", "1565", "1609"), class = "factor"), 
VacancyFunctionId = structure(c(2L, 1L, 1L, 2L, 1L, 1L, 1L, 
2L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 2L, 2L), .Label = c("3", 
"4", "5", "11"), class = "factor"), VacancyEducationLevelId = structure(c(2L, 
2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 
3L, 2L, 2L, 2L), .Label = c("4", "6", "8"), class = "factor"), 
VacancyProvinceId = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("19", 
"21"), class = "factor"), CandidatesPerWeek = c(0, 2, 0, 
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)), 
.Names = c("VacancyId", "VacancyBankId", "VacancyFunctionId", "VacancyEducationLevelId", "VacancyProvinceId", "CandidatesPerWeek"), 
row.names = c(3L, 9L, 10L, 19L, 20L, 26L, 27L, 33L, 37L, 38L, 56L, 57L, 58L, 69L, 70L, 72L, 73L, 122L, 125L, 128L), class = "data.frame")

The outcome is CandidatesPerWeek. VacancyBankId, VacancyFunctionId, VacancyEducationLevelId and VacancyProvinceId are the inputs.
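To reproduce the question's setup on this sample, the structure() output above can be assigned to a variable (say data) and turned into the one-hot design matrix the question builds; a small sketch reusing the question's column names:

# assign the dput() output above to `data`, then:
x <- data[, c('VacancyBankId', 'VacancyFunctionId',
              'VacancyEducationLevelId', 'VacancyProvinceId')]
y <- data$CandidatesPerWeek
x.onehot <- model.matrix(~ . + 0, data = x)   # dummy-codes the factor columns
dim(x.onehot)                                 # 20 rows, one column per dummy variable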

I convert to data.table, convert VacancyId to a factor, resample the factor levels and reorder the table by them, and I'm a bit more than 5 times faster both on your sample data and on a 1 million row table made from your data.

I use this code:

library(data.table)

data2 <- copy(data)
data2 <- setDT(data2)
data2$VacancyId <- factor(data2$VacancyId, sample(levels(data2$VacancyId), replace=TRUE))
setorder(data2, VacancyId)

And here's the benchmark:

data2 <- copy(data)
data2 <- setDT(data2)
data2$VacancyId <- factor(data2$VacancyId)

f1 <- function(data){
  samp <- sample(unique(data$VacancyId), replace=TRUE)
  bdata <- bind_rows(lapply(samp, function(x) data[data$VacancyId == x,]))
  bdata
}

f2 <- function(data2){
  data2$VacancyId <- factor(data2$VacancyId,sample(levels(data2$VacancyId),replace=TRUE))
  setorder(data2,VacancyId)
  data2
}

library(microbenchmark)
microbenchmark(f1(data),f2(data2),times=1000)
# Unit: microseconds
#      expr      min        lq      mean    median       uq      max neval
#  f1(data) 2193.213 2406.9770 2616.2763 2492.2700 2591.189 21471.67  1000
# f2(data2)  308.261  372.3195  452.6593  409.4805  450.889 18877.83  1000

About 5 times faster. Let's check with bigger data (1 million rows replicated from your example):

set.seed(1)
big_data <- data[sample(1:nrow(data),1000000,replace=TRUE),]
big_data2 <- copy(big_data)
big_data2 <- setDT(big_data2)
big_data2$VacancyId <- factor(big_data2$VacancyId)
library(microbenchmark)
microbenchmark(f1(big_data),f2(big_data2),times=50)

# Unit: milliseconds
# expr       min       lq     mean   median       uq      max neval
# f1(big_data) 525.38332 556.3378 598.5809 570.0738 592.2092 899.6736    10
# f2(big_data2)  61.43292  66.3120 124.4283 107.0262 123.4962 374.3961    10

Still 5 times faster.

Found it!

library(data.table)

samp <- sample(unique(d.9weeks$VacancyId), replace = TRUE)
datDT <- as.data.table(d.9weeks)
setkey(datDT, "VacancyId")
# create bootstrapped data: the keyed join repeats the whole block of rows
# for every sampled (possibly duplicated) VacancyId
bdata <- datDT[J(samp), allow.cartesian = TRUE]
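
A minimal sketch of how this join-based sampling could replace the lapply/bind_rows step inside bootSimFun.can; the conversion and keying are done once outside the bootstrap loop, and the helper name bootSample is made up for illustration:

library(data.table)

# one-time setup outside the bootstrap loop
datDT <- as.data.table(data)
setkey(datDT, VacancyId)

# hypothetical helper: resample whole vacancies with replacement
bootSample <- function(datDT) {
  samp <- sample(unique(datDT$VacancyId), replace = TRUE)
  # keyed join: a VacancyId drawn twice contributes its block of rows twice
  datDT[J(samp), allow.cartesian = TRUE]
}

# inside bootSimFun.can, the sampling step would then become roughly:
# bdata <- na.omit(bootSample(datDT))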
