How to iteratively train forecast models (GAM, MARS, …) on selected days and calculate the variable importance for each time period
I have a data table whose number of columns and column names vary, and a numeric variable called days (its value also varies; here it is 50):
library(data.table)
library(caret)
days <- 50
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
"DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
"Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
I am fitting a linear model (LM) to predict the DE column, and I calculate the variable importance for each block of days rows. See the following code snippet:
## MODEL FITTING: ##
## Linear Model: ##
## Function that calculates the iteratively prediction: ##
calcPred <- function(data){
## Model fitting: ##
xgbModel <- stats::lm(DE ~ .-1-date, data = data)
## Prediction on the same data: ##
stats::predict.lm(xgbModel, data)
}
## Function that calculates the iteratively variable importance: ##
varImportance <- function(data){
## Model fitting: ##
xgbModel <- stats::lm(DE ~ .-1-date, data = data)
terms <- attr(xgbModel$terms , "term.labels")
varimp <- caret::varImp(xgbModel)
importance <- data[, .(date, imp = t(varimp))]
}
## Train Data PREDICTION with iteratively xgbModel: ##
dt.train <- dt.train[, c('prediction') := calcPred(.SD), by = seq_len(nrow(dt.train)) %/% days]
## Iteratively variable importance:##
dt.importance <- data.table::copy(dt.train[, c("prediction") := NULL])
dt.importance <- dt.importance[, varImportance(.SD), by = seq_len(nrow(dt.train)) %/% days]
What happens here: the model is trained on a block of 50 days, and the prediction is then made for exactly those 50 training days. This continues until the end date of the table. In addition, the varImportance() function gives the variable importances of the predictors (all columns except date and DE) within each training interval (here, every 50 days).
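To see how the grouping expression in by splits the rows, here is a minimal base R sketch. The names grp_plain, grp_shifted, size_plain and size_shifted are illustrative only and not part of the original code. It compares the index used above with a variant that subtracts 1 before the integer division:

```r
n <- 366    # number of rows, as in the example table
days <- 50  # block length

# Grouping index as used above: rows 1..49 fall into group 0
grp_plain <- seq_len(n) %/% days

# Subtracting 1 first shifts the block boundaries so that
# the first block also starts with a full set of rows
grp_shifted <- (seq_len(n) - 1) %/% days

# Number of rows per block for each variant
size_plain <- as.vector(table(grp_plain))
size_shifted <- as.vector(table(grp_shifted))

print(size_plain)
print(size_shifted)
```

With the plain index the first block holds only days - 1 rows, whereas the shifted index gives full blocks of days rows everywhere except possibly the last one, which holds whatever rows remain.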
Originally I thought I could use the functions calcPred() and varImportance() for a Generalized Additive Model (GAM), Multivariate Adaptive Regression Splines (MARS) or Gradient Boosting (GB) as well, but unfortunately these versions only work with the LM.
I will now briefly describe how I fit the other three models in general. I need your help here so that in the end the GAM, MARS and GB models are calculated in the same way as the LM.
GAM:
## Create data-vector with dates of dt.train: ##
v.trainDate <- dt.train$date
## Delete column "date" of train data for model fitting: ##
dt.train <- dt.train[, c("date") := NULL]
## Preparation for GAM: ##
trainDataNames <- names(dt.train)
responseVar <- trainDataNames[1]
trainDataNames <- trainDataNames[trainDataNames != responseVar]
## Create right-hand side of GAM model in string/character format: ##
formulaRight <- paste('s(', trainDataNames, ')', sep = '', collapse = ' + ')
## Create the whole formula for GAM model in string/character format: ##
formulaGAM <- paste(responseVar, '~', formulaRight, collapse = ' ')
## Coerce to a formula object: ##
formulaGAM <- as.formula(formulaGAM)
## MODEL FITTING: ##
## Generalized Additive Model: ##
xgbModel <- mgcv::gam(formulaGAM, data = dt.train)
## Train and Test Data PREDICTION with xgbModel: ##
dt.train$prediction <- mgcv::predict.gam(xgbModel, dt.train)
## Add date column back to dt.train: ##
dt.train <- data.table(date = v.trainDate, dt.train)
MARS:
## Create vectors with all DE values of train data set: ##
v.trainY <- dt.train$DE
## Save dates of train data in an extra vector: ##
v.trainDate <- dt.train$date
## Create train matrices for MARS model fitting: ##
m.trainData <- as.matrix(dt.train[, c("date", "DE") := list(NULL, NULL)])
## Model fitting with grid-search: ##
hyper_grid <- expand.grid(degree = 1:3,
nprune = seq(2, 100, length.out = 10) %>% floor()
)
## MODEL FITTING: ##
## Multivariate Adaptive Regression Spline: ##
xgbModel <- caret::train(x = m.trainData,
y = v.trainY,
method = "earth",
metric = "RMSE",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = hyper_grid
)
## Train Data PREDICTION with xgbModel: ##
dt.train$prediction <- stats::predict(xgbModel, dt.train)
GB:
## Create vectors with all DE values of train data set: ##
v.trainY <- dt.train$DE
## Save dates of train data in an extra vector: ##
v.trainDate <- dt.train$date
## Create train matrices for GB model fitting: ##
m.trainData <- as.matrix(dt.train[, c("date", "DE") := list(NULL, NULL)])
## Gradient Boosting with hyper parameter tuning: ##
xgb_trcontrol <- caret::trainControl(method = "cv",
number = 3,
allowParallel = TRUE,
verboseIter = TRUE,
returnData = FALSE
)
xgbgrid <- base::expand.grid(nrounds = c(15000), # 15000
max_depth = c(2),
eta = c(0.01),
gamma = c(1),
colsample_bytree = c(1),
min_child_weight = c(2),
subsample = c(0.6)
)
## MODEL FITTING: ##
## Gradient Boosting: ##
xgbModel <- caret::train(x = m.trainData,
y = v.trainY,
trControl = xgb_trcontrol,
tuneGrid = xgbgrid,
method = "xgbTree"
)
## Train data PREDICTION with xgbModel: ##
dt.train$prediction <- stats::predict(xgbModel, m.trainData)
## Add DE and date columns to dt.train: ##
dt.train <- data.table(DE = v.trainY, dt.train)
dt.train <- data.table(date = v.trainDate, dt.train)
How do I calculate the same for the other three models as for the LM? I hope someone can help me. I'm sorry the question got so long.
You could define the model as a function that you pass as an argument to calcPred and varImportance.
For example, with an LM:
model <- function(data) {stats::lm(DE ~ .-1-date, data = data)}
With GAM:
model <- function(data) {mgcv::gam(formulaGAM, data = data)}
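Because the column names change between tables, it may help to build the GAM formula inside the model function instead of relying on a global formulaGAM. The following is only a sketch under that assumption. make_gam_formula is a hypothetical helper (not part of mgcv or caret), and it assumes syntactic column names, a response column called DE and a date column to exclude:

```r
# Hypothetical helper: build a GAM formula from the columns present in `data`
make_gam_formula <- function(data, response = "DE", exclude = "date") {
  # Every column except the response and the excluded ones becomes a smooth term
  predictors <- setdiff(names(data), c(response, exclude))
  rhs <- paste0("s(", predictors, ")", collapse = " + ")
  stats::as.formula(paste(response, "~", rhs))
}

# Model function with the same signature as the LM version
# (the mgcv package must be installed when this function is called)
model <- function(data) {
  mgcv::gam(make_gam_formula(data), data = data)
}

# Formula for a small example table
example <- data.frame(date = Sys.Date(), DE = 1, Wind = 2, Solar = 3)
print(make_gam_formula(example))
```

This mirrors the paste/as.formula construction from the question, only wrapped so that it is re-evaluated for each block of rows.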
With MARS:
model <- function(data) {
hyper_grid <- expand.grid(degree = 1:3,
nprune = seq(2, 100, length.out = 10) %>% floor())
caret::train(x = subset(data, select = -DE),
y = data$DE,
method = "earth",
metric = "RMSE",
trControl = trainControl(method = "cv", number = 10),
tuneGrid = hyper_grid)
}
I updated the code to take this new argument into account:
library(data.table)
library(caret)
library(magrittr)
days <- 50
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
"DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
"Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
dt.importance <- data.table::copy(dt.train)
## Define model & prediction functions ##
model <- function(data) {stats::lm(DE ~ .-1-date, data = data)}
predict <- function(data,model) {stats::predict(model, data)}
calcPred <- function(data,model){
if (nrow(data)==days) {
stats::predict(model,data) } else {
NULL }
}
## Function that calculates the iteratively variable importance: ##
varImportance <- function(data,model){
cat(nrow(data),'\n')
if (nrow(data)==days) {
terms <- attr(model$terms , "term.labels")
varimp <- caret::varImp(model)
importance <- data[, .(date, imp = t(varimp))]} else
{ NULL }
}
## Train Data PREDICTION with iteratively xgbModel: ##
dt.train <- dt.train[, c('prediction') := calcPred(.SD,model(.SD)), by = (seq_len(nrow(dt.train))-1) %/% days]
## Iteratively variable importance:##
dt.importance <- dt.importance[, varImportance(.SD,model(.SD)), by = (seq_len(nrow(dt.train))-1) %/% days]
To use the other models, just use the model function you want in the above code. This works with LM or GAM on the dataset you provided.
Unfortunately, varImp does not seem to work on your dataset with MARS, although this seems feasible.
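One possible reason, which I have not verified on this dataset, is the shape of the result: for a model fitted through caret::train, caret::varImp returns a varImp.train object whose importance element holds the data frame, whereas for an lm fit it returns the data frame directly, so t(varimp) cannot be applied in the same way. A minimal sketch of a helper that brings both into one shape follows. tidy_importance is a hypothetical name, and the mock objects below only stand in for real varImp results, with made-up values:

```r
# Hypothetical helper: bring caret::varImp() output into one common shape
tidy_importance <- function(vi) {
  # For train() fits the data frame sits in the `importance` element
  imp <- if (inherits(vi, "varImp.train")) vi$importance else vi
  data.frame(variable = rownames(imp),
             importance = imp$Overall,
             stringsAsFactors = FALSE)
}

# Mock objects standing in for real varImp() results (values are made up)
vi_lm <- data.frame(Overall = c(1.5, 0.2), row.names = c("Wind", "Solar"))
vi_train <- structure(list(importance = vi_lm), class = "varImp.train")

print(tidy_importance(vi_lm))
print(tidy_importance(vi_train))
```

Inside varImportance, the result of such a helper could then be returned per block in long format (one row per variable) instead of transposing the importance table.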