
How to iteratively train forecast models (GAM, MARS, …) based on selected days and calculate the variable importance in the time period

I have a data table whose number of columns and column names vary, and a numeric variable called days (its value also varies; here it is 50):

library(data.table)
library(caret)

days <- 50
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                       "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                       "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

I am fitting a linear model (LM) to predict the DE column, and I calculate the variable importance for each block of days rows. See the following code snippet:

## MODEL FITTING: ##
## Linear Model: ##

## Function that calculates the prediction iteratively: ##
calcPred <- function(data){
  ## Model fitting: ##
  xgbModel <- stats::lm(DE ~ .-1-date, data = data)
  ## Prediction on the same data the model was fitted on: ##
  stats::predict.lm(xgbModel, data)
}

## Function that calculates the variable importance iteratively: ##
varImportance <- function(data){
  ## Model fitting: ##
  xgbModel <- stats::lm(DE ~ .-1-date, data = data)
  
  terms <- attr(xgbModel$terms , "term.labels")
  varimp <- caret::varImp(xgbModel)
  importance <- data[, .(date, imp = t(varimp))]
} 


## Train data PREDICTION, one model per block of `days` rows: ##
dt.train <- dt.train[, c('prediction') := calcPred(.SD), by = seq_len(nrow(dt.train)) %/% days]

## Variable importance, one model per block of `days` rows: ##
dt.importance <- data.table::copy(dt.train[, c("prediction") := NULL])
dt.importance <- dt.importance[, varImportance(.SD), by = seq_len(nrow(dt.train)) %/% days]

What happens here: the model is always trained on a block of 50 days, and the prediction is then made for exactly those 50 days. This continues until the end date of the table. In addition, the varImportance() function returns the variable importances of the predictors (all columns except date and DE) for each training interval (here, each block of 50 days).

Originally I thought I could also use calcPred() and varImportance() for a Generalized Additive Model (GAM), Multivariate Adaptive Regression Splines (MARS), or Gradient Boosting (GB), but unfortunately these versions only work with the LM.

Below I briefly describe how I fit the other three models on the whole data set. I need your help to adapt them so that, in the end, the GAM, MARS and GB models are calculated iteratively in the same way as the LM.

GAM:

## Create data-vector with dates of dt.train: ##
v.trainDate <- dt.train$date
## Delete column "date" of train data for model fitting: ##
dt.train <- dt.train[, c("date") := NULL]

## Preparation for GAM: ##
trainDataNames <- names(dt.train)
responseVar <- trainDataNames[1]
trainDataNames <- trainDataNames[trainDataNames != responseVar]
## Create right-hand side of GAM model in string/character format: ##
formulaRight <- paste('s(', trainDataNames, ')', sep = '', collapse = ' + ')
## Create the whole formula for GAM model in string/character format: ##
formulaGAM <- paste(responseVar, '~', formulaRight, collapse = ' ')
## Coerce to a formula object: ##
formulaGAM <- as.formula(formulaGAM)

## MODEL FITTING: ##
## Generalized Additive Model: ##
xgbModel <- mgcv::gam(formulaGAM, data = dt.train)

## Train data PREDICTION with xgbModel: ##
dt.train$prediction <- mgcv::predict.gam(xgbModel, dt.train)

## Add date column back to dt.train: ##
dt.train <- data.table(date = v.trainDate, dt.train)

MARS:

## Create vectors with all DE values of train data set: ##
v.trainY <- dt.train$DE
## Save dates of train data in an extra vector: ##
v.trainDate <- dt.train$date
## Create train matrix for MARS model fitting: ##
m.trainData <- as.matrix(dt.train[, c("date", "DE") := list(NULL, NULL)])
## Hyperparameter grid for the grid search (%>% requires library(magrittr)): ##
hyper_grid <- expand.grid(degree = 1:3, 
                          nprune = seq(2, 100, length.out = 10) %>% floor()
              )
              
## MODEL FITTING: ##
## Multivariate Adaptive Regression Spline: ##
xgbModel <- caret::train(x = m.trainData, 
                         y = v.trainY,
                         method = "earth",
                         metric = "RMSE",
                         trControl = trainControl(method = "cv", number = 10),
                         tuneGrid = hyper_grid
              )
              
              
## Train Data PREDICTION with xgbModel: ##
dt.train$prediction <- stats::predict(xgbModel, dt.train)

GB:

## Create vectors with all DE values of train data set: ##
v.trainY <- dt.train$DE
## Save dates of train data in an extra vector: ##
v.trainDate <- dt.train$date
## Create train matrices for GB model fitting: ##
m.trainData <- as.matrix(dt.train[, c("date", "DE") := list(NULL, NULL)])

## Gradient Boosting with hyper parameter tuning: ##
xgb_trcontrol <- caret::trainControl(method = "cv",
                                     number = 3,
                                     allowParallel = TRUE,
                                     verboseIter = TRUE,
                                     returnData = FALSE
)

xgbgrid <- base::expand.grid(nrounds = c(15000), # 15000
                             max_depth = c(2),
                             eta = c(0.01),
                             gamma = c(1),
                             colsample_bytree = c(1),
                             min_child_weight = c(2),
                             subsample = c(0.6)
)

## MODEL FITTING: ##
## Gradient Boosting: ##
xgbModel <- caret::train(x = m.trainData, 
                         y = v.trainY,
                         trControl = xgb_trcontrol,
                         tuneGrid = xgbgrid,
                         method = "xgbTree"
)

## Train data PREDICTION with xgbModel: ##
dt.train$prediction <- stats::predict(xgbModel, m.trainData)

## Add DE and date columns to dt.train: ##
dt.train <- data.table(DE = v.trainY, dt.train)
dt.train <- data.table(date = v.trainDate, dt.train)

How do I calculate the same for the other three models as for the LM? I hope someone can help me. I'm sorry the question got so long.

You could define the model as a function that you pass as an argument to calcPred and varImportance.

For example, with an LM:

model <- function(data) {stats::lm(DE ~ .-1-date, data = data)}

With a GAM:

model <- function(data) {mgcv::gam(formulaGAM, data = data)}
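The GAM version above relies on formulaGAM already existing in the global environment. As a minimal sketch, the formula could instead be built from the column names of whatever block is passed in. makeGamFormula is a hypothetical helper name (not part of mgcv), and the sketch assumes that every column except the response and date is a numeric predictor that should get a default smooth term s().

```r
## Hypothetical helper: build "response ~ s(x1) + s(x2) + ..." from column names.
## Assumption: every column except `response` and those listed in `drop`
## is a numeric predictor (a leftover "prediction" column would be included too).
makeGamFormula <- function(data, response = "DE", drop = "date") {
  predictors <- setdiff(names(data), c(response, drop))
  stats::reformulate(sprintf("s(%s)", predictors), response = response)
}

## Small demonstration on a plain data.frame using column names from the question:
demo <- data.frame(date = Sys.Date() + 0:2, DE = 1:3, Wind = 4:6, Solar = 7:9)
formulaDemo <- makeGamFormula(demo)
print(deparse(formulaDemo))
```

The model function could then be written as model <- function(data) mgcv::gam(makeGamFormula(data), data = data), so that it no longer depends on a global formula object.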

With MARS:

model <- function(data) {
  hyper_grid <- expand.grid(degree = 1:3, 
                            nprune = seq(2, 100, length.out = 10) %>% floor())
  caret::train(x = subset(data, select = -DE),
               y = data$DE,
               method = "earth",
               metric = "RMSE",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = hyper_grid)
}

I updated the code to take this new argument into account:

library(data.table)
library(caret)
library(magrittr)


days <- 50
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                       "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                       "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

dt.importance <- data.table::copy(dt.train)

## Define model & prediction functions ##

model <- function(data) {stats::lm(DE ~ .-1-date, data = data)}

## Predict on a block; incomplete blocks (fewer than `days` rows) get NA: ##
calcPred <- function(data, model){
  if (nrow(data) == days) {
    stats::predict(model, data)
  } else {
    rep(NA_real_, nrow(data))
  }
}

## Function that calculates the variable importance iteratively: ##
varImportance <- function(data, model){
  ## Skip incomplete blocks (fewer than `days` rows): ##
  if (nrow(data) == days) {
    varimp <- caret::varImp(model)
    data[, .(date, imp = t(varimp))]
  } else {
    NULL
  }
}


## Train data PREDICTION, one model per block of `days` rows: ##
dt.train <- dt.train[, c('prediction') := calcPred(.SD,model(.SD)), by = (seq_len(nrow(dt.train))-1) %/% days]

## Variable importance, one model per block of `days` rows: ##

dt.importance <- dt.importance[, varImportance(.SD,model(.SD)), by = (seq_len(nrow(dt.train))-1) %/% days]
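The grouping expression in the updated code subtracts 1 before the integer division. A quick check in base R (assuming 366 rows and days <- 50, as above) shows why: without the shift, the first block has only 49 rows, so it would fail the nrow(data) == days check.

```r
days <- 50
n <- 366

## Index as used in the question: row 50 already falls into block 1.
blocksOld <- seq_len(n) %/% days
## Shifted index as used in the updated code: rows 1..50 form block 0.
blocksNew <- (seq_len(n) - 1) %/% days

## Number of rows per block for both variants:
sizesOld <- as.vector(table(blocksOld))
sizesNew <- as.vector(table(blocksNew))
print(sizesOld)
print(sizesNew)
```

With the shift, blocks 0 to 6 contain exactly 50 rows each and the last block holds the remaining 16 rows, which is the block skipped by the nrow(data) == days check.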

To use one of the other models, just substitute the corresponding model function in the code above. This works with the LM or the GAM on the dataset you provided.

Unfortunately, varImp does not seem to work with MARS on your dataset, although this seems feasible.
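For models where caret::varImp fails, one model-agnostic alternative is permutation importance: shuffle one predictor at a time and measure how much the in-sample RMSE gets worse. This is a different technique from caret::varImp, so its values are not comparable to the varImp output. The function permImportance and its arguments are hypothetical names used for this sketch. It only assumes that stats::predict(fit, newdata = ...) works for the fitted model; the demonstration uses a plain lm on simulated data.

```r
## Hypothetical helper: permutation importance as mean increase in RMSE.
## Assumption: `fit` has a predict method that accepts `newdata`.
permImportance <- function(fit, data, response, predictors, nRep = 5) {
  data <- as.data.frame(data)  # also accepts a data.table block such as .SD
  rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
  baseline <- rmse(data[[response]], stats::predict(fit, newdata = data))
  vapply(predictors, function(p) {
    increases <- vapply(seq_len(nRep), function(i) {
      shuffled <- data
      shuffled[[p]] <- sample(shuffled[[p]])  # break the link to the response
      rmse(shuffled[[response]], stats::predict(fit, newdata = shuffled)) - baseline
    }, numeric(1))
    mean(increases)
  }, numeric(1))
}

## Demonstration on simulated data: y depends strongly on x1 and not on x2.
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
df$y <- 3 * df$x1 + rnorm(50, sd = 0.1)
fit <- stats::lm(y ~ x1 + x2, data = df)
imp <- permImportance(fit, df, response = "y", predictors = c("x1", "x2"))
print(imp)
```

Inside varImportance, such a helper could be called on each complete block in place of caret::varImp, with the predictor names taken from the block's columns excluding date and DE.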

