简体   繁体   中英

How to iteratively train forecast models (GAM, MARS, …) based on selected days and calculate the variable importance in the time period

I have a data table which always have different number of columns and column names and a numeric variable called days (this variable also differs; now/here: 50):

library(data.table)
library(caret)

days -> 50  
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                       "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                       "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

I'm modelling/training a Linear Model (= LM), where I want to predict the DE column and I calculate the variable importance with respect to the days variable. See the following code snippet:

## MODEL FITTING: ##
## Linear Model: ##

## Function that calculates the iteratively prediction: ##
calcPred <- function(data){
  ## Model fitting: ##
  xgbModel <- stats::lm(DE ~ .-1-date, data = data)
  ## Model training: ##
  stats::predict.lm(xgbModel, data)
}

## Function that calculates the iteratively variable importance: ##
varImportance <- function(data){
  ## Model fitting: ##
  xgbModel <- stats::lm(DE ~ .-1-date, data = data)
  
  terms <- attr(xgbModel$terms , "term.labels")
  varimp <- caret::varImp(xgbModel)
  importance <- data[, .(date, imp = t(varimp))]
} 


## Train Data PREDICTION with iteratively xgbModel: ##
dt.train <- dt.train[, c('prediction') := calcPred(.SD), by = seq_len(nrow(dt.train)) %/% days]

## Iteratively variable importance:##
dt.importance <- data.table::copy(dt.train[, c("prediction") := NULL])
dt.importance <- dt.importance[, varImportance(.SD), by = seq_len(nrow(dt.train)) %/% days]

What happens here: My model is always trained for 50 days and then precisely for this time period there is a prediction of these trained 50 days done. And that continues until the end date of my table. In addition, the varImportance() function gives the variable importances of the predictors (all columns, excluding date and DE ) in the training intervall (here for each 50 days).

Originally I thought that I could use the functions calcPred() and varImportance() for a Generalized Additive Model (= GAM) and Multivariative Adaptive Regression Spline (= MARS) or Gradient Boosting (= GB) too, but unfortunately this versions only work with the LM.

I would now like to briefly describe the model fitting for the other three models in general, but I would also need your help here so that in the end the GAM, MARS and GB model as well as the LM are calculated.

GAM:

## Create data-vector with dates of dt.train: ##
v.trainDate <- dt.train$date
## Delete column "date" of train data for model fitting: ##
dt.train <- dt.train[, c("date") := NULL]

## Preparation for GAM: ##
trainDataNames <- names(dt.train)
responseVar <- trainDataNames[1]
trainDataNames <- trainDataNames[trainDataNames != responseVar]
## Create right-hand side of GAM model in string/character format: ##
formulaRight <- paste('s(', trainDataNames, ')', sep = '', collapse = ' + ')
## Create the whole formula for GAM model in string/character format: ##
formulaGAM <- paste(responseVar, '~', formulaRight, collapse = ' ')
## Coerce to a formula object: ##
formulaGAM <- as.formula(formulaGAM)

## MODEL FITTING: ##
## Generalized Additive Model: ##
xgbModel <- mgcv::gam(formulaGAM, data = dt.train)

## Train and Test Data PREDICTION with xgbModel: ##
dt.train$prediction <- mgcv::predict.gam(xgbModel, dt.train)

## Add date columns to dt.train and dt.test: ##
dt.train <- data.table(date = v.trainDate, dt.train)

MARS:

## Create vectors with all DE values of train data set: ##
v.trainY <- dt.train$DE
## Save dates of train data in an extra vector: ##
v.trainDate <- dt.train$date
## Create train matrices for GB model fitting: ##
m.trainData <- as.matrix(dt.train[, c("date", "DE") := list(NULL, NULL)])
## Model fitting with grid-search: ##: ##
hyper_grid <- expand.grid(degree = 1:3, 
                          nprune = seq(2, 100, length.out = 10) %>% floor()
              )
              
## MODEL FITTING: ##
## Multivariate Adaptive Regression Spline: ##
xgbModel <- caret::train(x = m.trainData, 
                         y = v.trainY,
                         method = "earth",
                         metric = "RMSE",
                         trControl = trainControl(method = "cv", number = 10),
                                       tuneGrid = hyper_grid
              )
              
              
## Train Data PREDICTION with xgbModel: ##
dt.train$prediction <- stats::predict(xgbModel, dt.train)

GB:

## Create vectors with all DE values of train data set: ##
v.trainY <- dt.train$DE
## Save dates of train data in an extra vector: ##
v.trainDate <- dt.train$date
## Create train matrices for GB model fitting: ##
m.trainData <- as.matrix(dt.train[, c("date", "DE") := list(NULL, NULL)])

## Gradient Boosting with hyper parameter tuning: ##
xgb_trcontrol <- caret::trainControl(method = "cv",
                                     number = 3,
                                     allowParallel = TRUE,
                                     verboseIter = TRUE,
                                     returnData = FALSE
)

xgbgrid <- base::expand.grid(nrounds = c(15000), # 15000
                             max_depth = c(2),
                             eta = c(0.01),
                             gamma = c(1),
                             colsample_bytree = c(1),
                             min_child_weight = c(2),
                             subsample = c(0.6)
)

## MODEL FITTING: ##
## Gradient Boosting: ##
xgbModel <- caret::train(x = m.trainData, 
                         y = v.trainY,
                         trControl = xgb_trcontrol,
                         tuneGrid = xgbgrid,
                         method = "xgbTree"
)

## Train data PREDICTION with xgbModel: ##
dt.train$prediction <- stats::predict(xgbModel, m.trainData)

## Add DE and date columns to dt.train: ##
dt.train <- data.table(DE = v.trainY, dt.train)
dt.train <- data.table(date = v.trainDate, dt.train)

How do I calculate the same for the other three models as for the LM? I hope someone can help me. I'm sorry the question got so long.

You could define the model as a function you pass as argument to calcPred and varImportance .

For example with a LM

model <- function(data) {stats::lm(DE ~ .-1-date, data = data)}

With GAM

model <- function(data) {mgcv::gam(formulaGAM, data = data)}

with MARS :

model <- function(data) {
  hyper_grid <- expand.grid(degree = 1:3, 
                            nprune = seq(2, 100, length.out = 10) %>% floor())
  caret::train(x = subset(data, select = -DE),
               y = data$DE,
               method = "earth",
               metric = "RMSE",
               trControl = trainControl(method = "cv", number = 10),
               tuneGrid = hyper_grid)
}

I updated the code to take into account this new argument:

library(data.table)
library(caret)
library(magrittr)


days <- 50
## Create random data table: ##
dt.train <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                       "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                       "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

dt.importance <- data.table::copy(dt.train)

## Define model & prediction functions ##

model <- function(data) {stats::lm(DE ~ .-1-date, data = data)}

predict <- function(data,model) {stats::predict(model, data)}

calcPred <- function(data,model){
  if (nrow(data)==days) {
  stats::predict(model,data) } else {
  NULL }
}

## Function that calculates the iteratively variable importance: ##
varImportance <- function(data,model){
  cat(nrow(data),'\n')
  if (nrow(data)==days) {
  terms <- attr(model$terms , "term.labels")
  varimp <- caret::varImp(model)
  importance <- data[, .(date, imp = t(varimp))]} else
  { NULL }
}


## Train Data PREDICTION with iteratively xgbModel: ##
dt.train <- dt.train[, c('prediction') := calcPred(.SD,model(.SD)), by = (seq_len(nrow(dt.train))-1) %/% days]

## Iteratively variable importance:##

dt.importance <- dt.importance[, varImportance(.SD,model(.SD)), by = (seq_len(nrow(dt.train))-1) %/% days]

To use the other models, just use the model function you wish in the above code. This works with LM or GAM on the dataset you provided.

Unfortunately, varImp seems not to work on your dataset with MARS although this seems feasible .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM