Iteration for simple linear regression model

I would like to fit/train a predictive model iteratively; for example, I want to train my model every 50 days within a selected period (here: all of 2020). Basically, I want to predict the second column (DE) and use the remaining columns and parameters for the prediction. My data table can look like this:

library(data.table)
set.seed(123)
days <- 50
## Create random data table: ##
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                      "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                      "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

The date range of my data table, the number of predictors, and the number of days (days) can differ from case to case. I have already fitted/trained the model once on the whole data table, but I do not know how to do this iteratively every 50 days. Here is a code snippet of my model fitting for a linear model:

v.trainDate <- dt.data$date
## Delete column "date" of train data for model fitting: ##
dt.data <- dt.data[, c("date") := NULL]

## MODEL FITTING: ##
## Linear Model: ##
lmModel <- stats::lm(DE ~ .-1, data = dt.data)

## In-sample prediction with lmModel: ##
dt.data$prediction <- stats::predict(lmModel, dt.data)
## Add the date column back to dt.data: ##
dt.data <- data.table(date = v.trainDate, dt.data)

What I want in the end is to first train the model on my data from 2020-01-01 to 2020-02-20 (the first 50 days) and predict the DE price with this fitted model lmModel for the first fifty entries of my data table. The next run should train the model from 2020-02-20 to 2020-04-10 (the next 50 days) and predict the values for these new 50 days. This should continue until the last day of December 2020. At the end I need a column called prediction, as in my code snippet, but this column should consist of the iteratively constructed predictions of the DE price.

I would also like to save the variable importance somewhere after each iteration, so that I can see which variable had the most influence on the DE price in the first 50 days, and so on. Does anyone know how this could work?

Here is one way to do it, using nested data frames and the map() function. First create a dataset of training start dates, then select the data relevant for each start date, then run the regression and extract the results. Note that the code below starts a window on every date after the first 50 days, so the 50-day windows overlap. The final output is a dataset in which each row pairs the first date of a training sample with a date for which a prediction is made.

library(tidyverse)
days <- 50
set.seed(123)
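## Create random data table: ##
# Note: tibble() has no check.names argument, so `check.names = FALSE`
# below becomes an extra constant column named check.names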
data <- tibble(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                  "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                  "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)


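# One row per training start date: every date after the first `days` days
data_out <- tibble(train_start_date = tail(data$date, -days)) %>%
  # keep the (up to) `days` rows starting at each training start date
  mutate(data = map(train_start_date, ~filter(data, date >= .x) %>%
                      head(days)),
         # note: `.` expands to every other column of the window,
         # so `date` is also used as a regressor here
         lmModel = map(data, ~stats::lm(DE ~ .-1, data = .x)),
         # in-sample predictions (fitted values) for each window
         prediction = map2(lmModel, data, ~tibble(train_end_date = max(.y$date),
                                                  prediction = predict(.x),
                                                  prediction_date = .y$date))) %>%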
  select(train_start_date, prediction) %>% 
  unnest(prediction) %>% 
  mutate(train_n_days = as.integer(train_end_date-train_start_date)+1) %>% 
  select(train_start_date, train_end_date, train_n_days, prediction_date, prediction)

data_out
#> # A tibble: 14,575 x 5
#>    train_start_date train_end_date train_n_days prediction_date prediction
#>    <date>           <date>                <dbl> <date>               <dbl>
#>  1 2020-02-20       2020-04-09               50 2020-02-20            35.0
#>  2 2020-02-20       2020-04-09               50 2020-02-21            35.0
#>  3 2020-02-20       2020-04-09               50 2020-02-22            35.1
#>  4 2020-02-20       2020-04-09               50 2020-02-23            35.2
#>  5 2020-02-20       2020-04-09               50 2020-02-24            34.6
#>  6 2020-02-20       2020-04-09               50 2020-02-25            35.2
#>  7 2020-02-20       2020-04-09               50 2020-02-26            35.0
#>  8 2020-02-20       2020-04-09               50 2020-02-27            35.0
#>  9 2020-02-20       2020-04-09               50 2020-02-28            35.1
#> 10 2020-02-20       2020-04-09               50 2020-02-29            35.1
#> # … with 14,565 more rows

Created on 2021-02-15 by the reprex package (v1.0.0)
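For the non-overlapping 50-day blocks described in the question, a base R sketch could look like the following. It is not the code from the answer above: the data frame `df`, the block index `block`, and the helper `fit_block` are names made up for illustration. It also assumes that `date` should be excluded from the regressors and that in-sample fitted values are what is wanted as predictions.

```r
set.seed(123)
days <- 50
n <- 366

# Random example data with the same columns as in the question
df <- data.frame(
  date    = seq(as.Date("2020-01-01"), by = "1 day", length.out = n),
  DE      = rnorm(n, 35, 1),
  Wind    = rnorm(n, 5000, 2),
  Solar   = rnorm(n, 3, 2),
  Nuclear = rnorm(n, 100, 5),
  ResLoad = rnorm(n, 200, 3)
)

# Block id: rows 1-50 -> 0, rows 51-100 -> 1, ...; the last block is shorter
block <- (seq_len(n) - 1) %/% days

# Hypothetical helper: fit a no-intercept lm on one block and
# return the in-sample predictions together with their dates
fit_block <- function(d) {
  fit <- stats::lm(DE ~ . - date - 1, data = d)
  data.frame(date = d$date, prediction = unname(stats::fitted(fit)))
}

# Fit each block separately and stack the results in the original row order
pred <- do.call(rbind, lapply(split(df, block), fit_block))
rownames(pred) <- NULL

# Column of block-wise predictions next to the original data
df$prediction <- pred$prediction
```

Because split() keeps the rows of each block in their original order and the blocks are numbered in increasing order, `pred` lines up row by row with `df`.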

The calcPred function below calculates the predictions, and the calcImportance function calculates the variable importance.

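The by = seq_len(nrow(dt.data)) %/% days argument of data.table splits the dataset into chunks of days rows and applies the functions above to each chunk. Note that with this index the first chunk holds only days - 1 rows (rows 1 to 49 here), because row 50 already maps to chunk 1; (seq_len(nrow(dt.data)) - 1) %/% days would give chunks of exactly days rows: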

library(data.table)
library(caret)

## Create random data table: ##
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                      "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                      "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3),  check.names = FALSE)

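# Note: the seed is set after the data are generated,
# so the random values above differ between runs
set.seed(123)
days <- 50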

# Prediction calculation: fit a no-intercept linear model on one chunk
# and return its in-sample predictions
calcPred <- function(data) {
  lmModel <- stats::lm(DE ~ .-1-date, data = data)
  stats::predict(lmModel, data)
}

# Importance calculation: for an lm, caret::varImp() returns the absolute
# t-statistic of each coefficient; it is repeated for every date in the chunk
calcImportance <- function(data) {
  lmModel <- stats::lm(DE ~ .-1-date, data = data)
  varimp <- caret::varImp(lmModel)
  data[,.(date,imp = t(varimp))]
}

importance.data <- data.table::copy(dt.data)
importance.data[,calcImportance(.SD),by=seq_len(nrow(dt.data)) %/% days]
#>      seq_len       date imp.Wind imp.Solar imp.Nuclear imp.ResLoad
#>   1:       0 2020-01-01 4.598201 2.4726894   0.7993097   1.7153244
#>   2:       0 2020-01-02 4.598201 2.4726894   0.7993097   1.7153244
#>   3:       0 2020-01-03 4.598201 2.4726894   0.7993097   1.7153244
#>   4:       0 2020-01-04 4.598201 2.4726894   0.7993097   1.7153244
#>   5:       0 2020-01-05 4.598201 2.4726894   0.7993097   1.7153244
#>  ---                                                              
#> 362:       7 2020-12-27 1.093177 0.2558265   0.1610440   0.5383146
#> 363:       7 2020-12-28 1.093177 0.2558265   0.1610440   0.5383146
#> 364:       7 2020-12-29 1.093177 0.2558265   0.1610440   0.5383146
#> 365:       7 2020-12-30 1.093177 0.2558265   0.1610440   0.5383146
#> 366:       7 2020-12-31 1.093177 0.2558265   0.1610440   0.5383146
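If one row per block is enough, rather than the importance repeated for every date, a similar table can be built with base R alone. This is only a sketch: `df`, `block`, and `imp_block` are hypothetical names, the blocks here hold exactly 50 rows (so the numbers will not match the output above), and the importance is taken as the absolute t-statistic of each coefficient from summary(), which is the measure caret's varImp() documents for linear models.

```r
set.seed(123)
days <- 50
n <- 366

# Random example data with the same columns as in the question
df <- data.frame(
  date    = seq(as.Date("2020-01-01"), by = "1 day", length.out = n),
  DE      = rnorm(n, 35, 1),
  Wind    = rnorm(n, 5000, 2),
  Solar   = rnorm(n, 3, 2),
  Nuclear = rnorm(n, 100, 5),
  ResLoad = rnorm(n, 200, 3)
)

# Block id giving blocks of exactly `days` rows (the last one is shorter)
block <- (seq_len(n) - 1) %/% days

# Hypothetical helper: one row per block with the absolute t-statistics
imp_block <- function(d) {
  fit <- stats::lm(DE ~ . - date - 1, data = d)
  tvals <- summary(fit)$coefficients[, "t value"]
  data.frame(block_start = min(d$date),
             block_end   = max(d$date),
             t(abs(tvals)))
}

importance <- do.call(rbind, lapply(split(df, block), imp_block))
importance
```

Each row of `importance` covers one block, so the most influential variable in, say, the first 50 days can be read from the first row.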

dt.data[,c('prediction'):=calcPred(.SD),by=seq_len(nrow(dt.data)) %/% days]

dt.data
#>            date       DE     Wind      Solar   Nuclear  ResLoad prediction
#>   1: 2020-01-01 36.51972 5000.608  1.4283653  92.19844 200.1163   35.02625
#>   2: 2020-01-02 34.96544 4999.235  0.3860045  92.29005 203.2613   34.96200
#>   3: 2020-01-03 35.16448 5002.232 -1.4450524 100.32920 202.5225   35.67255
#>   4: 2020-01-04 36.07978 5000.564 -0.9137483  98.07459 206.3788   35.13325
#>   5: 2020-01-05 35.10967 4997.606  4.9788029 101.27625 201.8148   34.29788
#>  ---                                                                      
#> 362: 2020-12-27 34.98190 4997.936  2.6117305  98.33027 195.1352   34.80871
#> 363: 2020-12-28 35.16974 4998.799  1.0776123 108.26064 195.2474   35.01232
#> 364: 2020-12-29 35.37956 4998.651  3.9252237 102.87948 201.0266   35.09362
#> 365: 2020-12-30 35.51517 4999.428  5.9747031  92.38721 196.4204   34.60962
#> 366: 2020-12-31 33.53278 5001.911  3.5062344  93.60744 197.5292   34.85689
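The chunk sizes produced by the grouping index can be checked directly. The short sketch below compares the index used above with a shifted variant; `idx_answer` and `idx_shifted` are illustrative names, not part of the answer's code.

```r
n <- 366
days <- 50

# Index as used in the `by` argument above
idx_answer <- seq_len(n) %/% days
# Shifted variant so that rows 1..50 fall into the same chunk
idx_shifted <- (seq_len(n) - 1) %/% days

sizes_answer <- as.integer(table(idx_answer))
sizes_shifted <- as.integer(table(idx_shifted))

print(sizes_answer)
#> [1] 49 50 50 50 50 50 50 17
print(sizes_shifted)
#> [1] 50 50 50 50 50 50 50 16
```

With the shifted index, every chunk except the last one contains exactly `days` rows.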


