Iteration for simple linear regression model
I would like to fit/train a predictive model iteratively, that is, retrain my model every 50 days within the selected period (here: all of 2020). Basically, I want to predict the second column (DE) and use the remaining columns as predictors. My data table can look like this:
library(data.table)

set.seed(123)
days <- 50
## Create random data table: ##
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
                      "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
                      "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
The date range of my data table, the number of predictors, and the number of days (days) can all vary. I have already fitted the model once on the whole data table, but I do not know how to do this iteratively every 50 days. Here is a code snippet of my model fitting for a linear model:
v.trainDate <- dt.data$date
## Delete column "date" of train data for model fitting: ##
dt.data <- dt.data[, c("date") := NULL]
## MODEL FITTING: ##
## Linear Model: ##
lmModel <- stats::lm(DE ~ .-1, data = dt.data)
## In-sample prediction on the training data with lmModel: ##
dt.data$prediction <- stats::predict(lmModel, dt.data)
## Add the date column back to dt.data: ##
dt.data <- data.table(date = v.trainDate, dt.data)
What I want in the end is to first train the model on my data from 2020-01-01 to 2020-02-20 (the first 50 days) and predict the DE price with this fitted model lmModel for the first fifty entries of my data table. The next run should train the model from 2020-02-20 to 2020-04-10 (the next 50 days) and predict the values for these 50 days. This should continue until the last day of December 2020. At the end I need a column called prediction, as in my code snippet, but this column should consist of the iteratively constructed predictions of the DE price.
I would also like to save the variable importance somewhere after each iteration, so that I can see which variable had the most influence on the DE price in the first 50 days, and so on. Does anyone know how this could work?
Here is one way to do it, using nested data frames and the map() function. First create a dataset of starting dates, then select the data relevant to each starting date, then run the regression and extract the results. The final output is a long-format dataset in which each row pairs the first date of a training window (train_start_date) with a date for which a value is predicted (prediction_date).
library(tidyverse)
days <- 50
set.seed(123)
## Create random data table: ##
## Note: tibble() has no check.names argument, so check.names = FALSE below
## is stored as a constant column named check.names.
data <- tibble(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
               "DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
               "Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
## One training window per start date (the first 50 dates are skipped);
## each window holds up to 50 consecutive days.
## Note: `.` in the formula also picks up date and check.names as predictors.
data_out <- tibble(train_start_date = tail(data$date, -50)) %>%
  mutate(data = map(train_start_date, ~filter(data, date >= .x) %>%
                      head(50)),
         lmModel = map(data, ~stats::lm(DE ~ . - 1, data = .x)),
         prediction = map2(lmModel, data, ~tibble(train_end_date = max(.y$date),
                                                  prediction = predict(.x),
                                                  prediction_date = .y$date))) %>%
  select(train_start_date, prediction) %>%
  unnest(prediction) %>%
  mutate(train_n_days = as.integer(train_end_date - train_start_date) + 1) %>%
  select(train_start_date, train_end_date, train_n_days, prediction_date, prediction)
data_out
#> # A tibble: 14,575 x 5
#> train_start_date train_end_date train_n_days prediction_date prediction
#> <date> <date> <dbl> <date> <dbl>
#> 1 2020-02-20 2020-04-09 50 2020-02-20 35.0
#> 2 2020-02-20 2020-04-09 50 2020-02-21 35.0
#> 3 2020-02-20 2020-04-09 50 2020-02-22 35.1
#> 4 2020-02-20 2020-04-09 50 2020-02-23 35.2
#> 5 2020-02-20 2020-04-09 50 2020-02-24 34.6
#> 6 2020-02-20 2020-04-09 50 2020-02-25 35.2
#> 7 2020-02-20 2020-04-09 50 2020-02-26 35.0
#> 8 2020-02-20 2020-04-09 50 2020-02-27 35.0
#> 9 2020-02-20 2020-04-09 50 2020-02-28 35.1
#> 10 2020-02-20 2020-04-09 50 2020-02-29 35.1
#> # … with 14,565 more rows
Created on 2021-02-15 by the reprex package (v1.0.0)
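The row count of 14,575 in the output above follows from how the windows are built: tail(data$date, -50) keeps the dates from position 51 onward as starting dates, and head(50) gives each start up to 50 rows, with the last windows cut short by the end of the data. A small sketch of that arithmetic, using illustrative names (n_obs, starts, window_sizes) that are not part of the answer's code:

```r
n_obs <- 366L  # number of daily observations in 2020
days  <- 50L   # window length

# tail(x, -days) drops the first `days` elements, so starts are positions 51..366
starts <- seq.int(days + 1L, n_obs)

# head(days) returns at most `days` rows; windows near the end are shorter
window_sizes <- pmin(days, n_obs - starts + 1L)

length(starts)      # number of training windows
sum(window_sizes)   # total rows after unnest(): 14575
```

Since every calendar day from 2020-02-20 onward starts its own window, most dates receive predictions from up to 50 different models, which is a rolling-window scheme rather than one model per 50-day block.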
The calcPred function below calculates the predictions and the calcImportance function calculates variable importance. The by = seq_len(nrow(dt.data)) %/% days argument of data.table splits the dataset into chunks of days rows and applies these functions to each chunk. Because seq_len() starts at 1, the first chunk holds days - 1 rows; grouping by (seq_len(nrow(dt.data)) - 1) %/% days instead gives chunks of exactly days rows:
library(data.table)
library(caret)
## Create random data table: ##
dt.data <- data.table(date = seq(as.Date('2020-01-01'), by = '1 day', length.out = 366),
"DE" = rnorm(366, 35, 1), "Wind" = rnorm(366, 5000, 2), "Solar" = rnorm(366, 3, 2),
"Nuclear" = rnorm(366, 100, 5), "ResLoad" = rnorm(366, 200, 3), check.names = FALSE)
set.seed(123)  # note: set after the data above is generated, so that data is not reproducible from this seed
days <- 50
# Prediction calculation
calcPred <- function(data) {
  lmModel <- stats::lm(DE ~ .-1-date, data = data)
  stats::predict(lmModel, data)
}
# Importance calculation
calcImportance <- function(data) {
  lmModel <- stats::lm(DE ~ .-1-date, data = data)
  varimp <- caret::varImp(lmModel)
  # One importance value per predictor, repeated for every date in the chunk
  data[, .(date, imp = t(varimp))]
}
importance.data <- data.table::copy(dt.data)
importance.data[,calcImportance(.SD),by=seq_len(nrow(dt.data)) %/% days]
#> seq_len date imp.Wind imp.Solar imp.Nuclear imp.ResLoad
#> 1: 0 2020-01-01 4.598201 2.4726894 0.7993097 1.7153244
#> 2: 0 2020-01-02 4.598201 2.4726894 0.7993097 1.7153244
#> 3: 0 2020-01-03 4.598201 2.4726894 0.7993097 1.7153244
#> 4: 0 2020-01-04 4.598201 2.4726894 0.7993097 1.7153244
#> 5: 0 2020-01-05 4.598201 2.4726894 0.7993097 1.7153244
#> ---
#> 362: 7 2020-12-27 1.093177 0.2558265 0.1610440 0.5383146
#> 363: 7 2020-12-28 1.093177 0.2558265 0.1610440 0.5383146
#> 364: 7 2020-12-29 1.093177 0.2558265 0.1610440 0.5383146
#> 365: 7 2020-12-30 1.093177 0.2558265 0.1610440 0.5383146
#> 366: 7 2020-12-31 1.093177 0.2558265 0.1610440 0.5383146
dt.data[,c('prediction'):=calcPred(.SD),by=seq_len(nrow(dt.data)) %/% days]
dt.data
#> date DE Wind Solar Nuclear ResLoad prediction
#> 1: 2020-01-01 36.51972 5000.608 1.4283653 92.19844 200.1163 35.02625
#> 2: 2020-01-02 34.96544 4999.235 0.3860045 92.29005 203.2613 34.96200
#> 3: 2020-01-03 35.16448 5002.232 -1.4450524 100.32920 202.5225 35.67255
#> 4: 2020-01-04 36.07978 5000.564 -0.9137483 98.07459 206.3788 35.13325
#> 5: 2020-01-05 35.10967 4997.606 4.9788029 101.27625 201.8148 34.29788
#> ---
#> 362: 2020-12-27 34.98190 4997.936 2.6117305 98.33027 195.1352 34.80871
#> 363: 2020-12-28 35.16974 4998.799 1.0776123 108.26064 195.2474 35.01232
#> 364: 2020-12-29 35.37956 4998.651 3.9252237 102.87948 201.0266 35.09362
#> 365: 2020-12-30 35.51517 4999.428 5.9747031 92.38721 196.4204 34.60962
#> 366: 2020-12-31 33.53278 5001.911 3.5062344 93.60744 197.5292 34.85689
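For comparison, here is a minimal base R sketch of the same chunked idea that stores the importance once per chunk instead of repeating it on every row. It rests on a few assumptions that are not in the answer above: chunk ids come from (seq_len(n) - 1) %/% days so that each chunk except the last has exactly days rows, the importance is taken as the absolute t-statistic of each coefficient (which, as far as I know, is what caret::varImp reports for linear models), and the names df, fit_chunk, chunk and importance are purely illustrative.

```r
set.seed(123)  # seed set before the data is generated
days <- 50

df <- data.frame(
  date    = seq(as.Date("2020-01-01"), by = "1 day", length.out = 366),
  DE      = rnorm(366, 35, 1),
  Wind    = rnorm(366, 5000, 2),
  Solar   = rnorm(366, 3, 2),
  Nuclear = rnorm(366, 100, 5),
  ResLoad = rnorm(366, 200, 3)
)

# Chunk ids 0, 1, 2, ...; subtracting 1 puts rows 1..50 into chunk 0
df$chunk <- (seq_len(nrow(df)) - 1) %/% days

# Fit one no-intercept linear model per chunk and return its in-sample
# predictions together with the absolute t-statistics of the coefficients
fit_chunk <- function(d) {
  fit <- stats::lm(DE ~ Wind + Solar + Nuclear + ResLoad - 1, data = d)
  list(
    prediction = stats::predict(fit, d),
    importance = abs(summary(fit)$coefficients[, "t value"])
  )
}

results <- lapply(split(df, df$chunk), fit_chunk)

# Put the per-chunk predictions back into the original row order
df$prediction <- unsplit(lapply(results, `[[`, "prediction"), df$chunk)

# One row of importance values per chunk, labelled by the chunk's first date
importance <- data.frame(
  chunk       = as.integer(names(results)),
  train_start = df$date[!duplicated(df$chunk)],
  do.call(rbind, lapply(results, `[[`, "importance"))
)
```

With 366 rows and days = 50 this yields seven full chunks and a final chunk of 16 rows, so the last model is fitted on much less data than the others. The predictions here are in-sample, as in the answer above.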