Fit a linear regression model for a variable that depends on its past values in R

I am working on a model that is similar to time series prediction.

I have to fit a linear regression model to a target variable (TV) that depends on two other variables (X and Y) and also on its own past values.

Basically the model looks like this:

TV(t) ~ X(t) + Y(t) + TV(t-1) + TV(t-2) + TV(t-3)

I got stuck trying to translate this into R code:

model <- lm(modeldata$TV ~ modeldata$X  +modeldata$Y+ ??)

How do I write the R code to fit this kind of model?

One possible solution is to use Hadley Wickham's dplyr package and its lag() function. Here is a complete example. We first create a simple modeldata data frame.

modeldata <- data.frame(X=1:10, Y=1:10, TV=1:10)
modeldata
    X  Y TV
1   1  1  1
2   2  2  2
3   3  3  3
4   4  4  4
5   5  5  5
6   6  6  6
7   7  7  7
8   8  8  8
9   9  9  9
10 10 10 10

Then we load the dplyr package and use its mutate() function to create new columns in the data frame with lag().

library(dplyr)
modeldata <- mutate(modeldata, TVm1 = lag(TV,1), TVm2 = lag(TV,2), TVm3 = lag(TV, 3))
modeldata
    X  Y TV TVm1 TVm2 TVm3
1   1  1  1   NA   NA   NA
2   2  2  2    1   NA   NA
3   3  3  3    2    1   NA
4   4  4  4    3    2    1
5   5  5  5    4    3    2
6   6  6  6    5    4    3
7   7  7  7    6    5    4
8   8  8  8    7    6    5
9   9  9  9    8    7    6
10 10 10 10    9    8    7

Lastly, we pass all variables from our data frame (using the ~ . notation) to the lm() function.

model <- lm(TV ~ ., data = modeldata)
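Because X, Y and TV are identical in the toy data above, that fit is degenerate: lm() reports NA for the redundant coefficients. The following is only a minimal sketch of the same steps on simulated data with an explicit formula. The coefficients used to simulate TV are made up, and lag_k() is a hypothetical base-R helper standing in for dplyr's lag().

```r
set.seed(1)
n <- 200
X <- rnorm(n)
Y <- rnorm(n)
TV <- numeric(n)
# Simulate TV from assumed (made-up) coefficients; the first 3 values stay at 0
for (t in 4:n) {
  TV[t] <- 0.5 * X[t] - 0.3 * Y[t] + 0.4 * TV[t - 1] + 0.2 * TV[t - 2] +
    0.1 * TV[t - 3] + rnorm(1, sd = 0.1)
}

# Hypothetical helper: shift a vector by k positions, padding the start with NA
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))

simdata <- data.frame(X = X, Y = Y, TV = TV,
                      TVm1 = lag_k(TV, 1),
                      TVm2 = lag_k(TV, 2),
                      TVm3 = lag_k(TV, 3))

# Explicit formula; with the default na.action (na.omit), lm() drops
# the first 3 rows because their lag values are NA
fit <- lm(TV ~ X + Y + TVm1 + TVm2 + TVm3, data = simdata)
coef(fit)
nobs(fit)  # number of rows actually used in the fit
```

Writing the formula out term by term, rather than using ~ ., keeps the model unchanged if other columns are later added to the data frame.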

To obtain predictions based on this model, we have to prepare the test set in the same way.

testdata <- data.frame(X = 11:15, Y = 11:15, TV = 11:15)
testdata <- mutate(testdata, TVm1 = lag(TV,1), TVm2 = lag(TV,2), TVm3 = lag(TV, 3))
predict(model, newdata = testdata)

Only observations 14 and 15 (the last two rows of testdata) have all three lag values, so only those rows can be predicted from their own lags. For the earlier rows, not all lag values can be calculated.
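If the test rows directly follow the training rows in time, the missing lags could instead be filled from the end of the training data. The following is a minimal sketch under that assumption. The simulated series and its coefficients are made up, and lag_k() and add_lags() are hypothetical base-R helpers, not dplyr functions.

```r
set.seed(42)
n <- 120
X <- rnorm(n)
Y <- rnorm(n)
TV <- numeric(n)
# Made-up data-generating process, for illustration only
for (t in 4:n) {
  TV[t] <- 0.5 * X[t] - 0.3 * Y[t] + 0.4 * TV[t - 1] + 0.2 * TV[t - 2] +
    0.1 * TV[t - 3] + rnorm(1, sd = 0.1)
}
full <- data.frame(X = X, Y = Y, TV = TV)

# Hypothetical helpers: shift a vector by k, and add the three lag columns
lag_k <- function(x, k) c(rep(NA, k), head(x, -k))
add_lags <- function(d) {
  d$TVm1 <- lag_k(d$TV, 1)
  d$TVm2 <- lag_k(d$TV, 2)
  d$TVm3 <- lag_k(d$TV, 3)
  d
}

# Train on the first 100 time points
train <- add_lags(full[1:100, ])
fit <- lm(TV ~ X + Y + TVm1 + TVm2 + TVm3, data = train)

# Prepend the last 3 training rows so every test row gets complete lags,
# then drop those 3 helper rows again
test_ext <- add_lags(full[98:120, ])
test <- test_ext[-(1:3), ]
pred <- predict(fit, newdata = test)
```

Note that this gives one-step-ahead predictions: each lag in the test rows is an observed TV value, not an earlier prediction.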

Of course, we assume that we have some kind of time series data. Otherwise, it is not possible to fit and use such a model.

You need to build the proper dataset before passing it to lm(). Several lag functions exist: one in the dplyr package and a different one (stats::lag) for use with time series objects. A quick way to create lagged versions of TV is embed():

 laggedVar <- embed(Var, 4)

For example:

> embed(1:10, 4)
     [,1] [,2] [,3] [,4]
[1,]    4    3    2    1
[2,]    5    4    3    2
[3,]    6    5    4    3
[4,]    7    6    5    4
[5,]    8    7    6    5
[6,]    9    8    7    6
[7,]   10    9    8    7
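The embed() output can be combined with the other predictors to build the regression data set. This is only a sketch on simulated data: the data frame dat and the column names TVm1 to TVm3 are assumptions chosen for illustration.

```r
set.seed(7)
n <- 50
# Made-up example data: two predictors and a random-walk target
dat <- data.frame(X = rnorm(n), Y = rnorm(n), TV = cumsum(rnorm(n)))

k <- 3                        # number of lags
E <- embed(dat$TV, k + 1)     # column 1 is TV(t), column j + 1 is TV(t - j)
colnames(E) <- c("TV", paste0("TVm", 1:k))

# embed() drops the first k time points, so drop them from X and Y as well
lagged <- data.frame(dat[-(1:k), c("X", "Y")], E)

fit <- lm(TV ~ X + Y + TVm1 + TVm2 + TVm3, data = lagged)
coef(fit)
```

Unlike the NA-padded lag columns, the embed() result has no missing rows, so the row count of lagged is n - k.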

You might also look at regression methods designed for panel data, which can be expected to have some degree of autocorrelation.
