Can the subset() function within the lm() R function be used to remove observations for only certain variables?

I am not sure my question makes sense, but I am considering modifying an econometric model that uses time series data. It is a multiple regression. One of the independent variables is the 5-year Treasury rate, which is split over two time periods. One variable is the 5-year Treasury rate from 1950 to 1986; after 1986 this variable takes the value 0. The second is the 5-year Treasury rate from 1986 to the present; before 1986, this second variable is 0. Someone suggested I replace the 0 values with blanks (equivalent to missing data), on the grounds that the variables' meanings would supposedly be better specified. Could you do that with the subset() function? In other words, could you in effect remove or ignore the 0 values in those variables without removing or ignoring the entire row of data, including the values of the other independent variables? I know this coding question is contingent on whether the process even makes sense, and I am not sure it does. I have put the theoretical question to Cross Validated, but I am not sure I will get an answer, so I figured I would go ahead and ask the coding question here.

Assuming your data is in a data frame, the answer is "no." You cannot use subset on only part of a data.frame. That's because subset on a data frame returns another data frame, and in a data frame all of the variables must be the same length.
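As a rough illustration of this point, here is a minimal sketch using a made-up data frame (the column names and values are hypothetical, not the asker's data). It shows that filtering on one variable with subset drops whole rows, and that R refuses to build a data frame from columns of different lengths.

```r
# Hypothetical toy data; names and values are illustrative only.
df <- data.frame(
  year      = c(1984, 1985, 1987, 1988),
  rate_pre  = c(11.5, 10.1, 0, 0),   # set to 0 after 1986
  rate_post = c(0, 0, 7.9, 8.5),     # set to 0 before 1986
  y         = c(2.1, 2.4, 3.0, 3.3)
)

# Filtering on one variable removes entire rows, so the values of every
# other variable in those rows disappear too.
kept <- subset(df, rate_pre != 0)
nrow(kept)  # 2 rows remain
ncol(kept)  # still 4 columns, all of length 2

# Columns of unequal length (3 vs. 2) cannot be combined into a data frame.
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")  # TRUE
```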

There are plenty of ways to work around this restriction, but they won't work with lm. Think about how regression works: every observation must be fully observed. If you have missing data, you have three options:

  1. Delete the observations with missing data. This is called listwise deletion, and it is the default in lm (by way of the na.omit function, which is applied by the model.frame function inside lm)
  2. Impute the missing data. This is a massive field and an area of active research
  3. Use some other method, like a Bayesian model that can integrate over the missing data
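The first option can be seen directly in a small simulated example. This is only a sketch: the data are randomly generated and the variable names are hypothetical. na.action = na.omit is passed explicitly here so the result does not depend on the session's options.

```r
set.seed(1)
n <- 20
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

# Pretend the first five values of x2 are missing.
d$x2[1:5] <- NA

# Listwise deletion: any row with an NA in a model variable is dropped
# before fitting, even though x1 and y are observed in those rows.
fit <- lm(y ~ x1 + x2, data = d, na.action = na.omit)

nobs(fit)              # 15 observations actually used
length(fit$na.action)  # 5 rows were dropped
```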

You should be able to get help in this area from Cross Validated. But the fact remains, there is simply no way to use lm on variables of unequal length, and there is no way to get subset to return a data frame containing variables of unequal length because all variables in a data frame must be the same length.
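To connect this back to the question, the sketch below assumes a setup like the one described: two split rate variables, each 0 in the period covered by the other. The data are made up and the names are hypothetical. If the zeros are recoded as NA, every row is missing one of the two variables, so listwise deletion leaves nothing to fit.

```r
# Hypothetical toy data mimicking the split-variable setup.
df <- data.frame(
  rate_pre  = c(11.5, 10.1, 9.3, 0, 0, 0),
  rate_post = c(0, 0, 0, 7.9, 8.5, 8.1),
  y         = c(2.1, 2.4, 2.2, 3.0, 3.3, 3.1)
)

# Recode the structural zeros as missing values.
df_na <- df
df_na$rate_pre[df_na$rate_pre == 0]   <- NA
df_na$rate_post[df_na$rate_post == 0] <- NA

# Every row now has an NA in one of the two regressors.
sum(complete.cases(df_na))  # 0

# With no complete rows, lm has no data left and stops with an error.
fit_na <- try(lm(y ~ rate_pre + rate_post, data = df_na), silent = TRUE)
inherits(fit_na, "try-error")  # TRUE
```

Whether keeping the zeros is the right specification is the statistical question for Cross Validated; the sketch only shows what lm does mechanically once they are recoded as NA.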
