I have a dataset of ~32,000 observations, to which I have fitted a linear model. ~12,000 observations were dropped due to missingness.
I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected] this gives the error 'replacement has 20000 rows, data has 32000'.
The best thing to do is to use na.action=na.exclude when you fit the model in the first place. From ?na.exclude: when 'na.exclude' is used the residuals and predictions are padded to the correct length by inserting 'NA's for cases omitted by 'na.exclude'.
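A minimal sketch of the difference, using a made-up 10-row data frame (the names df, fit_omit and fit_exclude are only for illustration and are not from the question):

```r
# Hypothetical toy data: 10 rows, 2 of them with a missing predictor
df <- data.frame(
  x = c(1, 2, NA, 4, 5, NA, 7, 8, 9, 10),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2)
)

# na.omit drops incomplete rows and forgets where they were
fit_omit <- lm(y ~ x, data = df, na.action = na.omit)

# na.exclude drops the same rows for fitting, but remembers their positions
fit_exclude <- lm(y ~ x, data = df, na.action = na.exclude)

length(fitted(fit_omit))      # one value per row actually used in the fit
length(fitted(fit_exclude))   # one value per row of df, NA where x was missing

# Lengths now match the data, so the assignment no longer fails
df$pred  <- predict(fit_exclude)
df$resid <- residuals(fit_exclude)
```

The coefficients of the two fits are identical; only the bookkeeping for fitted values, predictions and residuals differs.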
Using data[complete.cases(data),] gives you only the observations without NAs. Perhaps that's what you are looking for.
Another way is na.omit(data), which returns the same rows and additionally records the indices of the removed observations (in the "na.action" attribute of the result).
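As a rough illustration of both approaches on a small made-up data frame (df, cc, no and removed are hypothetical names):

```r
# Hypothetical toy data: rows 2 and 3 each contain a missing value
df <- data.frame(x = c(1, NA, 3, 4), y = c(10, 20, NA, 40))

# Keep only rows where every column is observed
cc <- df[complete.cases(df), ]

# Same rows, but the result also carries the dropped row indices
no <- na.omit(df)

# Row positions that were removed, stored as an attribute of the result
removed <- as.integer(attr(no, "na.action"))
removed
```

Having the removed indices makes it easy to map predictions made on the reduced data back onto the rows of the original data frame.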
The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 instead of missing. For instance, if your variable x had a range of 10-100, the model would treat your imputed 0's as observations below the training data's range and give you artificially low predictions. If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (i.e. replace the NAs with the mean or the median, or use k-nearest neighbors).
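Here is a small sketch of that effect, assuming simple mean imputation and made-up data where x runs from 10 to 100 (all names and values are illustrative; mean imputation is just the simplest of the options mentioned, not necessarily the best one for your data):

```r
# Hypothetical training data: x ranges over 10-100, two values are missing
train <- data.frame(
  x = c(10, 25, 40, NA, 70, 85, 100, NA),
  y = c(12, 27, 41, 50, 72, 83, 101, 60)
)

# Rows with missing x are dropped when fitting
fit <- lm(y ~ x, data = train)

miss <- is.na(train$x)

# Option A: replace NA with 0 (outside the observed range of x)
x_zero <- ifelse(miss, 0, train$x)

# Option B: replace NA with the mean of the observed x values
x_mean <- ifelse(miss, mean(train$x, na.rm = TRUE), train$x)

pred_zero <- predict(fit, newdata = data.frame(x = x_zero))
pred_mean <- predict(fit, newdata = data.frame(x = x_mean))

# Compare predictions for the rows that had missing x
pred_zero[miss]   # extrapolated below the training range
pred_mean[miss]   # predicted at the center of the observed data
```

With a positive slope, the zero-filled rows get predictions pulled toward the intercept, while the mean-filled rows get a prediction near the average response.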