Using predict for linear model with NA values in R

Question

I have a dataset of ~32,000 observations, for which I have created a linear model. ~12,000 observations were dropped from the fit due to missingness.

I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected], this gives the error 'replacement has 20000 rows, data has 32000'.

Is there any way I can use the model fitted on the 20,000 rows to get predictions for all 32,000? I am happy to have 'zero' for observations that don't have values for every column used in the model.
If not, how can I at least subset the 32,000-row dataset correctly so that it only includes the 20,000 complete observations? If my model was lm(a ~ x+y+z, data=data), for example, how would I filter data to only include observations with full data in x, y and z?

Answer 1

The best thing to do is to use na.action=na.exclude when you fit the model in the first place. From ?na.exclude:

when 'na.exclude' is used the residuals and predictions are padded to the correct length by inserting 'NA's for cases omitted by 'na.exclude'.
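As a minimal sketch of that suggestion: the data frame below is made up for illustration (the column names a, x, y, z follow the question, the values are hypothetical). With na.action = na.exclude, calling predict() without newdata should return one value per original row, with NA where a row was dropped.

```r
set.seed(1)

# Hypothetical data: 10 rows, NAs in x (rows 3, 7) and y (row 2)
data <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10),
  y = c(2, NA, 3, 1, 5, 4, 6, 8, 7, 9),
  z = c(5, 3, 8, 1, 9, 2, 7, 4, 6, 10)
)
data$a <- 1 + 2 * data$x - data$y + 0.5 * data$z + rnorm(10, sd = 0.1)

# Rows with any NA in a, x, y or z are left out of the fit,
# but na.exclude remembers where they were
fit <- lm(a ~ x + y + z, data = data, na.action = na.exclude)

# No newdata: predictions are padded with NA back to nrow(data),
# so the assignment no longer fails with a length mismatch
data$pred <- predict(fit)

# Optional, only if a literal 0 is really wanted for incomplete rows
data$pred_zero <- ifelse(is.na(data$pred), 0, data$pred)
```

The same padding applies to fitted(fit) and residuals(fit). Keeping NA rather than 0 is usually the safer choice, since a 0 can be mistaken for a real prediction.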

Answer 2

Using

data[complete.cases(data),]

gives you only the observations without NAs. Perhaps that's what you are looking for.

Another way is

na.omit(data)

which additionally records the indices of the removed observations (in the na.action attribute of the result).
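Note that both calls above look at every column of data. If the data frame has columns that are not in the model, a sketch like the following restricts the check to the model variables only. The data frame and the extra notes column are hypothetical, made up for illustration.

```r
# Hypothetical data: 'notes' is not used in the model but contains NAs
data <- data.frame(
  a = c(1.2, 3.4, 2.2, NA, 5.1, 4.0),
  x = c(1, NA, 3, 4, 5, 6),
  y = c(2, 3, NA, 1, 5, 4),
  z = c(5, 3, 8, 1, 9, 2),
  notes = c(NA, "ok", "ok", NA, "ok", NA)
)

# Only the variables that appear in lm(a ~ x + y + z)
model_vars <- c("a", "x", "y", "z")

# TRUE for rows with no NA in any of the model variables
keep <- complete.cases(data[, model_vars])
data_complete <- data[keep, ]

# For comparison: checking all columns would also drop rows
# whose only NA is in 'notes'
n_all_columns <- sum(complete.cases(data))
```

Here data_complete keeps the rows that lm() would actually use, whereas complete.cases(data) on the full data frame keeps fewer.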

Answer 3

The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 instead of missing. For instance, if your variable x had a range of 10-100, the model would treat your imputed 0s as observations below the training data's range and give you artificially low predictions. If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (i.e. replace the NAs with the mean or the median, or use k-nearest neighbors).
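To illustrate the difference, here is a small sketch with made-up numbers (x ranging from 10 to 100, as in the example above). It compares filling a missing x with 0 against filling it with the training mean. Mean imputation is used only because it is the simplest of the options mentioned; it is not necessarily the best choice for real data.

```r
# Hypothetical training data: x spans 10-100, a is roughly 2 * x
train <- data.frame(
  x = c(10, 25, 40, 55, 70, 85, 100),
  a = c(21, 49, 82, 111, 139, 171, 201)
)
fit <- lm(a ~ x, data = train)

# New rows to predict; the second one has a missing x
new <- data.frame(x = c(30, NA, 90))

# Option 1: replace NA with 0 (outside the training range)
new_zero <- new
new_zero$x[is.na(new_zero$x)] <- 0

# Option 2: replace NA with the mean of x in the training data
new_mean <- new
new_mean$x[is.na(new_mean$x)] <- mean(train$x)

pred_zero <- predict(fit, newdata = new_zero)
pred_mean <- predict(fit, newdata = new_mean)

# The zero-filled row gets a prediction below anything seen in training;
# the mean-filled row gets the average response instead
comparison <- data.frame(x = new$x, pred_zero = pred_zero, pred_mean = pred_mean)
```

With a single predictor, the mean-imputed row is simply predicted at the mean of a, which shows that imputation gives a neutral guess rather than new information about that row.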

3 answers

solution1
1 2020-05-17 00:35:20

solution2
0 2020-05-17 00:20:18

solution3
0 2020-05-17 00:41:34