Using predict with a linear model and NA values in R

Question

I have a dataset of ~32,000 observations, for which I have created a linear model. ~12,000 observations were deleted due to missingness.

I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but [as expected] this gives the error 'replacement has 20000 rows, data has 32000'.

Is there any way I can use the model fitted on the 20,000 rows to get predictions for all 32,000? I am happy to have 'zero' for observations that don't have a value in every column used in the model.
If not, how can I at least subset the 32,000-row dataset correctly so that it only includes the 20,000 complete observations? If my model was lm(a ~ x+y+z, data=data), for example, how would I filter data to only include observations with full data in x, y and z?

Answer 1

The best thing to do is to use na.action=na.exclude when you fit the model in the first place. From ?na.exclude:

when 'na.exclude' is used the residuals and predictions are padded to the correct length by inserting 'NA's for cases omitted by 'na.exclude'.
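As a minimal sketch of this approach, the following fits the model with na.action = na.exclude on a small made-up data frame. The data, the seed, and the column names pred and resid are assumptions for illustration only; the formula mirrors the one in the question.

```r
# Hypothetical toy data standing in for the asker's data frame
set.seed(1)
data <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
data$a <- 1 + 2 * data$x - data$y + 0.5 * data$z + rnorm(10, sd = 0.1)

# Introduce missing predictor values in rows 2, 5 and 7
data$x[c(2, 5)] <- NA
data$z[7] <- NA

# na.exclude drops incomplete rows for fitting but remembers where they were
fit <- lm(a ~ x + y + z, data = data, na.action = na.exclude)

# Predictions and residuals are padded with NA for the dropped rows,
# so they have one entry per row of data and can be assigned directly
data$pred  <- predict(fit)
data$resid <- residuals(fit)
```

Because the padded vectors line up with the original rows, the assignment no longer fails with a "replacement has ... rows" error; rows that were left out of the fit simply get NA.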

Answer 2

Using

data[complete.cases(data),]

gives you only the observations without NAs. Perhaps that's what you are looking for.

Another way is

na.omit(data)

which additionally gives you the indices of the removed observations (stored in the "na.action" attribute of the result).
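Note that both calls above look at every column of data, so a row with an NA in a column the model does not use is dropped too. A possible sketch for restricting the check to the model variables is shown here; the toy data, the unrelated note column, and the names model_vars, keep, complete_data and dropped are assumptions made for the example.

```r
# Hypothetical toy data; 'note' is an unrelated column with its own NA
set.seed(42)
data <- data.frame(a = rnorm(8), x = rnorm(8), y = rnorm(8), z = rnorm(8),
                   note = c(NA, letters[1:7]))
data$x[3] <- NA
data$z[6] <- NA

# Only check the columns that appear in lm(a ~ x + y + z)
model_vars <- c("a", "x", "y", "z")
keep <- complete.cases(data[, model_vars])
complete_data <- data[keep, ]

# na.omit records which rows it removed in the "na.action" attribute
dropped <- attr(na.omit(data[, model_vars]), "na.action")

# The filtered data has as many rows as lm() actually uses
fit <- lm(a ~ x + y + z, data = data)
```

The response a is included in model_vars because lm also drops rows where the response is missing.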

Answer 3

The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 instead of missing. For instance, if your variable x had a range of 10-100, the model would interpret your imputed 0's as observations lower than the training data's range and give you artificially low predictions (assuming x has a positive coefficient). If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (e.g. replace the NAs with the mean or the median, or use k-nearest neighbors).
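A rough sketch of the simplest option, mean imputation, might look like the following. The toy data (with x drawn from the 10-100 range used in the example above), the seed, and the names imputed, zeroed, pred_imputed and pred_zero are assumptions for illustration; median or k-nearest-neighbor imputation would follow the same pattern with a different fill value.

```r
# Hypothetical toy data with x in the 10-100 range
set.seed(7)
data <- data.frame(x = runif(12, 10, 100), y = rnorm(12), z = rnorm(12))
data$a <- 3 + 0.2 * data$x + data$y - data$z + rnorm(12, sd = 0.5)
data$x[c(4, 9)] <- NA

# Fit on the complete rows (lm drops incomplete rows by default)
fit <- lm(a ~ x + y + z, data = data)

# Mean imputation: fill each predictor's NAs with its observed mean
imputed <- data
for (v in c("x", "y", "z")) {
  imputed[[v]][is.na(imputed[[v]])] <- mean(data[[v]], na.rm = TRUE)
}
pred_imputed <- predict(fit, newdata = imputed)

# For comparison: filling with 0 puts x far below its observed range
zeroed <- data
zeroed$x[is.na(zeroed$x)] <- 0
pred_zero <- predict(fit, newdata = zeroed)
```

Comparing pred_zero and pred_imputed for rows 4 and 9 shows how much the zero fill shifts the prediction relative to a value inside the observed range.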

3 answers

Answer 1
1 2020-05-17 00:35:20

Answer 2
0 2020-05-17 00:20:18

Answer 3
0 2020-05-17 00:41:34
