Using predict for linear model with NA values in R

I have a dataset of ~32,000 observations, for which I have fitted a linear model. ~12,000 observations were deleted due to missingness.

I am trying to use the predict function to backtest the expected value for each of my 32,000 data points, but, as expected, this gives the error 'replacement has 20000 rows, data has 32000'.

  1. Is there any way I can use the model fitted on the 20,000 rows to predict for all 32,000? I am happy to have 'zero' for observations that don't have values for every column used in the model.
  2. If not, how can I at least subset the 32,000-row dataset correctly so that it only includes the 20,000 complete observations? If my model was lm(a ~ x + y + z, data = data), for example, how would I filter data to only include observations with full data in x, y and z?

The best thing to do is to use na.action=na.exclude when you fit the model in the first place. From ?na.exclude:

when 'na.exclude' is used the residuals and predictions are padded to the correct length by inserting 'NA's for cases omitted by 'na.exclude'.
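As a minimal sketch of what that looks like in practice, the example below fits the question's lm(a ~ x + y + z) on simulated data. The data frame dat, its values, and the positions of the missing entries are all made up for illustration; only the model formula comes from the question.

```r
# Simulated stand-in for the real data (hypothetical values).
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n))
dat$a <- 1 + 2 * dat$x - dat$y + 0.5 * dat$z + rnorm(n)

# Knock out some predictor values: rows 3, 10, 25 and 40 become incomplete.
dat$x[c(3, 10, 25)] <- NA
dat$z[c(10, 40)] <- NA

# na.exclude drops incomplete rows for fitting but remembers where they were.
fit <- lm(a ~ x + y + z, data = dat, na.action = na.exclude)

# Fitted values and residuals are padded back to nrow(dat) with NA,
# so they can be assigned straight into the original data frame.
dat$pred <- predict(fit)
dat$res  <- residuals(fit)

# Passing the full data as newdata also yields one value per row,
# with NA wherever a predictor is missing.
pred_new <- predict(fit, newdata = dat)
```

With the default na.action (na.omit), predict(fit) would instead return only as many values as there were complete rows, which is what triggers the 'replacement has ... rows' error on assignment.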

Using

data[complete.cases(data),]

gives you only the observations without NAs. Note that this checks every column of data, not just the columns used in the model. Perhaps that's what you are looking for.
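If the data frame has other columns that may contain NAs, one option is to restrict complete.cases to the variables in the model. The sketch below assumes the question's formula a ~ x + y + z; the data frame dat and the extra column unused are hypothetical.

```r
# Simulated stand-in for the real data (hypothetical values).
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n), unused = rnorm(n))
dat$a <- 1 + 2 * dat$x - dat$y + 0.5 * dat$z + rnorm(n)
dat$x[c(3, 10, 25)] <- NA
dat$z[c(10, 40)] <- NA
dat$unused[c(50, 60)] <- NA   # missing values in a column the model ignores

# Take the variable names from the formula so the filter tracks the model.
model_formula <- a ~ x + y + z
model_vars <- all.vars(model_formula)   # "a" "x" "y" "z"

# Keep rows that are complete in the model variables only.
keep <- complete.cases(dat[, model_vars])
dat_complete <- dat[keep, ]

# This subset has exactly the rows lm() uses with the default NA handling.
fit <- lm(model_formula, data = dat)
```

Here dat_complete keeps rows 50 and 60 even though unused is missing there, whereas dat[complete.cases(dat), ] would drop them.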

Another way is

na.omit(data)

which additionally records the indices of the removed observations, in the na.action attribute of the result.
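A short sketch of how those indices can be retrieved, again on a made-up data frame dat:

```r
# Simulated stand-in for the real data (hypothetical values).
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n), y = rnorm(n), z = rnorm(n))
dat$a <- 1 + 2 * dat$x - dat$y + 0.5 * dat$z + rnorm(n)
dat$x[c(3, 10, 25)] <- NA
dat$z[c(10, 40)] <- NA

# Drop every row that has an NA in any column.
cleaned <- na.omit(dat)

# The removed row positions are stored in the "na.action" attribute.
dropped <- as.integer(attr(cleaned, "na.action"))
dropped
```

Like complete.cases(data), na.omit(data) looks at all columns, so the same caveat about columns outside the model applies.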

The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 rather than missing. For instance, if your variable x had a range of 10-100, the model would treat your imputed 0s as observations below the training data's range and give you artificially low predictions. If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (i.e., replace the NAs with the mean or the median, or use k-nearest neighbors).
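To illustrate the difference, here is a rough sketch comparing zero-filling with simple mean imputation. It assumes simulated data in which x ranges over 10-100 and has a positive effect on a; the data frame dat, the coefficients, and the missing rows are invented for the example. Mean imputation is used only because it is the simplest option, not because it is the best one.

```r
# Simulated data (hypothetical): x lies in [10, 100] and raises a.
set.seed(42)
n <- 200
dat <- data.frame(x = runif(n, 10, 100), y = rnorm(n), z = rnorm(n))
dat$a <- 5 + 2 * dat$x + dat$y - dat$z + rnorm(n)

# Make x missing in a few rows.
miss <- c(4, 17, 58)
dat$x[miss] <- NA

# Fit on the complete rows (default na.action drops the incomplete ones).
fit <- lm(a ~ x + y + z, data = dat)

# Option 1: replace missing x with 0, a value outside the observed range.
zero_filled <- dat
zero_filled$x[miss] <- 0

# Option 2: replace missing x with the mean of the observed x values.
mean_filled <- dat
mean_filled$x[miss] <- mean(dat$x, na.rm = TRUE)

pred_zero <- predict(fit, newdata = zero_filled)
pred_mean <- predict(fit, newdata = mean_filled)

# Compare the two sets of predictions for the rows that had missing x.
comparison <- cbind(zero = pred_zero[miss], mean = pred_mean[miss])
comparison
```

In this setup the zero-filled predictions come out far below the mean-imputed ones, because the model extrapolates to an x value it never saw. Either way, the imputed rows carry more uncertainty than the complete ones, so it may be worth flagging them.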

