Using predict for linear model with NA values in R
I have a dataset of ~32,000 observations, for which I have created a linear model. ~12,000 observations were deleted due to missingness.
I am trying to use the predict function to backtest the expected value for each of my 32,000 data points but, as expected, this gives the error 'replacement has 20000 rows, data has 32000'.
The best thing to do is to use na.action=na.exclude when you fit the model in the first place. From ?na.exclude:
when 'na.exclude' is used the residuals and predictions are padded to the correct length by inserting 'NA's for cases omitted by 'na.exclude'.
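As a minimal sketch of the padding behaviour described above, the toy data frame below (the names dat, x, y and pred are made up for illustration, not taken from the question) is fitted once with the default na.action and once with na.exclude:

```r
# Toy data standing in for the real dataset (hypothetical names).
set.seed(1)
dat <- data.frame(
  x = c(1:8, NA, 10),
  y = c(2 * (1:7) + rnorm(7), NA, 18, 20)
)

# Default na.action (na.omit): fitted values exist only for complete rows,
# so assigning them back into dat would fail with a length mismatch.
fit_omit <- lm(y ~ x, data = dat)
length(fitted(fit_omit))   # fewer values than nrow(dat)

# na.exclude: same coefficients, but residuals, fitted values and
# predictions are padded with NA for the rows that were dropped.
fit_excl <- lm(y ~ x, data = dat, na.action = na.exclude)
dat$pred <- predict(fit_excl)   # one value per row of dat
```

The two fits use exactly the same rows, so the coefficients are identical; only the length of what residuals(), fitted() and predict() return differs.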
Using data[complete.cases(data),] gives you only the observations without NAs. Perhaps that's what you are looking for.
Another way is na.omit(data), which in addition gives you the indices of the removed observations.
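A short sketch of both approaches on a made-up four-row data frame (dat, cc, no and dropped are illustrative names). The removed indices are stored by na.omit in the "na.action" attribute of its result:

```r
# Hypothetical data with one NA in each column.
dat <- data.frame(x = c(1, NA, 3, 4), y = c(2, 4, NA, 8))

# Keep only rows with no NA in any column.
cc <- dat[complete.cases(dat), ]

# Same rows, but the result also records which rows were removed.
no <- na.omit(dat)
dropped <- attr(no, "na.action")
as.integer(dropped)   # row indices of the removed observations
```

Note that both calls look at every column of the data frame, so a missing value in a column the model does not use still removes the row.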
The problem with using a 0 instead of a missing value is that the linear model will interpret the value as actually having been 0 instead of missing. For instance, if your variable x had a range of 10-100, the model would interpret your imputed 0's as observations lower than the training data's range and give you artificially low predictions (or artificially high ones, if the coefficient on x is negative). If you want to make a prediction for the rows with missing values, you're going to have to do some value imputation (i.e. replace the NAs with the mean, the median or using k-nearest neighbors).
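The sketch below illustrates the point with simulated data, using median imputation as the simplest of the options mentioned. All names (dat, newdat, pred, pred_zero) and the simulated relationship are assumptions made for the example:

```r
# Simulated data: x ranges over 10-100 with two missing values.
set.seed(42)
dat <- data.frame(x = c(10, 25, NA, 40, 55, NA, 70, 85, 100, 90))
dat$y <- 3 + 0.5 * dat$x + rnorm(10)   # y is NA wherever x is NA

# Fit on the complete rows only.
fit <- lm(y ~ x, data = dat, na.action = na.exclude)

# Median imputation: fill missing x with the median of the observed x.
newdat <- dat
newdat$x[is.na(newdat$x)] <- median(dat$x, na.rm = TRUE)
pred <- predict(fit, newdata = newdat)

# Zero imputation for comparison: the filled-in rows sit far below the
# observed range of x, which pulls their predictions down.
zerodat <- dat
zerodat$x[is.na(zerodat$x)] <- 0
pred_zero <- predict(fit, newdata = zerodat)

cbind(median_imputed = pred, zero_imputed = pred_zero)[is.na(dat$x), ]
```

Median imputation only places the missing rows at a typical value of x; it does not recover the true values, so predictions for those rows should be treated with corresponding caution.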