简体   繁体   English

具有lm的线性模型:如何获取预测值总和的预测方差

[英]linear model with `lm`: how to get prediction variance of sum of predicted values

I'm summing the predicted values from a linear model with multiple predictors, as in the example below, and want to calculate the combined variance, standard error and possibly confidence intervals for this sum. 我正在对具有多个预测变量的线性模型的预测值求和,如下面的示例所示,并希望计算该总和的组合方差,标准误差和可能的置信区间。

lm.tree <- lm(Volume ~ poly(Girth,2), data = trees)

Suppose I have a set of Girths : 假设我有一组Girths

newdat <- list(Girth = c(10,12,14,16)

for which I want to predict the total Volume : 为此,我要预测总Volume

pr <- predict(lm.tree, newdat, se.fit = TRUE)
total <- sum(pr$fit)
# [1] 111.512

How can I obtain the variance for total ? 如何获得total的方差?

Similar questions are here (for GAMs) , but I'm not sure how to proceed with the vcov(lm.trees) . 类似的问题在这里(针对GAM) ,但是我不确定如何继续vcov(lm.trees) I'd be grateful for a reference for the method. 我希望为该方法提供参考。

You need to obtain full variance-covariance matrix, then sum all its elements. 您需要获取完整的方差-协方差矩阵,然后将其所有元素求和。 Here is small proof: 这是个小证明:

在此处输入图片说明

The proof here is using another theorem, which you can find from Covariance-wikipedia : 这里的证明使用了另一个定理,您可以从Covariance-wikipedia中找到:

在此处输入图片说明

Specifically, the linear transform we take is a column matrix of all 1's. 具体来说,我们采用的线性变换是全1的列矩阵。 The resulting quadratic form is computed as following , with all x_i and x_j being 1. 将得到的二次形式被计算如下 ,与所有x_ix_j为1。

在此处输入图片说明


Setup 设定

## your model
lm.tree <- lm(Volume ~ poly(Girth, 2), data = trees)

## newdata (a data frame)
newdat <- data.frame(Girth = c(10, 12, 14, 16))

Re-implement predict.lm to compute variance-covariance matrix 重新实现predict.lm以计算方差-协方差矩阵

See How does predict.lm() compute confidence interval and prediction interval? 请参见predict.lm()如何计算置信区间和预测区间? for how predict.lm works. 有关predict.lm工作方式。 The following small function lm_predict mimics what it does, except that 以下小型函数lm_predict模仿了它的功能,除了

  • it does not construct confidence or prediction interval (but construction is very straightforward as explained in that Q & A); 它不会构造置信度或预测间隔(但正如该问答所解释的那样,构造非常简单);
  • it can compute complete variance-covariance matrix of predicted values if diag = FALSE ; 如果diag = FALSE ,它可以计算完整的预测值方差-协方差矩阵;
  • it returns variance (for both predicted values and residuals), not standard error; 它返回方差(针对预测值和残差),而不是标准误差;
  • it can not do type = "terms" ; 它不能做type = "terms" ; it only predict response variable. 它只预测响应变量。

lm_predict <- function (lmObject, newdata, diag = TRUE) {
  ## input checking
  if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
  ## extract "terms" object from the fitted model, but delete response variable
  tm <- delete.response(terms(lmObject))      
  ## linear predictor matrix
  Xp <- model.matrix(tm, newdata)
  ## predicted values by direct matrix-vector multiplication
  pred <- c(Xp %*% coef(lmObject))
  ## efficiently form the complete variance-covariance matrix
  QR <- lmObject$qr   ## qr object of fitted model
  piv <- QR$pivot     ## pivoting index
  r <- QR$rank        ## model rank / numeric rank
  if (is.unsorted(piv)) {
    ## pivoting has been done
    B <- forwardsolve(t(QR$qr), t(Xp[, piv]), r)
    } else {
    ## no pivoting is done
    B <- forwardsolve(t(QR$qr), t(Xp), r)
    }
  ## residual variance
  sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
  if (diag) {
    ## return point-wise prediction variance
    VCOV <- colSums(B ^ 2) * sig2
    } else {
    ## return full variance-covariance matrix of predicted values
    VCOV <- crossprod(B) * sig2
    }
  list(fit = pred, var.fit = VCOV, df = lmObject$df.residual, residual.var = sig2)
  }

We can compare its output with that of predict.lm : 我们可以将其输出与predict.lm进行比较:

predict.lm(lm.tree, newdat, se.fit = TRUE)
#$fit
#       1        2        3        4 
#15.31863 22.33400 31.38568 42.47365 
#
#$se.fit
#        1         2         3         4 
#0.9435197 0.7327569 0.8550646 0.8852284 
#
#$df
#[1] 28
#
#$residual.scale
#[1] 3.334785

lm_predict(lm.tree, newdat)
#$fit
#[1] 15.31863 22.33400 31.38568 42.47365
#
#$var.fit    ## the square of `se.fit`
#[1] 0.8902294 0.5369327 0.7311355 0.7836294
#
#$df
#[1] 28
#
#$residual.var   ## the square of `residual.scale`
#[1] 11.12079

And in particular: 特别是:

oo <- lm_predict(lm.tree, newdat, FALSE)
oo
#$fit
#[1] 15.31863 22.33400 31.38568 42.47365
#
#$var.fit
#            [,1]      [,2]       [,3]       [,4]
#[1,]  0.89022938 0.3846809 0.04967582 -0.1147858
#[2,]  0.38468089 0.5369327 0.52828797  0.3587467
#[3,]  0.04967582 0.5282880 0.73113553  0.6582185
#[4,] -0.11478583 0.3587467 0.65821848  0.7836294
#
#$df
#[1] 28
#
#$residual.var
#[1] 11.12079

Note that the variance-covariance matrix is not computed in a naive way: Xp %*% vcov(lmObject) % t(Xp) , which is slow. 请注意,方差-协方差矩阵不是天真的: Xp %*% vcov(lmObject) % t(Xp) ,这很慢。

Aggregation (sum) 汇总(总和)

In your case, the aggregation operation is the sum of all values in oo$fit . 在您的情况下,聚合操作是oo$fit中所有值的总和。 The mean and variance of this aggregation are 此聚合的均值和方差为

sum_mean <- sum(oo$fit)  ## mean of the sum
# 111.512

sum_variance <- sum(oo$var.fit)  ## variance of the sum
# 6.671575

You can further construct confidence interval (CI) for this aggregated value, by using t-distribution and the residual degree of freedom in the model. 您可以通过使用t分布和模型中的剩余自由度来进一步构造此聚合值的置信区间(CI)。

alpha <- 0.95
Qt <- c(-1, 1) * qt((1 - alpha) / 2, lm.tree$df.residual, lower.tail = FALSE)
#[1] -2.048407  2.048407

## %95 CI
sum_mean + Qt * sqrt(sum_variance)
#[1] 106.2210 116.8029

Constructing prediction interval (PI) needs further account for residual variance. 构建预测间隔(PI)需要进一步考虑残差。

## adjusted variance-covariance matrix
VCOV_adj <- with(oo, var.fit + diag(residual.var, nrow(var.fit)))

## adjusted variance for the aggregation
sum_variance_adj <- sum(VCOV_adj)  ## adjusted variance of the sum

## 95% PI
sum_mean + Qt * sqrt(sum_variance_adj)
#[1]  96.86122 126.16268

Aggregation (in general) 汇总(一般)

A general aggregation operation can be a linear combination of oo$fit : 一般的聚合操作可以是oo$fit的线性组合:

w[1] * fit[1] + w[2] * fit[2] + w[3] * fit[3] + ...

For example, the sum operation has all weights being 1; 例如,求和运算的所有权重均为1; the mean operation has all weights being 0.25 (in case of 4 data). 平均运算的所有权重为0.25(如果有4个数据)。 Here is function that takes a weight vector, a significance level and what is returned by lm_predict to produce statistics of an aggregation. 这是一个函数,它采用一个权重向量,一个显着性水平以及lm_predict返回的lm_predict以生成聚合统计信息。

agg_pred <- function (w, predObject, alpha = 0.95) {
  ## input checing
  if (length(w) != length(predObject$fit)) stop("'w' has wrong length!")
  if (!is.matrix(predObject$var.fit)) stop("'predObject' has no variance-covariance matrix!")
  ## mean of the aggregation
  agg_mean <- c(crossprod(predObject$fit, w))
  ## variance of the aggregation
  agg_variance <- c(crossprod(w, predObject$var.fit %*% w))
  ## adjusted variance-covariance matrix
  VCOV_adj <- with(predObject, var.fit + diag(residual.var, nrow(var.fit)))
  ## adjusted variance of the aggregation
  agg_variance_adj <- c(crossprod(w, VCOV_adj %*% w))
  ## t-distribution quantiles
  Qt <- c(-1, 1) * qt((1 - alpha) / 2, predObject$df, lower.tail = FALSE)
  ## names of CI and PI
  NAME <- c("lower", "upper")
  ## CI
  CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
  ## PI
  PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
  ## return
  list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
  }

A quick test on the previous sum operation: 快速测试以前的求和运算:

agg_pred(rep(1, length(oo$fit)), oo)
#$mean
#[1] 111.512
#
#$var
#[1] 6.671575
#
#$CI
#   lower    upper 
#106.2210 116.8029 
#
#$PI
#    lower     upper 
# 96.86122 126.16268 

And a quick test for average operation: 并对平均运行情况进行快速测试:

agg_pred(rep(1, length(oo$fit)) / length(oo$fit), oo)
#$mean
#[1] 27.87799
#
#$var
#[1] 0.4169734
#
#$CI
#   lower    upper 
#26.55526 29.20072 
#
#$PI
#   lower    upper 
#24.21531 31.54067 

Remark 备注

This answer is improved to provide easy-to-use functions for Linear regression with `lm()`: prediction interval for aggregated predicted values . 改进了该答案,以提供具有lm()的线性回归的易于使用的函数:聚合预测值的预测间隔


Upgrade (for big data) 升级(用于大数据)

This is great! 这很棒! Thank you so much! 非常感谢! There is one thing I forgot to mention: in my actual application I need to sum ~300,000 predictions which would create a full variance-covariance matrix which is about ~700GB in size. 我忘了提到一件事:在我的实际应用中,我需要汇总约300,000个预测,这将创建一个完整的方差-协方差矩阵,其大小约为700GB。 Do you have any idea if there is a computationally more efficient way to directly get to the sum of the variance-covariance matrix? 您是否知道是否存在一种计算上更有效的方法来直接求出方差-协方差矩阵的和?

Thanks to the OP of Linear regression with `lm()`: prediction interval for aggregated predicted values for this very helpful comment. 感谢使用带有lm()线性回归的OP 这个非常有用的注释的预测值的预测间隔 Yes, it is possible and it is also (significantly) computationally cheaper. 是的,这是可能的,并且在计算上也更便宜。 At the moment, lm_predict form the variance-covariance as such: 此刻, lm_predict形成方差-协方差:

在此处输入图片说明

agg_pred computes the prediction variance (for constructing CI) as a quadratic form: w'(B'B)w , and the prediction variance (for construction PI) as another quadratic form w'(B'B + D)w , where D is a diagonal matrix of residual variance. agg_pred将预测方差(用于构造CI)计算为二次形式: w'(B'B)w ,并将预测方差(用于构造PI)计算为另一次二次形式w'(B'B + D)w ,其中D是残差方差的对角矩阵。 Obviously if we fuse those two functions, we have a better computational strategy: 显然,如果我们将这两个功能融合在一起,我们将拥有更好的计算策略:

在此处输入图片说明

Computation of B and B'B is avoided; 避免计算BB'B we have replaced all matrix-matrix multiplication to matrix-vector multiplication. 我们已将所有矩阵矩阵乘法替换为矩阵向量乘法。 There is no memory storage for B and B'B ; BB'B没有存储空间; only for u which is just a vector. 仅用于u ,它只是一个向量。 Here is the fused implementation. 这是融合的实现。

## this function requires neither `lm_predict` nor `agg_pred`
fast_agg_pred <- function (w, lmObject, newdata, alpha = 0.95) {
  ## input checking
  if (!inherits(lmObject, "lm")) stop("'lmObject' is not a valid 'lm' object!")
  if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata)
  if (length(w) != nrow(newdata)) stop("length(w) does not match nrow(newdata)")
  ## extract "terms" object from the fitted model, but delete response variable
  tm <- delete.response(terms(lmObject))      
  ## linear predictor matrix
  Xp <- model.matrix(tm, newdata)
  ## predicted values by direct matrix-vector multiplication
  pred <- c(Xp %*% coef(lmObject))
  ## mean of the aggregation
  agg_mean <- c(crossprod(pred, w))
  ## residual variance
  sig2 <- c(crossprod(residuals(lmObject))) / df.residual(lmObject)
  ## efficiently compute variance of the aggregation without matrix-matrix computations
  QR <- lmObject$qr   ## qr object of fitted model
  piv <- QR$pivot     ## pivoting index
  r <- QR$rank        ## model rank / numeric rank
  u <- forwardsolve(t(QR$qr), c(crossprod(Xp, w))[piv], r)
  agg_variance <- c(crossprod(u)) * sig2
  ## adjusted variance of the aggregation
  agg_variance_adj <- agg_variance + c(crossprod(w)) * sig2
  ## t-distribution quantiles
  Qt <- c(-1, 1) * qt((1 - alpha) / 2, lmObject$df.residual, lower.tail = FALSE)
  ## names of CI and PI
  NAME <- c("lower", "upper")
  ## CI
  CI <- setNames(agg_mean + Qt * sqrt(agg_variance), NAME)
  ## PI
  PI <- setNames(agg_mean + Qt * sqrt(agg_variance_adj), NAME)
  ## return
  list(mean = agg_mean, var = agg_variance, CI = CI, PI = PI)
  }

Let's have a quick test. 让我们进行快速测试。

## sum opeartion
fast_agg_pred(rep(1, nrow(newdat)), lm.tree, newdat)
#$mean
#[1] 111.512
#
#$var
#[1] 6.671575
#
#$CI
#   lower    upper 
#106.2210 116.8029 
#
#$PI
#    lower     upper 
# 96.86122 126.16268 

## average operation
fast_agg_pred(rep(1, nrow(newdat)) / nrow(newdat), lm.tree, newdat)
#$mean
#[1] 27.87799
#
#$var
#[1] 0.4169734
#
#$CI
#   lower    upper 
#26.55526 29.20072 
#
#$PI
#   lower    upper 
#24.21531 31.54067 

Yes, the answer is correct! 是的,答案是正确的!

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM