Is there a fast estimation of simple regression (a regression line with only intercept and slope)?

Question

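This question relates to a machine learning feature selection procedure.

I have a large feature matrix. Rows are subjects and columns are the features of those subjects: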

set.seed(1)
features.mat <- matrix(rnorm(10*100),ncol=100)
colnames(features.mat) <- paste("F",1:100,sep="")
rownames(features.mat) <- paste("S",1:10,sep="")

Each subject (S) has a response measured under different conditions (C), so the data look like this:

response.df <-
data.frame(S = c(sapply(1:10, function(x) rep(paste("S", x, sep = ""),100))),
           C = rep(paste("C", 1:100, sep = ""), 10),
           response = rnorm(1000), stringsAsFactors = F)

So I match the subjects in response.df to the rows of features.mat:

match.idx <- match(response.df$S, rownames(features.mat))

I'm looking for a fast way to compute the univariate regression of the response on each feature.

Is there anything faster than this?

fun <- function(f){
  fit <- lm(response.df$response ~ features.mat[match.idx,f])
  beta <- coef(summary(fit))
  data.frame(feature = colnames(features.mat)[f], effect = beta[2,1],
             p.val = beta[2,4], stringsAsFactors = F)
  }

res <- do.call(rbind, lapply(1:ncol(features.mat), fun))

I am interested in a marginal boost, i.e. methods other than parallel computing via mclapply or mclapply2.

Answer 1

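I will provide a lightweight toy routine for estimating a simple regression model y ~ x, i.e. a regression line with only an intercept and a slope. As the benchmark further down shows, it is about 36 times faster than lm + summary.lm in mean run time.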

## toy data
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

## fast estimation of simple linear regression: y ~ x 
simplelm <- function (x, y) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting an intercept-free model: yc ~ xc + 0
  xty <- c(crossprod(xc, yc))
  xtx <- c(crossprod(xc))
  slope <- xty / xtx
  rc <- yc - xc * slope
  ## Pearson estimate of residual variance
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  slope_se <- sqrt(sigma2 / xtx)
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

Let's do a test:

simplelm(x, y)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

On the other hand, lm + summary.lm gives:

coef(summary(lm(y ~ x)))

#             Estimate Std. Error   t value     Pr(>|t|)
#(Intercept) 0.1154549 0.01373051  8.408633 5.350248e-11
#x           0.2656737 0.02279663 11.654079 1.337380e-15

So the results match. If you need R-squared and adjusted R-squared, they can be easily computed too.

Let's run a benchmark:
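As a minimal sketch of that last point: with centred data, R-squared for a simple regression equals the squared correlation between x and y, so it can be formed from the same cross products that simplelm already computes. The helper name simple_r2 is hypothetical and not part of the original answer.

```r
## sketch: R-squared and adjusted R-squared for y ~ x
## (simple_r2 is a hypothetical helper, not from the original answer)
simple_r2 <- function (x, y) {
  n <- length(x)
  ## centring, as in simplelm
  yc <- y - sum(y) / n
  xc <- x - sum(x) / n
  xty <- c(crossprod(xc, yc))
  xtx <- c(crossprod(xc))
  ## total sum of squares of the centred response
  tss <- c(crossprod(yc))
  ## explained share of variance = squared correlation
  r2 <- xty ^ 2 / (xtx * tss)
  ## adjustment for a model with intercept and one slope
  adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 2)
  c("R-squared" = r2, "Adj. R-squared" = adj_r2)
  }

## same toy data as above
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)
simple_r2(x, y)
```

The two values can be checked against summary(lm(y ~ x))$r.squared and summary(lm(y ~ x))$adj.r.squared.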

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(coef(summary(lm(y ~ x))), simplelm(x, y))

#Unit: microseconds
#                     expr      min       lq       mean   median       uq
# coef(summary(lm(y ~ x))) 14158.28 14305.28 17545.1544 14444.34 17089.00
#           simplelm(x, y)   235.08   265.72   485.4076   288.20   319.46
#      max neval cld
# 114662.2   100   b
#   3409.6   100  a

Wow! That is a 36-fold speedup in mean time (17545 / 485), and about 50-fold in median time (14444 / 288).
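For the setting in the question, where the same response is regressed on every column of a feature matrix, the same algebra can be applied to all columns at once instead of looping. The following is only a sketch under that assumption; simplelm_all is a hypothetical helper that is not part of the original answer, and it obtains the residual sum of squares by subtraction, which may lose a little accuracy when the fit is nearly perfect.

```r
## data from the question
set.seed(1)
features.mat <- matrix(rnorm(10 * 100), ncol = 100)
colnames(features.mat) <- paste("F", 1:100, sep = "")
rownames(features.mat) <- paste("S", 1:10, sep = "")
response.df <-
data.frame(S = rep(paste("S", 1:10, sep = ""), each = 100),
           C = rep(paste("C", 1:100, sep = ""), 10),
           response = rnorm(1000), stringsAsFactors = FALSE)
match.idx <- match(response.df$S, rownames(features.mat))

## all simple regressions y ~ X[, j] in one pass
## (simplelm_all is a hypothetical helper)
simplelm_all <- function (X, y) {
  n <- nrow(X)
  ## centring the response and every column of X
  yc <- y - mean(y)
  Xc <- sweep(X, 2, colMeans(X))
  ## one cross product per column
  xty <- c(crossprod(Xc, yc))
  xtx <- colSums(Xc ^ 2)
  slope <- xty / xtx
  ## residual sum of squares: yc'yc - slope * xc'yc
  rss <- sum(yc ^ 2) - slope * xty
  slope_se <- sqrt(rss / (n - 2) / xtx)
  tscore <- slope / slope_se
  data.frame(feature = colnames(X), effect = slope,
             p.val = 2 * pt(abs(tscore), n - 2, lower.tail = FALSE),
             stringsAsFactors = FALSE)
  }

res <- simplelm_all(features.mat[match.idx, ], response.df$response)
head(res)
```

The result has the same columns as the res data frame built with lapply in the question, so it can be compared row by row.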

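Remark-1 (solving the normal equation)

simplelm is based on solving the normal equation via Cholesky factorization. Because the model is so simple, no actual matrix computation is involved. If we need regression with multiple covariates, we can use the lm.chol function defined in another answer of mine.

The normal equation can also be solved by LU factorization. I will not cover that here, but if you are interested, see: "Solving normal equation gives different coefficients from using lm?".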
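To spell out why no matrix computation is needed, here is the algebra behind simplelm written out, using the same quantities as the code (xc, yc, xtx, xty, sigma2). After centring, the normal equation is a 1 x 1 system, its Cholesky factor is just the square root of xtx, and the solve reduces to a scalar division:

```latex
\[
x_c = x - \bar{x}, \qquad y_c = y - \bar{y}, \qquad
(x_c^\top x_c)\,\hat\beta = x_c^\top y_c
\;\Longrightarrow\;
\hat\beta = \frac{x_c^\top y_c}{x_c^\top x_c}
\]
\[
\hat\sigma^2 = \frac{\lVert y_c - x_c \hat\beta \rVert^2}{n - 2}, \qquad
\operatorname{se}(\hat\beta) = \sqrt{\frac{\hat\sigma^2}{x_c^\top x_c}}, \qquad
t = \frac{\hat\beta}{\operatorname{se}(\hat\beta)} \sim t_{n-2}
\]
```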

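Remark-2 (alternative via `cor.test`)

simplelm is an extension of fastsim from my answer to "Monte Carlo simulation of correlation between two Brownian motion (continuous random walk)". An alternative is based on cor.test. It is also much faster than lm + summary.lm but, as shown in that answer, it is still slower than my proposal above.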
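A minimal sketch of the cor.test route, for illustration only: the t statistic and p-value of the Pearson correlation test coincide with those of the slope in y ~ x, and the slope itself is the correlation rescaled by sd(y) / sd(x). This is not the code from the linked answer, and the standard error is simply backed out as slope / t.

```r
## same toy data as above
set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

## Pearson correlation test (the default method of cor.test)
ct <- cor.test(x, y)

## slope = correlation * sd(y) / sd(x)
slope <- unname(ct$estimate) * sd(y) / sd(x)
## the t statistic of the correlation test equals that of the slope
tscore <- unname(ct$statistic)
## back out the standard error from slope and t statistic
slope_se <- slope / tscore

c("Estimate" = slope, "Std. Error" = slope_se,
  "t value" = tscore, "Pr(>|t|)" = ct$p.value)
```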

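Remark-3 (alternative via the QR method)

A QR-based method is also possible, in which case we want to use .lm.fit, a lightweight wrapper for qr.default, qr.coef, qr.fitted and qr.resid at C level. Here is how we can add this option to our simplelm: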

## fast estimation of simple linear regression: y ~ x 
simplelm <- function (x, y, QR = FALSE) {
  ## number of data
  n <- length(x)
  ## centring
  y0 <- sum(y) / length(y); yc <- y - y0
  x0 <- sum(x) / length(x); xc <- x - x0
  ## fitting intercept free model: yc ~ xc + 0
  if (QR) {
    fit <- .lm.fit(matrix(xc), yc)
    slope <- fit$coefficients
    rc <- fit$residuals
    } else {
    xty <- c(crossprod(xc, yc))
    xtx <- c(crossprod(xc))
    slope <- xty / xtx
    rc <- yc - xc * slope
    }
  ## Pearson estimate of residual variance
  sigma2 <- c(crossprod(rc)) / (n - 2)
  ## standard error for slope
  if (QR) {
    slope_se <- sqrt(sigma2) / abs(fit$qr[1])
    } else {
    slope_se <- sqrt(sigma2 / xtx)
    }
  ## t-score and p-value for slope
  tscore <- slope / slope_se
  pvalue <- 2 * pt(abs(tscore), n - 2, lower.tail = FALSE)
  ## return estimation summary for slope
  c("Estimate" = slope, "Std. Error" = slope_se, "t value" = tscore, "Pr(>|t|)" = pvalue)
  }

For our toy data, both the QR method and the Cholesky method give the same result:

set.seed(0)
x <- runif(50)
y <- 0.3 * x + 0.1 + rnorm(50, sd = 0.05)

simplelm(x, y, TRUE)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15 

simplelm(x, y, FALSE)

#    Estimate   Std. Error      t value     Pr(>|t|) 
#2.656737e-01 2.279663e-02 1.165408e+01 1.337380e-15

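The QR method is known to be 2 to 3 times slower than the Cholesky method (see my answer to "Why the built-in lm function is so slow in R?" for a detailed explanation). Here is a quick check: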

set.seed(0)
x <- runif(10000)
y <- 0.3 * x + 0.1 + rnorm(10000, sd = 0.05)

library(microbenchmark)

microbenchmark(simplelm(x, y, TRUE), simplelm(x, y))

#Unit: microseconds
#                 expr    min     lq      mean median     uq     max neval cld
# simplelm(x, y, TRUE) 776.88 873.26 1073.1944 908.72 933.82 3420.92   100   b
#       simplelm(x, y) 238.32 292.02  441.9292 310.44 319.32 3515.08   100  a

So indeed, 908 / 310 = 2.93 (ratio of median times).
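The line slope_se <- sqrt(sigma2) / abs(fit$qr[1]) in the QR branch may look different from the Cholesky branch, but the two are the same quantity. With a single centred column, the R factor is a scalar, and fit$qr[1] holds that scalar (its sign is arbitrary, hence the abs):

```latex
\[
x_c = Q R, \qquad Q^\top Q = 1, \qquad R = R_{11} \in \mathbb{R}
\;\Longrightarrow\;
x_c^\top x_c = R_{11}^2
\]
\[
\operatorname{se}(\hat\beta)
= \sqrt{\frac{\hat\sigma^2}{x_c^\top x_c}}
= \frac{\hat\sigma}{\lvert R_{11} \rvert}
\]
```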

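Remark-4 (simple regression for GLM)

If we move on to GLM, there is also a fast, lightweight version based on glm.fit. You can read my answer to "R loop help: leave out one observation and run glm one variable at a time" and use the function f defined there. At the moment f is customized for logistic regression, but it can easily be generalized to other responses.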
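The function f from that answer is not reproduced here. As a rough sketch of the idea only, the hypothetical helper simpleglm below calls glm.fit directly and builds the slope's standard error from the R factor of the final weighted QR, the same way summary.glm does. It assumes a family with dispersion fixed at 1 (binomial or poisson) and a full-rank two-column design, so no pivoting occurs.

```r
## sketch: fast simple logistic regression y ~ x via glm.fit
## (simpleglm is a hypothetical helper, not the `f` from the linked answer)
simpleglm <- function (x, y, family = binomial()) {
  ## design matrix with an explicit intercept column
  fit <- glm.fit(cbind(1, x), y, family = family)
  ## unscaled covariance (X'WX)^{-1} from the upper triangular R factor
  covmat <- chol2inv(fit$qr$qr[1:2, 1:2])
  slope <- unname(fit$coefficients[2])
  ## dispersion is assumed to be 1 (binomial / poisson)
  slope_se <- sqrt(covmat[2, 2])
  zscore <- slope / slope_se
  pvalue <- 2 * pnorm(abs(zscore), lower.tail = FALSE)
  c("Estimate" = slope, "Std. Error" = slope_se,
    "z value" = zscore, "Pr(>|z|)" = pvalue)
  }

## toy binary data
set.seed(0)
x <- runif(200)
y <- rbinom(200, 1, plogis(-1 + 2 * x))
simpleglm(x, y)
```

The output can be compared with the second row of coef(summary(glm(y ~ x, family = binomial()))).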
