

Out of memory when using `outer` in solving my big normal equation for least squares estimation

Consider the following example in R:

x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- t(g) %*% g
gginv <- solve(gg)
bigmatrix <- outer(x1, x2, "<=")
Gw <- t(g) %*% bigmatrix
beta <- gginv %*% Gw
w1 <- bigmatrix - g %*% beta

If I try to run this on my computer, it throws a memory error (because bigmatrix is too big).

Do you know how I can achieve the same result without running into this problem?

This is a least squares problem with 100,000 responses. Your bigmatrix is the response (matrix), beta is the coefficient (matrix), while w1 is the residual (matrix).

bigmatrix, as well as w1, if formed explicitly, will each cost

(100,000 * 100,000 * 8) / (1024 ^ 3) = 74.5 GB

This is far too large.
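As a sanity check, the arithmetic above can be reproduced directly in R (a double-precision value takes 8 bytes):

```r
## memory (in GiB) needed for an n x n double-precision matrix
n <- 1e5
n * n * 8 / 1024^3    ## 74.50581
```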

As estimation for each response is independent, there is really no need to form bigmatrix in one go and try to store it in RAM. We can just form it tile by tile, using an iterative procedure: form a tile, use it, then discard it. For example, the code below uses a tile of dimension 100,000 * 2,000, with memory size:

(100,000 * 2,000 * 8) / (1024 ^ 3) = 1.5 GB

With such an iterative procedure, memory usage is kept effectively under control.

x1 <- rnorm(100000)
x2 <- rnorm(100000)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- crossprod(g)    ## don't use `t(g) %*% g`
## we also don't explicitly form `gg` inverse

## initialize `beta` matrix (4 coefficients for each of 100,000 responses)
beta <- matrix(0, 4, 100000)

## we split 100,000 columns into 50 tiles, each with 2000 columns
for (i in 1:50) {
   start <- 2000 * (i-1) + 1    ## tile start
   end <- 2000 * i    ## tile end
   bigmatrix <- outer(x1, x2[start:end], "<=")
   Gw <- crossprod(g, bigmatrix)    ## don't use `t(g) %*% bigmatrix`
   beta[, start:end] <- solve(gg, Gw)
}
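To see that the tiled procedure is exact and not an approximation, here is a self-contained check on a reduced problem (n = 1000, 10 tiles of 100 columns, sizes chosen here just for illustration), comparing against the direct all-at-once computation:

```r
set.seed(1)
n <- 1000; tile <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- crossprod(g)

## tiled solve, as in the loop above
beta_tiled <- matrix(0, 4, n)
for (i in 1:(n / tile)) {
  cols <- (tile * (i - 1) + 1):(tile * i)
  bm <- outer(x1, x2[cols], "<=")
  beta_tiled[, cols] <- solve(gg, crossprod(g, bm))
}

## direct solve (feasible only because n is small here)
beta_direct <- solve(gg, crossprod(g, outer(x1, x2, "<=")))
all.equal(beta_tiled, beta_direct)    ## TRUE
```

Each tile solves the same normal equations against the same gg, so the results agree to machine precision.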

Note, don't try to compute the residual matrix w1, as it will also cost 74.5 GB. If you need the residuals in later work, you should again break the computation into tiles and process them one by one.
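For instance, if what you ultimately need is a per-response summary of the residuals, you can accumulate it tile by tile without ever holding the full residual matrix. A minimal self-contained sketch (on a reduced problem, n = 1000, and using residual sum of squares as an example summary; adapt the summary to your actual use):

```r
set.seed(1)
n <- 1000; tile <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
g <- cbind(x1, x2, x1^2, x2^2)
gg <- crossprod(g)

rss <- numeric(n)    ## residual sum of squares, one value per response
for (i in 1:(n / tile)) {
  cols <- (tile * (i - 1) + 1):(tile * i)
  bm <- outer(x1, x2[cols], "<=")
  beta_tile <- solve(gg, crossprod(g, bm))
  resid <- bm - g %*% beta_tile    ## residual tile; discarded next iteration
  rss[cols] <- colSums(resid^2)
}
```

Only one residual tile exists at any time, so peak memory stays at the tile size rather than the full 74.5 GB.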

You don't need to worry about the loop here. The computation inside each iteration is costly enough to amortize the looping overhead.
