Running random effect model with mgcv gam takes too much memory
I am working on a model that includes several REs and a spline for one of the variables, so I am trying to use gam(). However, I hit a memory-exhaustion error (even when I run it on a cluster with 128 GB). This happens even when I run the simplest of models with just one RE. The same models (minus the spline) run smoothly in just a few seconds (or minutes for the full model) when I use lmer() instead.

I was wondering if anyone had any idea why the discrepancy between gam() and lmer(), and any potential solutions.

Here's some code with simulated data and the simplest of models:
library(mgcv)
library(lme4)
set.seed(1234)
person_n <- 38000 # number of people (grouping variable)
n_j <- 15 # number of data points per person
B1 <- 3 # beta for the main predictor
n <- person_n * n_j
person_id <- gl(person_n, k = n_j) #creating the grouping variable
person_RE <- rep(rnorm(person_n), each = n_j) # creating the random effects
x <- rnorm(n) # creating x as a normal dist centered at 0 and sd = 1
error <- rnorm(n)
#putting it all together
y <- B1 * x + person_RE + error
dat <- data.frame(y, person_id, x)
m1 <- lmer(y ~ x + (1 | person_id), data = dat)
g1 <- gam(y ~ x + s(person_id, bs = "re"), method = "REML", data = dat)
m1 runs in just a couple of seconds on my computer, whereas g1 hits the error:

Error: vector memory exhausted (limit reached?)
From ?mgcv::random.effects:

gam can be slow for fitting models with large numbers of random effects, because it does not exploit the sparsity that is often a feature of parametric random effects ... However 'gam' is often faster and more reliable than 'gamm' or 'gamm4', when the number of random effects is modest. [emphasis added]
What this means is that in the course of setting up the model, s(., bs = "re") tries to generate a dense model matrix equivalent to model.matrix(~ person_id - 1); this takes (nrows x nlevels x 8 bytes/double) = (5.7e5 * 3.8e4 * 8)/2^30 = 161.4 GB (which is exactly the object size that my machine reports it can't allocate).
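The arithmetic can be checked directly in R, using the dimensions from the simulation above:

```r
nrows   <- 38000 * 15   # 570,000 observations (person_n * n_j)
nlevels <- 38000        # one dummy column per level of person_id
## dense double-precision matrix: rows * cols * 8 bytes, converted to GiB
nrows * nlevels * 8 / 2^30   # ~161.4
```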
Check out mgcv::gamm and gamm4::gamm4 for more memory-efficient (and faster, in this case) methods...