GLM object in R takes more RAM than the object size of the GLM object
I am trying to save multiple GLM objects in a list. One GLM object is trained on a large dataset, but the size of the object is reduced by setting all the unnecessary data in the GLM object to NULL. The problem is that I get RAM issues because R reserves much more RAM than the size of the GLM object. Does someone know why this problem occurs and how I can solve it? On top of this, saving the object results in a file larger than the object size.
Example:
> glm_full <- glm(formula = formule , data = dataset, family = binomial(), model = F, y = F)
> glm_full$data <- glm_full$model <- glm_full$residuals <- glm_full$fitted.values <- glm_full$effects <- glm_full$qr$qr <- glm_full$linear.predictors <- glm_full$weights <- glm_full$prior.weights <- glm_full$y <- NULL
> rm(list= ls()[!(ls() %in% c('glm_full'))])
> object.size(glm_full)
172040 bytes
> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    944802  50.5    3677981  196.5   3862545  206.3
Vcells  83600126 637.9  503881514 3844.4 629722059 4804.4
> rm(glm_full)
> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    944208  50.5    2942384  157.2   3862545  206.3
Vcells   4474439  34.2  403105211 3075.5 629722059 4804.4
Here you can see that R reserves RAM for the GLM object; keeping multiple GLM objects in the environment leads to out-of-RAM problems.
A rough explanation for this is that glm hides pointers to the environment, and to things from the environment, deep down inside of the glm object (and in numerous places).
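You can see this directly: the environment in which the model formula was created is stored inside the glm object, so any large objects in that environment stay reachable (and therefore in RAM) even after the model's own fields are nulled out. A minimal sketch; the helper function and variable names here are only for illustration:

```r
make_model <- function() {
  big <- rnorm(1e6)   # large local object, not used by the model at all
  df  <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))
  glm(y ~ x, data = df, family = binomial())
}

m   <- make_model()
env <- environment(m$formula)   # the environment captured by the formula
exists("big", envir = env)      # TRUE: 'big' is still reachable through m
```

object.size(m) does not count what hangs off that environment, which is why the reported object size and the RAM actually held can differ so much.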
What do you need to be able to do with your glm? Even though you've nulled out a lot of the "fat" of the model, your object size will still grow linearly with your data size, and when you compound that by storing multiple glm objects, bumping up against RAM limitations is an obvious concern.
Here is a function that will allow you to slice away pretty much everything that is non-essential, and the best part is that the glm object size will remain constant regardless of how large your data gets.
stripGlmLR = function(cm) {
cm$y = c()
cm$model = c()
cm$residuals = c()
cm$fitted.values = c()
cm$effects = c()
cm$qr$qr = c()
cm$linear.predictors = c()
cm$weights = c()
cm$prior.weights = c()
cm$data = c()
cm$family$variance = c()
cm$family$dev.resids = c()
cm$family$aic = c()
cm$family$validmu = c()
cm$family$simulate = c()
attr(cm$terms,".Environment") = c()
attr(cm$formula,".Environment") = c()
cm
}
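For example, a quick sanity check on a built-in dataset (mtcars is used purely for illustration):

```r
fit       <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
fit_small <- stripGlmLR(fit)

object.size(fit_small)                      # far smaller than object.size(fit)
head(predict(fit_small, newdata = mtcars))  # link-scale predictions still work
```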
Some notes:
You can null out model$family entirely and the predict function will still return its default value (so predict(model, newdata = data) will work). However, predict(model, newdata = data, type = 'response') will fail. You can recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). (Not sure about type = 'terms'.)
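A small self-contained check of that recovery step, using an unstripped model so the two results can be compared (base R's plogis is exactly this sigmoid, but it is written out here to match the formula above):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))   # inverse of the logit link

fit       <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
link_pred <- predict(fit, newdata = mtcars)          # default: link scale
resp_pred <- sigmoid(link_pred)                      # recovered response scale

all.equal(unname(resp_pred),
          unname(predict(fit, newdata = mtcars, type = "response")))  # TRUE
```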
Most importantly, any of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version (so summary(), anova(), and step() are all a no-go). Thus, you'd be wise to extract all of this info from your glm object before running the stripGlmLR function.
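A hypothetical workflow along those lines: capture whatever you need up front, then strip. The list name and chosen extractions are just examples:

```r
fit <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())

# Extract anything you'll need later, since these calls fail after stripping
kept <- list(
  coefs   = coef(fit),
  summary = summary(fit),
  aic     = AIC(fit)
)

fit <- stripGlmLR(fit)   # now cheap to keep many of these in a list
```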
CREDIT: Nina Zumel for an awesome analysis on glm object memory allocation.