GLM object in R takes more RAM than the object size of the GLM object
I am trying to save multiple GLM objects in a list. One GLM object is trained on a large dataset, but the size of the object is reduced by setting all the unnecessary data in the GLM object to NULL. The problem is that I get RAM issues because R reserves much more RAM than the size of the GLM object. Does someone know why this problem occurs and how I can solve it? On top of this, saving the object results in a file larger than the object size.
Example:
> glm_full <- glm(formula = formule , data = dataset, family = binomial(), model = F, y = F)
> glm_full$data <- glm_full$model <- glm_full$residuals <- glm_full$fitted.values <- glm_full$effects <- glm_full$qr$qr <- glm_full$linear.predictors <- glm_full$weights <- glm_full$prior.weights <- glm_full$y <- NULL
> rm(list= ls()[!(ls() %in% c('glm_full'))])
> object.size(glm_full)
172040 bytes
> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    944802  50.5    3677981  196.5   3862545  206.3
Vcells  83600126 637.9  503881514 3844.4 629722059 4804.4
> rm(glm_full)
> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    944208  50.5    2942384  157.2   3862545  206.3
Vcells   4474439  34.2  403105211 3075.5 629722059 4804.4
Here you can see that R reserves RAM for the GLM object; keeping multiple GLM objects in the environment leads to out-of-RAM problems.
A rough explanation for this is that glm hides pointers to the environment, and to things from the environment, deep down inside of the glm object (and in numerous places).
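You can see this directly: the environment in which the model formula was created is stored inside the glm object, so any large objects in that environment stay reachable (and therefore in RAM) even after the model's own fields are nulled out. A minimal sketch; the helper function and variable names here are only for illustration:

```r
make_model <- function() {
  big <- rnorm(1e6)   # large local object, not used by the model at all
  df  <- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))
  glm(y ~ x, data = df, family = binomial())
}

m   <- make_model()
env <- environment(m$formula)   # the environment captured by the formula
exists("big", envir = env)      # TRUE: 'big' is still reachable through m
```

object.size(m) does not count what hangs off that environment, which is why the reported object size and the RAM actually held can differ so much.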
What do you need to be able to do with your glm? Even though you've nulled out a lot of the "fat" of the model, your object size will still grow linearly with your data size, and when you compound that by storing multiple glm objects, bumping up against RAM limitations is an obvious concern.
Here is a function that will allow you to slice away pretty much everything that is non-essential, and the best part is that the glm object size will remain constant regardless of how large your data gets.
stripGlmLR = function(cm) {
cm$y = c()
cm$model = c()
cm$residuals = c()
cm$fitted.values = c()
cm$effects = c()
cm$qr$qr = c()
cm$linear.predictors = c()
cm$weights = c()
cm$prior.weights = c()
cm$data = c()
cm$family$variance = c()
cm$family$dev.resids = c()
cm$family$aic = c()
cm$family$validmu = c()
cm$family$simulate = c()
attr(cm$terms,".Environment") = c()
attr(cm$formula,".Environment") = c()
cm
}
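For example, a quick sanity check on a built-in dataset (mtcars is used purely for illustration):

```r
fit       <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
fit_small <- stripGlmLR(fit)

object.size(fit_small)                      # far smaller than object.size(fit)
head(predict(fit_small, newdata = mtcars))  # link-scale predictions still work
```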
Some notes:
You can null out model$family entirely and the predict function will still return its default value (so predict(model, newdata = data) will work). However, predict(model, newdata = data, type = 'response') will fail. You can recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). (Not sure about type = 'terms'.)
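A small self-contained check of that recovery step, using an unstripped model so the two results can be compared (base R's plogis is exactly this sigmoid, but it is written out here to match the formula above):

```r
sigmoid <- function(x) 1 / (1 + exp(-x))   # inverse of the logit link

fit       <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
link_pred <- predict(fit, newdata = mtcars)          # default: link scale
resp_pred <- sigmoid(link_pred)                      # recovered response scale

all.equal(unname(resp_pred),
          unname(predict(fit, newdata = mtcars, type = "response")))  # TRUE
```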
Most importantly, any of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version (so summary(), anova(), and step() are all a no-go). Thus, you'd be wise to extract all of this info from your glm object before running the stripGlmLR function.
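A hypothetical workflow along those lines: capture whatever you need up front, then strip. The list name and chosen extractions are just examples:

```r
fit <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())

# Extract anything you'll need later, since these calls fail after stripping
kept <- list(
  coefs   = coef(fit),
  summary = summary(fit),
  aic     = AIC(fit)
)

fit <- stripGlmLR(fit)   # now cheap to keep many of these in a list
```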
CREDIT: Nina Zumel for an awesome analysis on glm object memory allocation.