Saving space when storing a fitted gam object (mgcv::gam and scam::scam)

Question

I am estimating a very simple model on a large dataset. The formula looks like:

 scam::scam(formula = ratio ~ s(rate,bs="mpi"))

These models are then used to generate predictions for new data. I do not care about anything else in the model.

My problem is that the returned object is huge (a few GB), which tends to cause problems downstream.

I believe this is because scam and gam store the fitted values for each of the millions of records.

Is there a way to save only a small object containing the minimum required to predict on new data? It should not be bigger than a few kilobytes.

Huge thanks!

Edit 1: Here is a reproducible example to show my understanding of Gavin's answer:

library(mgcv)
data(iris)
library(tidyverse)
mydb <- iris %>% filter(Species == "setosa")

dim(mydb) # 50 records
model <-  mgcv::gam(formula = Sepal.Length ~ s(Sepal.Width,bs="cs"), 
                     data  = mydb)

print(object.size(model), units = "KB") # 78 KB

distinct_mydb <- mydb %>% distinct(Sepal.Width) # 16 distinct values for the independent variables
Xp <- predict(model, newdata= distinct_mydb, type = "lpmatrix")
coefs <- coef(model)
dim(Xp) # 16 records and 10 columns (one for each of the 10 knots of the spline?)
preds1 <- Xp %*% coefs %>% t()  
preds2 <- predict(model, newdata= distinct_mydb)  # preds 1 and preds2 are identical

print(object.size(Xp), units = "KB")   # 3.4 Kb
print(object.size(coefs), units = "KB") # 1.1 Kb

In this solution, I would save "Xp" (3.4 Kb) and "coefs" (1.1 Kb), for a total of 4.5 Kb, instead of saving "model", which takes up 78 Kb.

What I am unsure about is how I could use Xp and coefs next week to predict the Sepal.Length of a flower with a never-seen-before Sepal.Width of 2.5.

Edit 2: Is the answer simply to generate a grid of all possible Sepal.Width values (rounded to some number of decimals) and left_join this table with any future data?

fake_db <- data.frame(Sepal.Width = seq(0,max(mydb$Sepal.Width), by = 0.1))
fake_db$predicted_Sepal.Length = predict(model, newdata =  fake_db)
print(object.size(fake_db), units = "KB") # 4.3 Kb

Answer 1

Look at ?mgcv:::predict.gam and the documentation for the type argument, in particular "lpmatrix".

For example, you only need the coefficient vector and the output from

predict(model, newdata, type = "lpmatrix")

where newdata is a much smaller subset of your original data that still covers the ranges of the covariates.
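As a minimal sketch of that idea, reusing the iris model from the question: the grid size of 50 and the names `grid`, `small`, and `path` are illustrative choices, not anything prescribed by mgcv. Only the grid, its "lpmatrix", and the coefficient vector are written to disk.

```r
library(mgcv)

# Same toy model as in the question.
mydb  <- subset(iris, Species == "setosa")
model <- gam(Sepal.Length ~ s(Sepal.Width, bs = "cs"), data = mydb)

# A regular grid spanning the observed covariate range
# (50 points is an arbitrary, illustrative choice).
grid <- data.frame(
  Sepal.Width = seq(min(mydb$Sepal.Width), max(mydb$Sepal.Width),
                    length.out = 50)
)

# The small objects to keep instead of the fitted model.
small <- list(
  grid  = grid,
  Xp    = predict(model, newdata = grid, type = "lpmatrix"),
  coefs = coef(model)
)

# Persist only the small list, then read it back as a later session would.
path <- tempfile(fileext = ".rds")
saveRDS(small, path)
restored <- readRDS(path)

# Predictions on the grid, computed without touching `model`.
fit_small <- unname(drop(restored$Xp %*% restored$coefs))
fit_full  <- as.numeric(predict(model, newdata = grid))

print(object.size(model), units = "Kb")
print(object.size(small), units = "Kb")
```

On a 50-row dataset the saving is modest. The point is that `small` grows with the grid size and the number of coefficients, not with the number of training records.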

The "lpmatrix" option is designed for use downstream or outside of R. The general idea is that, given the "lpmatrix" output as Xp, Xp %*% coef(model) gives the fitted values (on the scale of the linear predictor). Because you can shrink Xp through newdata, you can reduce the size of the object needed for prediction.
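To predict at a covariate value that is not a row of Xp (such as the Sepal.Width of 2.5 in the question), one option is to interpolate the columns of Xp between grid points and then multiply by the coefficients. The sketch below uses linear interpolation via `approx()`. The helper `predict_from_grid` and the grid size of 200 are hypothetical choices for illustration, and the result is an approximation whose accuracy depends on how fine the grid is. It applies to a single-smooth mgcv::gam with an identity link; it has not been checked against scam::scam, and with a non-identity link you would still need to apply the inverse link.

```r
library(mgcv)

mydb  <- subset(iris, Species == "setosa")
model <- gam(Sepal.Length ~ s(Sepal.Width, bs = "cs"), data = mydb)

# Small objects to keep: a fine covariate grid, its lpmatrix, the coefficients.
grid_x <- seq(min(mydb$Sepal.Width), max(mydb$Sepal.Width), length.out = 200)
Xp     <- predict(model, newdata = data.frame(Sepal.Width = grid_x),
                  type = "lpmatrix")
coefs  <- coef(model)

# Hypothetical helper: linearly interpolate every column of Xp at x_new,
# then form the linear predictor. Values outside the grid give NA.
predict_from_grid <- function(x_new, grid_x, Xp, coefs) {
  X_new <- apply(Xp, 2, function(col) approx(grid_x, col, xout = x_new)$y)
  X_new <- matrix(X_new, nrow = length(x_new))
  drop(X_new %*% coefs)
}

# Prediction from the small objects versus the full model.
pred_small <- predict_from_grid(2.5, grid_x, Xp, coefs)
pred_full  <- as.numeric(predict(model,
                                 newdata = data.frame(Sepal.Width = 2.5)))
print(c(small = pred_small, full = pred_full))
```

Returning NA outside the grid range is deliberate here: it flags extrapolation instead of silently reusing the boundary value.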

Solution 1, 2019-01-10 00:57:43
