简体   繁体   中英

Saving space when saving a fitted gam object (mgcv::gam and scam::scam)

I am estimating a very simple model on a large dataset. The formula looks like

 scam::scam(formula = ratio ~ s(rate,bs="mpi"))

These models are then used to generate predictions for new data. I do not care about anything else about the model.

My problem is that the returned object is huge (a few GB), which tends to lead to problems downstream.

I believe this is due to the fact that scam and gam save the fitted values of each of the million of records.

Is there a way to only save a small object containing the minimum required to predict on new data? This should not be bigger than a few kilobytes.

huge thanks!

edit1 : here is a reproducible example to show my understanding of Gavin's answer:

library(mgcv)
data(iris)
library(tidyverse)
mydb <- iris %>% filter(Species == "setosa")

dim(mydb) # 50 records
model <-  mgcv::gam(formula = Sepal.Length ~ s(Sepal.Width,bs="cs"), 
                     data  = mydb)

print(object.size(model), units = "KB") # 78 KB

distinct_mydb <- mydb %>% distinct(Sepal.Width) # 16 distinct values for the independent variables
Xp <- predict(model, newdata= distinct_mydb, type = "lpmatrix")
coefs <- coef(model)
dim(Xp) # 16 records and 10 columns (one for each of the 10 knots of the spline?)
preds1 <- Xp %*% coefs %>% t()  
preds2 <- predict(model, newdata= distinct_mydb)  # preds 1 and preds2 are identical

print(object.size(Xp), units = "KB")   # 3.4 Kb
print(object.size(coefs), units = "KB") # 1.1 Kb

In this solution, I would save "Xp" (3.4 Kb) and "coefs" (1.1Kb) for a total of 4.5 Kb instead of saving "model" which takes up 78 Kb

What I am unsure is how I could use Xp and coefs next week to predict the Sepal.Length of a flower with a never-seen-before Sepal.Width of 2.5 ?

edit2 : Is the answer simply to generate a grid of all possible Sepal.Width (rounded to some decimal) and just left_join this table with any future data?

fake_db <- data.frame(Sepal.Width = seq(0,max(mydb$Sepal.Width), by = 0.1))
fake_db$predicted_Sepal.Length = predict(model, newdata =  fake_db)
print(object.size(fake_db), units = "KB") # 4.3 Kb

Look at ?mgav:::predict.gam and the information for argument type and in particular "lpmatrix" .

For example you only need the coefficient vector and the output from

predict(model, newdata, type = "lpmatrix")`

where newdata is a much smaller subset of your original data but covering the ranges of the covariates.

This option "lpmatrix" is designed for use downstream or outside of R. The general idea is that given "lpmatrix" as Xp then Xp %*% coef(model) gives fitted values. But as you can reduce the size of Xp via newdata you can reduce the dimensionality of the object needed for prediction.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM