File size of tidymodels workflow

I'm trying to adopt tidymodels in my processes, but I'm running into a challenge with saving workflows. A fitted workflow object is many times larger than the data used to build the model, so I end up maxing out memory when I try to apply the workflow to new data. I can't tell whether this is expected or whether I'm missing something.

To make predictions on new data, wouldn't we just need the recipe steps, the model coefficients, and possibly some summary statistics from the training set (e.g., the mean and SD of each predictor for scaling)? Why, then, is the workflow object so large?
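
To make that intuition concrete, here's a self-contained sketch using a plain glm() on iris (the names base_df and base_fit are just for illustration). Even the bare fit is several times larger than the data, because glm() keeps the model frame, the QR decomposition, residuals, fitted values, and a copy of the data alongside the coefficients:

library(lobstr)

#A bare glm() fit stores much more than the coefficients
base_df  <- transform(iris, is_setosa = as.numeric(Species == "setosa"))
base_fit <- glm(is_setosa ~ Sepal.Length + Sepal.Width +
                    Petal.Length + Petal.Width,
                data = base_df, family = binomial)

obj_size(base_df)                                    #size of the data
obj_size(base_fit)                                   #size of the fitted glm
sort(sapply(base_fit, obj_size), decreasing = TRUE)  #which components dominate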

Here's a simple example using the iris data set. I'm trying to follow Julia's example here, but the butchered workflow still ends up being 24x larger than the data itself. I know tidymodels has evolved quickly, so perhaps there's a better approach now? Any suggestions are appreciated!

library(tidyverse)
library(tidymodels)
library(lobstr)
library(butcher)

set.seed(8675309)

#Create an indicator for whether the species is Setosa
df <- iris %>% 
    mutate(is_setosa = factor(Species == "setosa"))

#Split into train/test
df_split <- initial_split(df, prop = 0.80)
df_train <- training(df_split)
df_test <- testing(df_split)

#Create the workflow object
my_workflow <- workflow() %>% 
    #use a logistic regression model using glm
    add_model({
        logistic_reg() %>% 
            set_engine("glm")
    }) %>% 
    #Add the recipe
    add_recipe({
        recipe(is_setosa ~ Sepal.Length + Sepal.Width + 
                   Petal.Length + Petal.Width,
               data = df_train) %>% 
            #Add a few arbitrary transformations
            step_log(Sepal.Length) %>% 
            step_mutate(across(matches("Width"),
                               .fns = ~ as.numeric(.x > quantile(.x, 0.9)),
                               .names = "is_{.col}_top_decile")) %>% 
            step_zv(all_predictors()) %>% 
            step_normalize(all_numeric_predictors())
    })


#Do a final fit using the workflow.
#The model doesn't converge, but that's not the point.
my_fit <- my_workflow %>% 
    last_fit(df_split)

#How big is our data? 8.3kb
size_data <- df %>% 
    obj_size()

#What's the smallest we can make the workflow? 197kb
size_fit <- my_fit %>% 
    extract_workflow() %>% 
    butcher() %>% 
    obj_size()

#What's the ratio of size between the original data and the fit object?
as.numeric(size_fit / size_data)
#The fit object is 24x bigger than our data.  
#Is that the expected result?
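
One way to dig into where those bytes live is to pull out the raw glm object and weigh its components (a sketch building on the objects above; extract_fit_engine() returns the underlying glm, and butcher::weigh() lists each element's size):

#For comparison, the workflow before butchering
my_fit %>% 
    extract_workflow() %>% 
    obj_size()

#Which components of the underlying glm fit are heaviest?
my_fit %>% 
    extract_workflow() %>% 
    extract_fit_engine() %>% 
    weigh()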


#In order to make predictions on future data, 
#we'd save/load the butchered workflow?
my_fit %>% 
    extract_workflow() %>% 
    butcher() %>% 
    write_rds("my_fit.rds")
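
Later, in the scoring session, reloading the butchered workflow and predicting would presumably look like this (a sketch, reusing df_test as a stand-in for future data):

#Read the butchered workflow back in and predict on new data
loaded_workflow <- read_rds("my_fit.rds")

loaded_workflow %>% 
    predict(new_data = df_test, type = "prob")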

It appears this is the expected behavior when using glm() as the model engine. See this GitHub issue for more details; many thanks to Emil & Julia for looking into it.

I switched the model engine from glm to LiblineaR and got a 4x reduction in the file size of the butchered workflow.
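
The swap itself is just a change of model spec on the existing workflow; a rough sketch (the penalty and mixture values are arbitrary assumptions here, since the LiblineaR engine fits regularized models, and the LiblineaR package needs to be installed):

#Swap the engine, keep the same recipe
my_workflow_liblinear <- my_workflow %>% 
    update_model(
        logistic_reg(penalty = 0.01, mixture = 1) %>% 
            set_engine("LiblineaR")
    )

#Fit on the training data and check the butchered size
my_workflow_liblinear %>% 
    fit(data = df_train) %>% 
    butcher() %>% 
    obj_size()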
