Permutation based variable importance (violin) plots for random forest in Tidy models

Question

I have built a random forest model with tidymodels, very similar to what Julia Silge did in this video. I also plan to show variable importance plots based on the permutation method, but I would like to show box plots or violin plots rather than points.

Here is an example, following Julia's code:

Data and Model Building

# DATA
library(tidyverse)
water_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-04/water.csv")

# Data prep
water <- water_raw %>%
  filter(
    country_name == "Sierra Leone",
    lat_deg > 0, lat_deg < 15, lon_deg < 0,
    status_id %in% c("y", "n")
  ) %>%
  mutate(pay = case_when(
    str_detect(pay, "^No") ~ "no",
    str_detect(pay, "^Yes") ~ "yes",
    is.na(pay) ~ pay,
    TRUE ~ "it's complicated"
  )) %>%
  select(-country_name, -status, -report_date) %>%
  mutate_if(is.character, as.factor)


library(tidymodels)

set.seed(123)
water_split <- initial_split(water, strata = status_id)
water_train <- training(water_split)
water_test <- testing(water_split)

set.seed(234)
water_folds <- vfold_cv(water_train, strata = status_id)
water_folds


# Model building
library(themis)
ranger_recipe <-
  recipe(formula = status_id ~ ., data = water_train) %>%
  update_role(row_id, new_role = "id") %>%
  step_unknown(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.03) %>%
  step_impute_linear(install_year) %>%
  step_downsample(status_id)

ranger_spec <-
  rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

ranger_workflow <-
  workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(ranger_spec)

doParallel::registerDoParallel()
set.seed(74403)
ranger_rs <-
  fit_resamples(ranger_workflow,
    resamples = water_folds,
    control = control_resamples(save_pred = TRUE)
  )

Here is Julia's VIP code:

library(vip)

imp_data <- ranger_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-row_id)


ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(geom = "point")

(Image: Julia's VIP plot with points)

My Attempt:

ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(pred_wrapper = predict, geom = "boxplot", nsim = 10, keep = TRUE)

However it continues to return this error:

Error: To construct boxplots for permutation-based importance scores you must specify keep = TRUE in the call vi() or vi_permute(). Additionally, you also need to set nsim >= 2.

Because I have done all of those things, I assume my error is with pred_wrapper, but I'm not sure. What am I doing wrong here?

Thanks, y'all!

Answer 1

First, you may be interested in a resampling approach to estimating variable importance, where you yourself control the resampling and what gets extracted.
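As a rough sketch of what such a resampling approach could look like: the code below fits the forest on bootstrap resamples, extracts ranger's permutation importance from each fit, and plots the spread per variable as violins. The toy data (two classes of iris), the names toy, toy_boot, toy_spec and toy_rs, and the helper get_imp are assumptions made for illustration, not part of the original workflow; with the water data you would swap in your own workflow and resamples.

```r
library(tidymodels)

# Toy two-class data standing in for water_train (an assumption for illustration)
toy <- droplevels(subset(iris, Species != "setosa"))

set.seed(345)
toy_boot <- bootstraps(toy, times = 10)

toy_spec <-
  rand_forest(trees = 200) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "permutation")

# Hypothetical helper: pull ranger's importance scores out of one fitted workflow
get_imp <- function(x) {
  imp <- extract_fit_engine(x)$variable.importance
  tibble(Variable = names(imp), Importance = unname(imp))
}

toy_rs <-
  workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(toy_spec) %>%
  fit_resamples(
    resamples = toy_boot,
    control = control_resamples(extract = get_imp)
  )

# One row per variable per resample
imp_tbl <-
  toy_rs %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts)

ggplot(imp_tbl, aes(Importance, reorder(Variable, Importance))) +
  geom_violin() +
  labs(y = NULL)
```

Swapping geom_violin() for geom_boxplot() gives box plots instead. Note that the spread here reflects variation across resamples, which is a different thing from repeating the permutation on a single fit.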

Second, I think something is not working quite right with method = "permute" for a tidymodels model. I can't get it to work on the parsnip fit, but I can get the permutation importance for the underlying ranger model:

library(vip)

imp_data <- ranger_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-row_id)

mod <- ranger::ranger(status_id ~ ., data = imp_data, classification = TRUE)

# Prediction wrapper: return the predicted classes from a ranger fit
pred_fun <- function(object, newdata) {
  predict(object, newdata)$predictions
}

vip(mod, method = "permute",
    train = imp_data, target = "status_id",
    metric = "accuracy", pred_wrapper = pred_fun)

Created on 2022-09-02 with reprex v2.0.2
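To get from the call above to the box or violin plot asked about in the question, one option is to add nsim and keep = TRUE together with method = "permute". A likely reading of the error in the question is that, without method = "permute", vip() uses the importance stored in the model, which has only one score per variable and so nothing to draw a distribution from. The sketch below uses toy data (two classes of iris) so that it runs on its own; toy, toy_mod and toy_pred are illustrative names, and argument details may differ between vip versions.

```r
library(ranger)
library(vip)

# Toy two-class data standing in for imp_data (an assumption for illustration)
toy <- droplevels(subset(iris, Species != "setosa"))

set.seed(2022)
toy_mod <- ranger(Species ~ ., data = toy)

# Return predicted classes so they can be compared with the observed classes
toy_pred <- function(object, newdata) {
  predict(object, newdata)$predictions
}

set.seed(2023)
p <- vip(
  toy_mod,
  method = "permute",
  train = toy,
  target = "Species",
  metric = "accuracy",
  pred_wrapper = toy_pred,
  nsim = 10,       # repeat the permutation 10 times per variable
  keep = TRUE,     # keep the per-repeat scores needed for the plot
  geom = "violin"  # or "boxplot"
)
p
```

For the water data, the same extra arguments would go into the vip() call on mod with train = imp_data and target = "status_id".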

Here is another resource for how to use vip, but you may want to look into using DALEX for permutation variable importance.
