
Permutation based variable importance (violin) plots for random forest in Tidy models

I have built a random forest tidy model very similar to what Julia Silge has done in this video. I also plan to show variable importance plots based on the permutation method; however, I would like to show box plots or violin plots rather than points.

Here is an example, following Julia's code:

Data and Model Building

# DATA
library(tidyverse)
water_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-04/water.csv")

# Data prep
water <- water_raw %>%
  filter(
    country_name == "Sierra Leone",
    lat_deg > 0, lat_deg < 15, lon_deg < 0,
    status_id %in% c("y", "n")
  ) %>%
  mutate(pay = case_when(
    str_detect(pay, "^No") ~ "no",
    str_detect(pay, "^Yes") ~ "yes",
    is.na(pay) ~ pay,
    TRUE ~ "it's complicated"
  )) %>%
  select(-country_name, -status, -report_date) %>%
  mutate_if(is.character, as.factor)


library(tidymodels)

set.seed(123)
water_split <- initial_split(water, strata = status_id)
water_train <- training(water_split)
water_test <- testing(water_split)

set.seed(234)
water_folds <- vfold_cv(water_train, strata = status_id)
water_folds


# Model building
library(themis)
ranger_recipe <-
  recipe(formula = status_id ~ ., data = water_train) %>%
  update_role(row_id, new_role = "id") %>%
  step_unknown(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.03) %>%
  step_impute_linear(install_year) %>%
  step_downsample(status_id)

ranger_spec <-
  rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

ranger_workflow <-
  workflow() %>%
  add_recipe(ranger_recipe) %>%
  add_model(ranger_spec)

doParallel::registerDoParallel()
set.seed(74403)
ranger_rs <-
  fit_resamples(ranger_workflow,
    resamples = water_folds,
    control = control_resamples(save_pred = TRUE)
  )

Here is Julia's VIP code:

library(vip)

imp_data <- ranger_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-row_id)


ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(geom = "point")

[Figure: Julia's VIP with points]

My Attempt:

ranger_spec %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(status_id ~ ., data = imp_data) %>%
  vip(pred_wrapper = predict, geom = "boxplot", nsim = 10, keep = TRUE)

However, it continues to return this error:

Error: To construct boxplots for permutation-based importance scores you must specify keep = TRUE in the call vi() or vi_permute(). Additionally, you also need to set nsim >= 2.

Because I have done all of those things, I assume my error is with pred_wrapper, but I'm not sure. What am I doing wrong here?

Thanks y'all!

First, you may be interested in a resampling approach to estimating variable importance, where you yourself control the resampling and what gets extracted.
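As a rough sketch of what that resampling approach could look like (my own illustration, not code from the linked resource): fit the model on each fold with fit_resamples(), pull ranger's permutation importance out of every fold via the extract argument of control_resamples(), and plot the spread across folds. The toy data and the helper name get_imp are made up so the block runs on its own; with the question's objects you would swap in ranger_workflow (with importance = "permutation" set on the engine) and water_folds.

```r
library(tidymodels)

# Hypothetical toy data standing in for the water data
set.seed(123)
toy <- tibble(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300)) %>%
  mutate(y = factor(if_else(x1 + 0.5 * x2 + rnorm(300) > 0, "y", "n")))

set.seed(234)
folds <- vfold_cv(toy, v = 10, strata = y)

# ranger computes permutation importance while fitting
spec <- rand_forest(trees = 200) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "permutation")

wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(spec)

# Hypothetical helper: called on the fitted workflow of each fold
get_imp <- function(x) {
  imp <- extract_fit_engine(x)$variable.importance
  tibble(Variable = names(imp), Importance = unname(imp))
}

set.seed(345)
rs <- fit_resamples(
  wf,
  resamples = folds,
  control = control_resamples(extract = get_imp)
)

# One row per fold and variable
imp_by_fold <- rs %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts)

# Distribution of importance across folds; geom_boxplot() works the same way
ggplot(imp_by_fold, aes(Importance, reorder(Variable, Importance))) +
  geom_violin() +
  labs(y = NULL)
```

Each violin here summarises one importance value per fold, so the spread reflects fold-to-fold variation rather than repeated permutations of a single fit.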

Second, I think something is not working quite right with method = "permute" for a tidymodels model. I can't get it to work, but I can get the permutation importance for the underlying model:

library(vip)

imp_data <- ranger_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(-row_id)

mod <- ranger::ranger(status_id ~ ., data = imp_data, classification = TRUE)

# Prediction wrapper for vip: returns the predicted classes
pred_fun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}

vip(mod, method = "permute",
    train = imp_data, target = "status_id", 
    metric = "accuracy", pred_wrapper = pred_fun)

Created on 2022-09-02 with reprex v2.0.2
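To get from that point plot to the box or violin plot asked about, the same call can carry nsim, keep, and geom. As far as I can tell, vip() only uses nsim and keep when method = "permute"; with the default method it reads the single importance score stored in the ranger fit, which would explain the error in the question. A minimal self-contained sketch, using a made-up toy data frame and a target column called status in place of imp_data and status_id:

```r
library(ranger)
library(vip)

# Hypothetical toy data standing in for imp_data
set.seed(123)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$status <- factor(ifelse(toy$x1 + 0.5 * toy$x2 + rnorm(n) > 0, "y", "n"))

mod <- ranger(status ~ ., data = toy)

# Prediction wrapper: returns predicted classes, as in the code above
pred_fun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}

set.seed(456)
p <- vip(
  mod,
  method = "permute",        # permutation importance computed by vip itself
  train = toy,
  target = "status",
  metric = "accuracy",
  pred_wrapper = pred_fun,
  nsim = 10,                 # repeat the permutation 10 times per variable
  keep = TRUE,               # keep the individual scores for plotting
  geom = "boxplot"           # or "violin"
)
p
```

With nsim = 10 each box is drawn from ten permutation scores per variable, which is the information the error message says is missing.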

Here is another resource for how to use vip, but you may want to look into using DALEX for permutation variable importance.
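For the DALEX route, a minimal sketch under some assumptions: it uses explain_tidymodels() from the DALEXtra package to wrap a fitted workflow and model_parts() from DALEX to permute each variable B times. The toy data is made up, and the 0/1 encoding of y assumes the explainer predicts the probability of the second factor level ("y" here), which is my understanding of how DALEXtra handles binary classification; treat that as something to check against your own model.

```r
library(tidymodels)
library(DALEX)
library(DALEXtra)

# Hypothetical toy data standing in for the water data
set.seed(123)
toy <- tibble(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300)) %>%
  mutate(y = factor(if_else(x1 + 0.5 * x2 + rnorm(300) > 0, "y", "n")))

rf_fit <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(
    rand_forest(trees = 200) %>%
      set_mode("classification") %>%
      set_engine("ranger")
  ) %>%
  fit(data = toy)

# Assumption: y must be numeric, 1 for the second factor level ("y")
explainer <- explain_tidymodels(
  rf_fit,
  data = toy[, c("x1", "x2", "x3")],
  y = as.integer(toy$y == "y"),
  label = "random forest",
  verbose = FALSE
)

# B = 10 permutation rounds per variable
set.seed(456)
dalex_imp <- model_parts(explainer, B = 10)

# The default plot overlays boxplots of the per-permutation losses
plot(dalex_imp)
```

The result is a data frame with one dropout_loss value per variable and permutation, so it can also be passed to ggplot2 directly if you prefer a violin plot.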
