Sign of vip variable importance is the opposite of expected with glmnet / tidymodels

Question

I am using a lasso regression to classify some text as either related to AI or not. When I calculate variable importance using vip and tidymodels, the sign is the opposite of what I expected: words like "machine", "learning", and "algorithm" have a negative sign.

Apologies for the lack of a reprex, but here is my code:

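library(tidymodels)
library(textrecipes)  # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)       # step_upsample() (older recipes versions shipped it directly)
library(vip)          # vi()
library(doParallel)   # makePSOCKcluster(), registerDoParallel()

fy21_raw %>%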
    sample_n(5)

# A tibble: 5 x 3
#  prog_title     text     artificial_intel
#  <chr>          <chr>    <fct>           
#1 Advanced Batt~ "ABMS l~ not             
#2 Energy Effici~ "This e~ not             
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not 

# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"

set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel) 
budget_train <- training(budget_split)
budget_test  <- testing(budget_split)

set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5) 

budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
    update_role(prog_title, new_role = "id") %>%
    step_tokenize(text) %>%
    step_tokenfilter(text, max_tokens = 1000) %>%
    step_upsample(artificial_intel) %>% # update dv with actual name
    step_tfidf(text) %>%
    step_normalize(recipes::all_predictors())

budget_wf <- workflow() %>%
    add_recipe(budget_rec)

lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
    set_mode("classification") %>%
    set_engine("glmnet")

all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(1234)
lasso_res <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit_resamples(resamples = budget_folds,
                  metrics = metric_set(roc_auc, accuracy, sens, spec),
                  control = control_grid(save_pred = TRUE, pkgs = c('textrecipes')))

set.seed(123)
budget_imp <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit(budget_train) %>%
    pull_workflow_fit() %>%
    vi()

# A tibble: 1,000 x 3
#   Variable              Importance Sign 
#   <chr>                      <dbl> <chr>
# 1 tfidf_text_machine        -6.82  NEG  
# 2 tfidf_text_artificial     -5.84  NEG  
# 3 tfidf_text_learning       -3.69  NEG

Is it calculating the importance relative to the "not" outcome rather than "artificial_intel"?

Answer 1

From the glmnet vignette:

Note that for "binomial" models, results are returned only for the class corresponding to the second level of the factor response.
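The behaviour described in that note can be checked on simulated data. The following is only a minimal sketch: the predictor names `signal` and `noise`, the sample size, and the `lambda` value are made up for illustration and have nothing to do with the asker's data.

```r
library(glmnet)

set.seed(1)
n <- 200
# Two hypothetical predictors; only `signal` is related to the outcome
x <- matrix(rnorm(n * 2), ncol = 2,
            dimnames = list(NULL, c("signal", "noise")))

# Large values of `signal` make the "artificial_intel" class more likely
is_ai <- rbinom(n, 1, plogis(2 * x[, "signal"]))
y <- factor(ifelse(is_ai == 1, "artificial_intel", "not"))
levels(y)  # alphabetical order, so "not" is the second level

# Coefficients describe the log-odds of the second level ("not"),
# so the coefficient of `signal` comes out negative
fit <- glmnet(x, y, family = "binomial", lambda = 0.01)
coef_original <- coef(fit)["signal", 1]

# After releveling, "artificial_intel" is the second level and the sign flips
y_releveled <- relevel(y, ref = "not")
fit_releveled <- glmnet(x, y_releveled, family = "binomial", lambda = 0.01)
coef_releveled <- coef(fit_releveled)["signal", 1]

c(original = coef_original, releveled = coef_releveled)
```

The same predictor gets opposite signs depending only on the order of the factor levels, which matches the pattern in the `vi()` output from the question.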

So yes: your factor levels are "artificial_intel" and "not", in that order, so the coefficients (and therefore the signs reported by vi()) are relative to "not". If you want the coefficients to have the expected sign, the positive class must be the second factor level when fitting with glmnet. If you use glmnet together with yardstick, keep in mind that yardstick treats the first factor level as the event by default. Therefore, you need to set options(yardstick.event_first = FALSE).
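As a rough sketch of that advice, the outcome can be releveled with base R so that the class of interest comes second, and yardstick can then be told which level is the event. The toy predictions below are invented for illustration. Note also that, as far as I know, newer yardstick releases deprecate the global `yardstick.event_first` option in favour of the `event_level` argument used here, so check which one applies to your installed version.

```r
library(yardstick)

# Toy outcome with the same two classes as in the question
outcome <- factor(c("artificial_intel", "not", "artificial_intel", "not"))
levels(outcome)  # "artificial_intel" comes first alphabetically

# Make "not" the reference (first) level, so "artificial_intel" is second,
# which is the class glmnet reports coefficients for
outcome_releveled <- relevel(outcome, ref = "not")
levels(outcome_releveled)

# Made-up predictions sharing the releveled factor levels
preds <- data.frame(
  truth    = outcome_releveled,
  estimate = factor(c("artificial_intel", "not", "not", "not"),
                    levels = levels(outcome_releveled))
)

# Tell yardstick that the second level is the event of interest
sens_second <- sens(preds, truth, estimate, event_level = "second")
sens_second
```

In the workflow from the question, the releveling would have to happen on `fy21_raw` before `initial_split()`, so that the recipe, the model fit, and the metrics all see the same level order.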

Answered 2020-11-17 16:58:22