
Variable importance signs from vip are opposite of expected from glmnet / tidymodels

I am using a lasso regression to classify some text as either related to AI or not. When I calculate variable importance using vip and tidymodels, the sign is the opposite of what I expect: words like "machine", "learning", and "algorithm" have a negative sign.

Apologies for the lack of a reprex, but here is my code:

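library(tidymodels)
library(textrecipes)  # step_tokenize(), step_tokenfilter(), step_tfidf()
library(themis)       # step_upsample() in current releases
library(vip)          # vi()
library(doParallel)   # registerDoParallel(); also attaches parallel

fy21_raw %>%
    sample_n(5)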

# A tibble: 5 x 3
#  prog_title     text     artificial_intel
#  <chr>          <chr>    <fct>           
#1 Advanced Batt~ "ABMS l~ not             
#2 Energy Effici~ "This e~ not             
#3 Development o~ "This P~ artificial_intel
#4 Unmanned Logi~ "This U~ artificial_intel
#5 FY 2020 SBIR/~ "Fundin~ not 

# Note: the artificial_intel column is a factor with 2 levels: "artificial_intel" and "not"

set.seed(123)
budget_split <- initial_split(fy21_raw, strata = artificial_intel) 
budget_train <- training(budget_split)
budget_test  <- testing(budget_split)

set.seed(234)
budget_folds <- vfold_cv(budget_train, strata = artificial_intel, v = 5) 

budget_rec <- recipe(artificial_intel ~ ., data = budget_train) %>% # update dv with actual name
    update_role(prog_title, new_role = "id") %>%
    step_tokenize(text) %>%
    step_tokenfilter(text, max_tokens = 1000) %>%
    step_upsample(artificial_intel) %>% # update dv with actual name
    step_tfidf(text) %>%
    step_normalize(recipes::all_predictors())

budget_wf <- workflow() %>%
    add_recipe(budget_rec)

lasso_spec <- logistic_reg(penalty = 0.1, mixture = 1) %>%
    set_mode("classification") %>%
    set_engine("glmnet")

all_cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)

set.seed(1234)
lasso_res <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit_resamples(resamples = budget_folds,
                  metrics = metric_set(roc_auc, accuracy, sens, spec),
                  # control_resamples() is the control function intended for fit_resamples()
                  control = control_resamples(save_pred = TRUE, pkgs = c('textrecipes')))

set.seed(123)
budget_imp <- budget_wf %>%
    add_model(lasso_spec) %>%
    fit(budget_train) %>%
    extract_fit_parsnip() %>% # replaces the deprecated pull_workflow_fit()
    vi()

# A tibble: 1,000 x 3
#   Variable              Importance Sign 
#   <chr>                      <dbl> <chr>
# 1 tfidf_text_machine        -6.82  NEG  
# 2 tfidf_text_artificial     -5.84  NEG  
# 3 tfidf_text_learning       -3.69  NEG

Is it calculating the importance relative to the "not" outcome rather than "artificial_intel"?

From the glmnet vignette:

Note that for "binomial" models, results are returned only for the class corresponding to the second level of the factor response.
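To see what that convention does to coefficient signs, here is a minimal sketch on simulated data. It uses base R's stats::glm as a stand-in so it runs without extra packages; glm likewise treats the first factor level as "failure" and models the other level. The data, the single feature x, and all variable names are assumptions made for illustration, not part of the question's pipeline.

```r
set.seed(1)
n <- 200
x <- rnorm(n)  # hypothetical stand-in for one tf-idf feature, e.g. "machine"

# Simulate labels so that larger x makes "artificial_intel" more likely
p <- plogis(1.5 * x)
label <- ifelse(runif(n) < p, "artificial_intel", "not")

# Default alphabetical level order: "artificial_intel" is first, "not" is second,
# so the model describes the log-odds of "not"
y_default <- factor(label)

# Same labels, but with "artificial_intel" as the second level
y_releveled <- factor(label, levels = c("not", "artificial_intel"))

coef_default   <- coef(glm(y_default ~ x, family = binomial))[["x"]]
coef_releveled <- coef(glm(y_releveled ~ x, family = binomial))[["x"]]

# The two slopes have the same magnitude and opposite signs
print(c(default = coef_default, releveled = coef_releveled))
```

With the default ordering the slope describes the "not" class, which is why a feature that clearly signals AI shows up with a negative sign.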

So if you want the expected coefficient signs, the positive class must be the second factor level when you fit with glmnet. If you use glmnet together with yardstick, keep in mind that yardstick treats the first factor level as the event by default. Therefore, you also need to set options(yardstick.event_first = FALSE).
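Applied to the question's outcome column, the reordering could look roughly like this. The toy data frame fy21_toy is a hypothetical stand-in for fy21_raw, with made-up rows; only the factor handling matters here.

```r
# Hypothetical stand-in for fy21_raw with the same two-level outcome
fy21_toy <- data.frame(
  text = c("machine learning program", "battery development", "logistics funding"),
  artificial_intel = factor(c("artificial_intel", "not", "not"))
)

# Alphabetical default puts "artificial_intel" first
old_levels <- levels(fy21_toy$artificial_intel)

# Make "not" the reference (first) level so "artificial_intel" becomes the second
fy21_toy$artificial_intel <- relevel(fy21_toy$artificial_intel, ref = "not")
new_levels <- levels(fy21_toy$artificial_intel)

print(old_levels)
print(new_levels)
```

Do the same on the real data before initial_split(), so the recipe, the resamples, and the final fit all see the new level order. As far as I know, recent yardstick releases expose the event choice as an event_level = "second" argument on the metric functions rather than the global option, so check which one your installed version expects.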

