glmnet - variable importance?

I'm using the glmnet package to perform a LASSO regression. Is there a way to get the importance of the individual variables that were selected? I thought about ranking the coefficients obtained through the coef(...) command (i.e. the greater the distance from zero, the more important the variable). Would that be a valid approach?

Thanks for your help!

cvfit = cv.glmnet(x, y, family = "binomial")
coef(cvfit, s = "lambda.min")

## 21 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept)  0.14936
## V1           1.32975
## V2           .      
## V3           0.69096
## V4           .      
## V5          -0.83123
## V6           0.53670
## V7           0.02005
## V8           0.33194
## V9           .      
## V10          .      
## V11          0.16239
## V12          .      
## V13          .      
## V14         -1.07081
## V15          .      
## V16          .      
## V17          .      
## V18          .      
## V19          .      
## V20         -1.04341

This is how it is done in the caret package.

To summarize, you can take the absolute value of the final coefficients and rank them. The ranked coefficients are your variable importance.

To view the source code, you can type

caret::getModelInfo("glmnet")$glmnet$varImp

If you don't want to use the caret package, you can run the following lines from the package, and it should work.

varImp <- function(object, lambda = NULL, ...) {

  ## skipping a few lines

  ## coefficients at the requested penalty value
  beta <- predict(object, s = lambda, type = "coef")
  if(is.list(beta)) {
    ## a list of coefficient matrices (e.g. multinomial fits): one column each
    out <- do.call("cbind", lapply(beta, function(x) x[,1]))
    out <- as.data.frame(out, stringsAsFactors = TRUE)
  } else out <- data.frame(Overall = beta[,1])
  ## drop the intercept and take absolute values
  out <- abs(out[rownames(out) != "(Intercept)",,drop = FALSE])
  out
}

Finally, call the function with your fit.

varImp(cvfit, lambda = cvfit$lambda.min)
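The data frame returned by this function is not sorted, so the ranking is a separate step. A minimal sketch of that step, using the non-zero coefficients printed in the question as stand-in values (the object names `imp` and `ranked` are made up for illustration):

```r
# Stand-in for the output of varImp(): absolute values of the non-zero
# coefficients printed in the question (the intercept is already dropped).
imp <- data.frame(
  Overall = abs(c(V1 = 1.32975, V3 = 0.69096, V5 = -0.83123, V6 = 0.53670,
                  V7 = 0.02005, V8 = 0.33194, V11 = 0.16239,
                  V14 = -1.07081, V20 = -1.04341))
)

# Order rows by decreasing importance; drop = FALSE keeps the data frame shape
ranked <- imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE]
print(ranked)
```

With these values V1, V14, V20 and V5 come out on top, in that order.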

Before you compare the magnitudes of the coefficients, you should normalize them by multiplying each coefficient by the standard deviation of the corresponding predictor. This answer has more detail and useful links: https://stats.stackexchange.com/a/211396/34615
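As a rough sketch of that normalization in base R: the helper name `standardize_coefs` and the toy data are assumptions made for illustration, not part of glmnet or caret.

```r
# Hypothetical helper: rescale coefficients fitted on the original scale so
# that each one is expressed per standard deviation of its predictor.
# beta: named numeric vector of coefficients, without the intercept
# x:    numeric matrix of predictors whose column names match names(beta)
standardize_coefs <- function(beta, x) {
  sds <- apply(x[, names(beta), drop = FALSE], 2, sd)
  beta * sds
}

# Toy data (made up): two predictors on very different scales
set.seed(1)
x <- cbind(small = runif(100, 0, 1), large = runif(100, -10000, 10000))
beta <- c(small = 2, large = 0.001)

std <- standardize_coefs(beta, x)
print(sort(abs(std), decreasing = TRUE))
```

On the raw scale `small` has the larger coefficient, but after rescaling `large` ranks first. With a real fit, `beta` could probably be taken from something like `coef(cvfit, s = "lambda.min")[-1, 1]`, assuming the rows of the coefficient matrix carry the same names as the columns of `x`.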

It's pretty easy to use the contents of the cv.glmnet object to create an ordered list of coefficients...

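library(dplyr)  # provides %>% and arrange()

# Non-zero coefficients at lambda.1se, read from the sparse matrix slots
coefList <- coef(cv.glmnet.MOD, s = 'lambda.1se')
coefList <- data.frame(coefList@Dimnames[[1]][coefList@i + 1], coefList@x)
names(coefList) <- c('var', 'val')

# Sort by absolute magnitude, largest first, and show the top 25 rows
coefList %>%
  arrange(desc(abs(val))) %>%
  head(25)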

NOTE: as other posters have commented, to get a like-for-like comparison you need to scale (z-score) your numeric variables prior to the modelling step. Otherwise a large coefficient can be assigned to a variable with a very small scale, e.g. range (0, 1), when it sits in a model alongside variables with very large scales, e.g. range (-10000, 10000). Your comparison of coefficient values is then not relative and therefore meaningless in most contexts.
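A minimal sketch of the z-scoring step with base R's `scale()`, on made-up data. The glmnet call is only run if the package is installed, and the simulated response is purely illustrative. Note that, per the glmnet documentation, the default `standardize = TRUE` standardizes internally but always returns coefficients on the original scale, which is why scaling `x` yourself changes what `coef()` reports.

```r
# Toy predictors (made up) on very different scales
set.seed(42)
x <- cbind(small = runif(200, 0, 1), large = runif(200, -10000, 10000))

# z-score each column: subtract the mean, divide by the standard deviation
x_scaled <- scale(x)
print(round(colMeans(x_scaled), 10))  # column means are ~0
print(apply(x_scaled, 2, sd))         # column standard deviations are 1

# Optional: fit on the scaled predictors so coefficients are per standard deviation
if (requireNamespace("glmnet", quietly = TRUE)) {
  # Simulated binary response, for illustration only
  y <- rbinom(200, 1, plogis(2 * x[, "small"] - 1 + 0.0002 * x[, "large"]))
  cvfit <- glmnet::cv.glmnet(x_scaled, y, family = "binomial")
  print(coef(cvfit, s = "lambda.min"))
}
```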
