
How to capture the most important variables in Bootstrapped models in R?

I have several models, Lasso among them, and I would like to compare which predictors each one selects as important on the same data set. The data set is census data with around a thousand variables, which I have renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then give those variables shorter, more concise names.

My attempt was to extract the top variables from each iterated model, store them in a list, and then take the mean of the top variables over X loops. However, the top 10 most-used predictors still vary, so I cannot rename the variables by hand: each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis, and because the CV creates a new model on every bootstrap.

For a simple example I use mtcars and look for the 3 most common predictors, since this data set has only 10 predictors.

library(glmnet)

data("mtcars") # Base R Dataset
df <- mtcars


topvar <- list()

for (i in 1:100) {
  
  # Bootstrap resample (train) and out-of-bag rows (test)
  
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  
  # Create Model per Loop
 
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2) 
                     
  # Store coefficients per loop
  
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  
  # Store all nonzero Coefficients
  
  topvar[[i]] <- coef_las[which(coef_las != 0)]
  
}

# Unlist 

varimp <- unlist(topvar)

# Count all predictors

novar <- table(names(varimp))

# Find the mean of all variables

meanvar <- tapply(varimp, names(varimp), mean)

# Return top 3 repeated Coefs

repvar <- novar[order(novar, decreasing = TRUE)][1:3]

# Return mean of repeated Coefs

repvar.mean <- meanvar[names(repvar)]

repvar

If you rerun the code chunk above, you will notice that the top 3 variables change. Renaming them is difficult when they are not constant from run to run. Any suggestions on how I could approach this?
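The tallying step at the end of the question's code can be checked in isolation on a small hand-made list. The sketch below is only an illustration: the coefficient values are made up, and `sel_freq` is a hypothetical name for the share of resamples in which each variable was kept.

```r
# Toy stand-in for `topvar`: nonzero coefficients from three resamples.
# These numbers are invented for illustration, not glmnet output.
topvar <- list(
  c(wt = -2.1, hp = -0.02),
  c(wt = -1.8, cyl = -0.5),
  c(wt = -2.4, hp = -0.01, carb = -0.3)
)

# Flatten; the inner names (wt, hp, ...) are preserved.
varimp <- unlist(topvar)

# How many resamples selected each variable.
novar <- table(names(varimp))

# Selection frequency as a proportion of resamples (hypothetical name).
sel_freq <- sort(novar / length(topvar), decreasing = TRUE)

# Mean coefficient per variable, over the resamples that selected it.
meanvar <- tapply(varimp, names(varimp), mean)

sel_freq
meanvar[names(sel_freq)]
```

Reporting a proportion rather than a raw count makes the result comparable when the number of loops changes.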

You can call set.seed() before the resampling so that sample() draws the same samples on every run. For example:

set.seed(123)

When I add this to the code above and run it twice, the following is returned both times:

  wt carb   hp 
  98   89   86
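A minimal sketch of why this works, using only base R. It assumes the seed is set once, before the loop starts; `draw_indices` is a hypothetical helper that mirrors the resampling lines in the question, not part of the original code.

```r
# Hypothetical helper: draw the bootstrap row indices for every iteration,
# the same way the question's loop does (sample with replacement, then unique).
draw_indices <- function(n_rows, n_iter) {
  lapply(seq_len(n_iter), function(i) {
    unique(sample(n_rows, n_rows, replace = TRUE))
  })
}

# Same seed before each run: the random number stream restarts identically.
set.seed(123)
first_run <- draw_indices(nrow(mtcars), 100)

set.seed(123)
second_run <- draw_indices(nrow(mtcars), 100)

# TRUE when both runs drew exactly the same rows in every iteration.
identical(first_run, second_run)
```

Because every resample is identical between runs, the models fitted on them and the resulting counts are identical too. Note that a different seed would generally give a different top 3, so the seed makes the output repeatable rather than making the ranking itself more stable.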
