How to capture the most important variables in bootstrapped models in R?

Question

I have several models, Lasso being one of them, and I would like to compare which predictors each one selects as important on the same data set. The data set consists of census data with around a thousand variables, which have been renamed to "x1", "x2", and so on for convenience (the original names are extremely long). I would like to report the top features and then rename those variables with shorter, more concise names.

My attempt at solving this is to extract the top variables from each iterated model, put them into a list, and then find the mean of the top variables over X loops. However, the top 10 most-used predictors still vary, so I cannot manually rename the variables, because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis, and because CV creates new models on every bootstrap.

For the sake of a simple example I used mtcars and will look for the top 3 most common predictors, since this data set has only 10 predictor variables.

library(glmnet)

data("mtcars") # Base R Dataset
df <- mtcars


topvar <- list()

for (i in 1:100) {
  
  # CV and Splitting
  
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  
  # Create Model per Loop
 
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2) 
                     
  # Store Coefficients per loop
  
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  
  # Store all nonzero Coefficients
  
  topvar[[i]] <- coef_las[which(coef_las != 0)]
  
}

# Unlist 

varimp <- unlist(topvar)

# Count all predictors

novar <- table(names(varimp))

# Find the mean of all variables

meanvar <- tapply(varimp, names(varimp), mean)

# Return top 3 repeated Coefs

repvar <- novar[order(novar, decreasing = TRUE)][1:3]

# Return mean of repeated Coefs

repvar.mean <- meanvar[names(repvar)]

repvar

If you rerun the code chunk above, you will notice that the top 3 variables change, so renaming them is difficult when they differ on every run. Any suggestions on how I could approach this?

Answer 1

You can use the function set.seed() to ensure that sample() returns the same sample each time. For example:

set.seed(123)

When I add this to the above code and then run it twice, the following is returned both times:

  wt carb   hp 
  98   89   86
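
As a rough sketch of the same idea, the loop from the question can be wrapped in a function that fixes the seed once before any resampling. The helper name count_selected and its arguments (n_boot, lambda, seed) are assumptions made for illustration and do not appear in the original post. The sketch also assumes that glmnet itself draws no random numbers when lambda is fixed, so the only source of variation is sample().

```r
library(glmnet)

# Hypothetical helper (not from the original post): repeats the bootstrap
# loop from the question and counts how often each predictor is selected.
count_selected <- function(df, n_boot = 100, lambda = 0.2, seed = 123) {
  set.seed(seed)  # fix the RNG state once, before the first call to sample()

  x <- model.matrix(mpg ~ ., df)[, -1]  # predictor matrix without intercept
  y <- df$mpg

  selected <- vector("list", n_boot)
  for (i in seq_len(n_boot)) {
    # Resample rows with replacement, then drop duplicates as in the question
    ind <- unique(sample(nrow(df), nrow(df), replace = TRUE))

    fit <- glmnet(x[ind, ], y[ind], alpha = 1, lambda = lambda)
    coefs <- coef(fit, s = lambda)[-1, ]  # drop the intercept

    # Keep only the names of predictors with nonzero coefficients
    selected[[i]] <- names(coefs)[coefs != 0]
  }

  # Selection counts, most frequently selected predictors first
  sort(table(unlist(selected)), decreasing = TRUE)
}

run1 <- count_selected(mtcars)
run2 <- count_selected(mtcars)

identical(run1, run2)  # both runs start from the same seed
head(run1, 3)          # the three most frequently selected predictors
```

Note that the seed is set before the loop rather than inside it. Calling set.seed() with the same value inside the loop would make every iteration draw the identical bootstrap sample. A different seed value will generally give different counts, so the ranking is reproducible but still depends on the chosen seed.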

Solution 1
0 2022-03-22 03:28:20
