How to capture the most important variables in bootstrapped models in R?
I have several models, Lasso among them, and I would like to compare which predictors each one selects as important on the same data set. The data set consists of census data with around a thousand variables, which I have renamed to "x1", "x2" and so on for convenience (the original names are extremely long). I would like to report the top features and then give those variables shorter, more concise names.
My attempt is to extract the top variables from each bootstrapped model, put them into a list, and then take the mean coefficient of the most frequently selected variables over X loops. However, the top 10 most frequently selected predictors still vary, so I cannot rename the variables manually because each run of the code chunk yields different results. I suspect this is because I have so many variables in my analysis and because the CV step creates a new model in every bootstrap iteration.
For a simple example I used mtcars and will look for the 3 most common predictors, since this data set has only 10 predictors.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
  # Bootstrap resample and train/test split
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  # Create model per loop
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
  # Store coefficients per loop
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  # Store all nonzero coefficients
  topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
If you rerun the code chunk above, you will notice that the top 3 variables change, so renaming them is difficult when they differ from run to run. Any suggestions on how I could approach this?
You can call set.seed() before the loop so that sample() draws the same resamples on every run. For example:
set.seed(123)
When I add this to the code above and run it twice, the following is returned both times:
wt carb hp
98 89 86
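To make the placement of the seed explicit, here is a minimal sketch that wraps the bootstrap loop from the question in a function and seeds the generator once, before the loop. The helper name count_selected and its arguments (seed, n_boot, lambda) are hypothetical and not part of the original post; the glmnet and coef calls are the same ones the question already uses.

```r
library(glmnet)

# Hypothetical helper (not from the original post): count how often each
# predictor receives a nonzero lasso coefficient across bootstrap resamples.
count_selected <- function(x, y, seed, n_boot = 100, lambda = 0.2) {
  set.seed(seed)  # seed once, before the loop, so the whole run is repeatable
  selected <- character(0)
  for (i in seq_len(n_boot)) {
    # Same resampling scheme as the question: draw with replacement,
    # then keep the unique row indices
    ind <- unique(sample(nrow(x), nrow(x), replace = TRUE))
    fit <- glmnet(x[ind, ], y[ind], alpha = 1, lambda = lambda)
    beta <- coef(fit, s = lambda)[-1, ]  # drop the intercept
    selected <- c(selected, names(beta)[beta != 0])
  }
  sort(table(selected), decreasing = TRUE)
}

x <- model.matrix(mpg ~ ., mtcars)[, -1]
y <- mtcars$mpg

run1 <- count_selected(x, y, seed = 123)
run2 <- count_selected(x, y, seed = 123)
identical(run1, run2)  # the same seed gives the same counts
head(run1, 3)          # the three most frequently selected predictors
```

Note that a fixed seed only makes one particular run repeatable. A different seed draws different resamples and can rank the predictors differently, so calling the helper with a few different seeds, or with a larger n_boot, is one way to see how robust the top set is before renaming anything.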