How can you obtain the same coefficients with different set.seed values in elasticnet in R?

This question was closed on Cross Validated because it focused on programming, so it is better suited here:

I am running an elastic-net logistic regression on my data. I have looked into how to get reproducible coefficients every time I run the same model on the same data, but it does not seem to work. I have tried setting nfolds and foldid, but as soon as I change the value passed to set.seed, the coefficients change.

I understand how cross-validation works and how set.seed can potentially change the whole thing. Some have suggested setting foldid as done in my code, but that does not help in my case once the seed changes.

What are the options for getting the same coefficients on every run, or a statistically sound measure of the model coefficients?

library(caret)
library(tidyverse)
library(glmnet)
library(ROCR)
library(doParallel)
registerDoParallel(cores = 4)

df <- read_csv("data.csv")
View(df)

set.seed(123)

# Random 80/20 split on the outcome V1 (depends on the seed)
training.samples <- df$V1 %>% createDataPartition(p = 0.8, list = FALSE)
train <- df[training.samples, ]
test <- df[-training.samples, ]
x.train <- data.frame(train[, names(train) != "V1"])
x.train <- data.matrix(x.train)
y.train <- train$V1
x.test <- data.frame(test[, names(test) != "V1"])
x.test <- data.matrix(x.test)
y.test <- test$V1

# Random assignment of training rows to 10 folds (also depends on the seed)
foldid <- sample(rep(seq(10), length.out = nrow(train)))

# One cross-validated fit per alpha in 0, 0.1, ..., 1
list.of.fits <- list()
for (i in 0:10){
    fit.name <- paste0("alpha", i/10)
    # nfolds is ignored by cv.glmnet when foldid is supplied
    list.of.fits[[fit.name]] <- cv.glmnet(x.train, y.train, type.measure = "dev",
    alpha = i/10, family = "binomial", nfolds = 10, foldid = foldid, parallel = TRUE)
}

# After the loop fit.name is "alpha1", so this extracts the last fit only
coef <- coef(list.of.fits[[fit.name]], s = list.of.fits[[fit.name]]$lambda.1se)
coef

My output ends up like this:

set.seed(123)

(Intercept) -18.533050
V2          -0.0049142
V3          -0.0013228
V4          -0.0029664
V5           0.0123987
V6           0.1433817
V7           .           
V8          -0.0188888
V9           0.0007504
V10         -0.0626482

set.seed(42)

(Intercept) -22.16271709
V2          -0.005898701
V3          -0.001332854
V4          -0.003506514
V5           0.013343484
V6           0.097911065
V7          -0.269346185
V8          -0.024876785
V9           0.027937690
V10         -0.070759818

There are two different types of reproducibility. One is setting the seed so that you get the same output from cv.glmnet. This is what you have done: setting the seed and providing the folds as input. So if you run the code again with the same seed, it will give you the same results.
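In the question's code, both createDataPartition and sample draw random numbers after set.seed, so a new seed changes the training rows as well as the fold assignment. The sketch below is a hedged illustration of the first kind of reproducibility, not the asker's pipeline: it assumes the Sonar data from mlbench, uses all rows, and builds foldid without any random draw. The helper name fit_with_seed is hypothetical. Under those assumptions the fit should not depend on the seed at all.

```r
library(glmnet)
library(mlbench)

data(Sonar)
x <- as.matrix(Sonar[, 1:60])
y <- Sonar$Class

# Fold assignment built without the random number generator:
# rows are assigned to folds 1..10 cyclically
foldid <- rep(seq(10), length.out = nrow(x))

# Hypothetical helper: set a seed, run cv.glmnet with the fixed folds,
# and return the coefficients at lambda.1se as a plain matrix
fit_with_seed <- function(seed) {
  set.seed(seed)
  fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.5,
                   type.measure = "dev", foldid = foldid)
  as.matrix(coef(fit, s = "lambda.1se"))
}

coef.a <- fit_with_seed(123)
coef.b <- fit_with_seed(42)

# Compare the two coefficient vectors
all.equal(coef.a, coef.b)
```

An equivalent option is to draw foldid (and the train/test split) once, save them with saveRDS, and reload them in later runs instead of redrawing them.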

The second kind is stability under different fold assignments. In your question and example, you are running cv.glmnet with different cross-validation folds and extracting the coefficients at lambda.1se. The cross-validated error for each lambda value will of course differ from one fold assignment to the next, which is inherent to cross-validation, but you would not expect the selected lambdas to be too different.

We can look at this with an example dataset:

library(glmnet)
library(mlbench)
data(Sonar)
set.seed(111)
idx = sample(nrow(Sonar),150)
x.train = as.matrix(Sonar[idx,1:60])
y.train = as.numeric(Sonar$Class)[idx]
x.test = as.matrix(Sonar[-idx,1:60])
y.test = as.numeric(Sonar$Class)[-idx]

Use 5 different seeds and one constant alpha:

o = lapply(1:5, function(i){
  set.seed(i)
  foldid <- sample(rep(seq(10), length.out = nrow(x.train)))
  fit = cv.glmnet(x.train, y.train, type.measure = "dev",
                  alpha = 0.5, nfolds = 10, family = "binomial", foldid = foldid)

  wh = which(fit$lambda == fit$lambda.1se)
  data.frame(seed = i, lambda = fit$lambda.1se, error = fit$cvm[wh],
             hi = fit$cvup[wh], lo = fit$cvlo[wh])
})

Note that lambda.1se does not differ much across seeds, and neither does the error:

do.call(rbind,o)
  seed     lambda    error       hi       lo
1    1 0.08047217 1.020349 1.071294 0.969404
2    2 0.14062741 1.099030 1.148053 1.050007
3    3 0.12813445 1.101091 1.160062 1.042121
4    4 0.11675134 1.104327 1.165262 1.043392
5    5 0.10637948 1.059930 1.114897 1.004962
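The table compares lambda and error, while the question is about the coefficients themselves. As a rough sketch under the same setup (Sonar data, alpha = 0.5, seeds 1 to 5), the coefficients at lambda.1se can be collected per seed to see which variables are selected consistently and how much each estimate moves. The names coefs, selected and spread are illustrative only.

```r
library(glmnet)
library(mlbench)

data(Sonar)
set.seed(111)
idx <- sample(nrow(Sonar), 150)
x.train <- as.matrix(Sonar[idx, 1:60])
y.train <- as.numeric(Sonar$Class)[idx]

# One column of coefficients (intercept + 60 predictors) per seed
coefs <- sapply(1:5, function(i) {
  set.seed(i)
  foldid <- sample(rep(seq(10), length.out = nrow(x.train)))
  fit <- cv.glmnet(x.train, y.train, type.measure = "dev",
                   alpha = 0.5, family = "binomial", foldid = foldid)
  as.matrix(coef(fit, s = "lambda.1se"))[, 1]
})
colnames(coefs) <- paste0("seed", 1:5)

# Share of seeds in which each term is nonzero
selected <- rowMeans(coefs != 0)

# Range (max minus min) of each coefficient across seeds
spread <- apply(coefs, 1, function(b) diff(range(b)))

# Terms that are selected most often, with their spread
head(cbind(selected, spread)[order(-selected), ])
```

Terms that are nonzero for every seed and have a small spread are the ones whose estimates are least sensitive to the fold assignment.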

If your data set is large enough, you will see these lambda and error values get closer to each other. So using any one of these lambdas should be good enough to give you predictions close to the minimum error.
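If the remaining seed-to-seed variation is still a concern, one common way to reduce it is to run cv.glmnet many times on a fixed lambda sequence and average the error curves before choosing lambda. The sketch below assumes the same Sonar setup and alpha = 0.5; n.rep = 20 is an arbitrary choice, and the names cvm.mean and lambda.avg are illustrative. It picks the lambda that minimizes the averaged deviance rather than applying the 1se rule.

```r
library(glmnet)
library(mlbench)

data(Sonar)
set.seed(111)
idx <- sample(nrow(Sonar), 150)
x.train <- as.matrix(Sonar[idx, 1:60])
y.train <- as.numeric(Sonar$Class)[idx]

# Fix the lambda sequence once so that every repetition is
# evaluated on the same grid
full.fit <- glmnet(x.train, y.train, family = "binomial", alpha = 0.5)
lambdas <- full.fit$lambda

n.rep <- 20  # number of repeated cross-validations (arbitrary)
cvm <- vapply(seq_len(n.rep), function(i) {
  set.seed(i)
  foldid <- sample(rep(seq(10), length.out = nrow(x.train)))
  fit <- cv.glmnet(x.train, y.train, type.measure = "dev", alpha = 0.5,
                   family = "binomial", foldid = foldid, lambda = lambdas)
  # Pad with NA in case a run returns a shorter path
  # (assumes any dropped values are at the end of the path)
  out <- rep(NA_real_, length(lambdas))
  out[seq_along(fit$cvm)] <- fit$cvm
  out
}, numeric(length(lambdas)))

# Average the error curves over repetitions and pick the minimizer
cvm.mean <- rowMeans(cvm, na.rm = TRUE)
lambda.avg <- lambdas[which.min(cvm.mean)]

# Coefficients of the full-data fit at the averaged choice of lambda
coef(full.fit, s = lambda.avg)
```

The result still depends on which seeds are used, but with enough repetitions the averaged curve, and therefore the selected lambda, should change much less from run to run than a single cross-validation.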
