
Difference between glmnet() and cv.glmnet() in R?

I'm working on a project that would show the potential influence a group of events has on an outcome. I'm using the glmnet package, specifically its Poisson family. Here's my code:

# de <- data imported from sql connection        
x <- model.matrix(~.,data = de[,2:7])
y <- (de[,1])
reg <- cv.glmnet(x,y, family = "poisson", alpha = 1)
reg1 <- glmnet(x,y, family = "poisson", alpha = 1)

**Co <- coef(?reg or reg1?,s=???)**

summ <- summary(Co)
c <- data.frame(Name= rownames(Co)[summ$i],
       Lambda= summ$x)
c2 <- c[with(c, order(-Lambda)), ]

The beginning imports a large amount of data from my SQL database. I then put it in matrix format and separate the response from the predictors.

This is where I'm confused: I can't figure out exactly what the difference is between the glmnet() function and the cv.glmnet() function. I realize that cv.glmnet() is a k-fold cross-validation of glmnet(), but what exactly does that mean in practical terms? They provide the same value for lambda, but I want to make sure I'm not missing something important about the difference between the two.

I'm also unclear as to why it runs fine when I specify alpha = 1 (supposedly the default), but not if I leave it out.

Thanks in advance!

glmnet() is a function from the R package glmnet, which fits penalized regression models such as ridge and lasso. The alpha argument determines which type of model is fit: when alpha = 0 a ridge model is fit, and when alpha = 1 a lasso model is fit (values in between give the elastic net).
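As a minimal sketch of this (using the built-in mtcars data, so the variable names here are illustrative, not from the question):

```r
library(glmnet)

# Illustrative data: predict mpg from the other mtcars columns
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # drop the intercept column
y <- mtcars$mpg

ridge <- glmnet(x, y, alpha = 0)  # ridge: L2 penalty
lasso <- glmnet(x, y, alpha = 1)  # lasso: L1 penalty (the default alpha)

# Each fit contains a whole path of models, one per lambda value
length(lasso$lambda)
```

Note that a single glmnet() call fits the model over an entire sequence of lambda values; it does not by itself choose one.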

cv.glmnet() performs cross-validation, 10-fold by default, which can be adjusted via the nfolds argument. A 10-fold CV randomly divides your observations into 10 non-overlapping folds of approximately equal size. Each fold in turn is held out as the validation set while the model is fit on the other 9 folds. The bias-variance trade-off is usually the motivation behind using such model-validation methods. In the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda.
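A sketch of that workflow (again with mtcars as stand-in data):

```r
library(glmnet)

x <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

set.seed(1)  # CV fold assignment is random, so fix the seed for reproducibility
cvfit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

cvfit$lambda.min  # lambda with the smallest mean CV error
cvfit$lambda.1se  # largest lambda within one standard error of the minimum
plot(cvfit)       # CV error curve across the lambda path
```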

In your example, you can do plot(reg) or look at reg$lambda.min to see the value of lambda that results in the smallest CV error. You can then derive the test MSE for that value of lambda. By default, glmnet() performs ridge or lasso regression over an automatically selected range of lambda values, which may not give the lowest test MSE. Hope this helps!

Between reg$lambda.min and reg$lambda.1se: lambda.min will give you the lowest CV error, but depending on how much error you can tolerate, you may want to choose reg$lambda.1se instead, as this value further shrinks the number of predictors. You may also choose the mean of reg$lambda.min and reg$lambda.1se as your lambda value.
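Applied to the question's snippet, the coef() call can be completed like this (a sketch; which of lambda.min or lambda.1se is better depends on the error tolerance discussed above, and the column is renamed Coefficient since summ$x holds coefficients, not lambda values):

```r
# Extract coefficients at a CV-chosen lambda; cv.glmnet objects
# accept the strings "lambda.min" and "lambda.1se" for s directly.
Co  <- coef(reg, s = "lambda.min")   # lowest CV error
# Co <- coef(reg, s = "lambda.1se")  # sparser alternative

summ <- summary(Co)  # Co is a sparse matrix; summary() lists its nonzero entries
c <- data.frame(Name = rownames(Co)[summ$i],
                Coefficient = summ$x)
c2 <- c[with(c, order(-Coefficient)), ]
```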
