
Comparing the GLMNET output of R with Python using LogisticRegression()

I am using Logistic Regression with the L1 norm (LASSO).

I have opted to use the glmnet package in R and LogisticRegression() from sklearn.linear_model in Python. From my understanding these should give the same results; however, they do not.

Note that I did not scale my data.

For Python I have used the below link as a reference:

https://chrisalbon.com/machine_learning/logistic_regression/logistic_regression_with_l1_regularization/

and for R I have used the below link:

http://www.sthda.com/english/articles/36-classification-methods-essentials/149-penalized-logistic-regression-essentials-in-r-ridge-lasso-and-elastic-net/

Here is the code used in R:

###################################
#### LASSO LOGISTIC REGRESSION ####
###################################
library(glmnet)

# Design matrix (drop the intercept column) and response
x <- model.matrix(Y ~ ., Train.Data.SubPop)[, -1]
y <- Train.Data.SubPop$Y
lambda_seq <- c(0.0001, 0.01, 0.05, 0.0025)

# Cross-validated LASSO (alpha = 1) logistic regression
cv_output <- cv.glmnet(x, y, alpha = 1, family = "binomial", lambda = lambda_seq)

cv_output$lambda.min

# Refit at the lambda chosen by cross-validation
lasso_best <- glmnet(x, y, alpha = 1, family = "binomial", lambda = cv_output$lambda.min)

Below is my Python code:

from sklearn.linear_model import LogisticRegression

C = [0.001, 0.01, 0.05, 0.0025]

for c in C:
    clf = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    clf.fit(X_train, y_train)
    print('C:', c)
    print('Coefficient of each feature:', clf.coef_)
    # Score on the same (unscaled) data the model was fitted on
    print('Training accuracy:', clf.score(X_train, y_train))
    print('Test accuracy:', clf.score(X_test, y_test))
    print('')

When I extracted the optimal value from the cv.glmnet() function in R, it told me the optimal lambda is 0.0001; however, looking at the analysis from Python, the best accuracy/precision/recall came from 0.05.

I tried to fit the model with 0.05 in R and it gave me only 1 non-zero coefficient, but in Python I had 7.

Can someone please help me understand these discrepancies?

Also, if someone can guide me on how to replicate the Python code in R, it would be very helpful!

At a glance I see several issues:

  1. Typo: Looking at your code, in R your first lambda is 0.0001. In Python, your first C is 0.001.

  2. Different parameterization: Looking at the documentation, I think there's a clue in the names being different: lambda in R and C in Python. In glmnet, a higher lambda means more shrinkage. However, in the sklearn docs, C is described as "the inverse of regularization strength... smaller values specify stronger regularization".
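To make point 2 concrete: glmnet roughly minimizes (1/n)·NLL + lambda·||w||_1 while sklearn's L1 LogisticRegression minimizes NLL + (1/C)·||w||_1, so as a rough sketch (ignoring scaling details) C ≈ 1 / (n · lambda). The sample size n below is a made-up assumption, used only to illustrate the mapping:

```python
# Approximate mapping between glmnet's lambda and sklearn's C, assuming
# glmnet minimizes (1/n)*NLL + lambda*||w||_1 and sklearn minimizes
# NLL + (1/C)*||w||_1. Equating penalty weights gives C ~= 1 / (n * lambda).
n = 1000  # hypothetical number of training rows (an assumption)
lambda_seq = [0.0001, 0.01, 0.05, 0.0025]  # the lambda grid from the R code
C_equiv = [1.0 / (n * lam) for lam in lambda_seq]
print(C_equiv)  # the C values that would roughly correspond to those lambdas
```

So even ignoring the typo, the two grids you searched were not equivalent, and the "best" values chosen on each side are not comparable one-to-one.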

  3. Scaling: You say, "Note that I did not scale my data." This is incorrect. In R, you did: glmnet has a standardize argument for scaling the data, and its default is TRUE. In Python, you didn't.
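To mirror glmnet's standardize = TRUE on the Python side, you can standardize the features before fitting, fitting the scaler on the training data only. A minimal sketch (the array below is made-up data; in your code, your actual X_train would take its place):

```python
# Minimal sketch of standardizing features with scikit-learn, mirroring
# glmnet's standardize = TRUE default. The data below is made up.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on the training data only
# A held-out set would be transformed with the same fitted scaler:
# X_test_std = scaler.transform(X_test)
print(X_train_std.mean(axis=0))  # each column now has mean ~0
print(X_train_std.std(axis=0))   # and standard deviation ~1
```

Whichever choice you make (scaled or unscaled), make it on both sides, and fit and score on the same version of the data.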

  4. Use of cross-validation: In R, you use cv.glmnet to do k-fold cross-validation on your training set. In Python, you use LogisticRegression, not LogisticRegressionCV, so there is no cross-validation. Note that cross-validation relies on random sampling, so if you do use CV in both, you should expect the results to be close, but not exact matches.
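For point 4, LogisticRegressionCV is the closer analogue of cv.glmnet on the Python side. A hedged sketch on synthetic data (the Cs grid and fold count here are illustrative assumptions, not a faithful reproduction of cv.glmnet's defaults):

```python
# Sketch: cross-validated L1 logistic regression with scikit-learn,
# analogous in spirit to R's cv.glmnet. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegressionCV(
    Cs=[0.0025, 0.01, 0.05, 10.0],  # candidate inverse-regularization strengths
    cv=5,                           # 5-fold CV (cv.glmnet defaults to 10 folds)
    penalty='l1',
    solver='liblinear',             # liblinear supports the L1 penalty
    scoring='accuracy',
    random_state=0,
)
clf.fit(X, y)
print('best C:', clf.C_[0])  # the C selected by cross-validation
```

Even with this, expect only approximate agreement with cv.glmnet, since the folds are sampled differently in each implementation.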

There are possibly other issues too.
